How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

How I AI

23 HOURS AGO

Show notes

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs ...

Highlights

This podcast explores how AI agents are transforming the work of senior engineers and technical leaders, moving beyond simple code generation to tackle complex infrastructure and architecture problems. The conversation demystifies the concept of evals, presenting them as a modern, quantifiable version of a product requirements document that allows models to determine the 'how' while engineers focus on the 'what'. The discussion also provides a practical framework for deciding which tasks to delegate to AI agents and how to capture subjective human taste into repeatable, scalable evaluation systems.
00:00
AI agents can outperform human engineers
03:01
Using AI agents and evals to optimize slow database queries
06:13
AI agents enable safe, iterative testing of complex infrastructure changes
09:11
AI agents enable running rigorous benchmarks that even top engineers often skip
11:30
No excuse for lacking rigor or performance
14:00
The 'agent line' framework decides what tasks to delegate
20:14
Regaining flow state through focused coding with agents
20:32
Two camps of people with AI: those having fun and those feeling anxiety
23:06
Evals encode user stories in a quantifiable way
26:02
The agent runs safely within the playground, allowing experimentation without risk
30:21
Vibe checks lead to a whack-a-mole game
32:15
Turning expertise into a system applies it more broadly
33:13
AI produces higher quality outcomes than manual methods
33:45
Remove features that cause confusion
35:40
Improving CI is the highest-leverage way to accelerate engineering velocity with AI
37:32
Close the session, improve the evals, and retry from scratch

Chapters

Introduction to Ankur Goyal
00:00
Using AI agents for database optimization
03:00
Running exhaustive benchmarks with coding agents
06:10
Why staff engineers are wrong about AI limitations
09:03
The “agent line” framework for delegation
11:30
Ankur’s workflow: running 4 to 6 concurrent agents
14:00
Technical setup: foreground agents, background agents, and cloud environments
17:16
Spending time with AI tools
20:32
Demystifying evals
23:06
Live demo: Building an eval for documentation answers
26:02
The alternative to evals: vibe checks and whack-a-mole
30:20
Capturing designer taste in scoring functions
32:09
Quick recap
33:13
Managing velocity and throughput
33:44
Why CI/CD investment is critical for AI-accelerated teams
35:40
Ankur’s prompting strategy when agents fail
37:30
Closing thoughts and how to connect
39:10

Transcript

Claire Vo: And still, as I say, the year of our cloud, 2026, I still talk to engineers that say AI on our most complicated things cannot do a good job.

Ankur Goyal: I so viscerally disagree with it. There's no staff engineer who is running as many rigorou...