scripod.com

How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

How I AI

1 DAY AGO
This episode explores how AI agents are transforming the work of senior engineers and technical leaders, moving beyond simple code generation to tackle complex infrastructure and architecture problems. The conversation demystifies evals, presenting them as a modern, quantifiable version of a product requirements document: engineers define the 'what' and the model determines the 'how'. The discussion also provides a practical framework for deciding which tasks to delegate to AI agents and for capturing subjective human taste in repeatable, scalable evaluation systems.
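One way to picture an eval as a quantifiable requirements document is a set of example inputs plus a scoring function that encodes 'what good looks like'. The sketch below is illustrative only: `EvalCase`, `score_summary`, `run_eval`, and the toy task are hypothetical names invented for this example, not Braintrust's API or anything shown in the episode.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One user story: a prompt plus the facts a good answer must mention."""
    prompt: str
    must_mention: list[str]
    max_words: int


def score_summary(output: str, case: EvalCase) -> float:
    """Hypothetical rubric: keyword coverage, zeroed if the answer is too long."""
    if len(output.split()) > case.max_words:
        return 0.0
    text = output.lower()
    hits = sum(1 for kw in case.must_mention if kw.lower() in text)
    return hits / len(case.must_mention)


def run_eval(task: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Average score of `task` across all cases.

    The cases and scorer define the 'what'; `task` is free to choose the 'how'.
    """
    scores = [score_summary(task(case.prompt), case) for case in cases]
    return sum(scores) / len(scores)


def toy_task(prompt: str) -> str:
    # Stand-in for a model or agent call (assumption: always the same answer).
    return "Add an index on user_id to speed up the slow query."


CASES = [
    EvalCase("Why is the orders query slow?", ["index", "user_id"], 20),
    EvalCase("How do we fix the slow query?", ["index", "benchmark"], 20),
]

if __name__ == "__main__":
    print(f"average score: {run_eval(toy_task, CASES):.2f}")
```

Because the rubric is a plain function, the same taste can be applied to every new output instead of relying on one person's review; a real scorer would likely be richer than keyword matching.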
Ankur Goyal argues that AI agents can outperform human engineers in rigorous benchmarking, such as testing different database indexes and execution engines to optimize slow queries, a task often skipped due to its tedious nature. He introduces the 'agent line' framework for deciding which decisions to hand off to an agent, emphasizing that AI maintains focus on hard problems where human attention decays. The conversation demystifies evals as a methodology for defining success, encoding 'what good looks like' so a model can figure out the 'how'. A key insight is turning personal taste, like a designer's aesthetic, into a repeatable scoring function to scale quality beyond one person. For AI-accelerated teams, fixing CI is the highest-leverage way to speed up engineering velocity, as the primary job becomes building a feedback loop to turn real-world data into evals. When agents fail, the strategy is to close the session, improve the evals, and retry from scratch.
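The index benchmarking mentioned above can be sketched with a small, repeatable harness. This is a minimal illustration using Python's built-in `sqlite3` module and an in-memory table; the `orders` schema, the row counts, and the helper names are assumptions made for this example, and the episode does not specify which database or tooling was used.

```python
import sqlite3
import time

QUERY = "SELECT COUNT(*), SUM(total) FROM orders WHERE user_id = ?"


def build_db(rows: int = 20000) -> sqlite3.Connection:
    """Create an in-memory table with synthetic data (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (user_id, total) VALUES (?, ?)",
        ((i % 1000, float(i)) for i in range(rows)),
    )
    conn.commit()
    return conn


def time_query(conn: sqlite3.Connection, repeats: int = 200) -> float:
    """Total wall-clock seconds to run QUERY `repeats` times."""
    start = time.perf_counter()
    for i in range(repeats):
        conn.execute(QUERY, (i % 1000,)).fetchone()
    return time.perf_counter() - start


def query_plan(conn: sqlite3.Connection) -> str:
    """Join the detail column (last in each row) of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)).fetchall()
    return " ".join(str(row[-1]) for row in rows)


def benchmark() -> dict:
    """Measure the same query before and after adding an index."""
    conn = build_db()
    report = {
        "result_before": conn.execute(QUERY, (7,)).fetchone(),
        "plan_before": query_plan(conn),
        "seconds_before": time_query(conn),
    }
    conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
    report["result_after"] = conn.execute(QUERY, (7,)).fetchone()
    report["plan_after"] = query_plan(conn)
    report["seconds_after"] = time_query(conn)
    conn.close()
    return report


if __name__ == "__main__":
    for key, value in benchmark().items():
        print(f"{key}: {value}")
```

The harness checks that results are unchanged and records the query plan alongside the timings, so a candidate change is judged on correctness as well as speed. Timings vary by machine, so no expected numbers are given.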
00:00
AI agents can outperform human engineers
03:01
Using AI agents and evals to optimize slow database queries
06:13
AI agents enable safe, iterative testing of complex infrastructure changes.
09:11
AI agents enable running rigorous benchmarks that even top engineers often skip
11:30
No excuse for lacking rigor or performance
14:00
The 'agent line' framework decides what tasks to delegate.
20:14
Regaining flow state through focused coding with agents
20:32
Two camps of people with AI: those having fun and those feeling anxiety
23:06
Evals encode user stories in a quantifiable way
26:02
The agent runs safely within the playground, allowing experimentation without risk.
30:21
Vibe checks lead to a whack-a-mole game
32:15
Turning expertise into a system applies it more broadly.
33:13
AI produces higher quality outcomes than manual methods
33:45
Remove features that cause confusion
35:40
Improving CI is the highest-leverage way to accelerate engineering velocity with AI
37:32
Close the session, improve the evals, and retry from scratch.