How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

How I AI

Jun 15


This episode explores how AI agents are changing the work of senior engineers and technical leaders, moving beyond simple code generation to tackle complex infrastructure and architecture problems. The conversation demystifies evals, presenting them as a modern, quantifiable version of a product requirements document: engineers specify the 'what' and the model determines the 'how'. It also offers a practical framework for deciding which tasks to delegate to AI agents and for capturing subjective human taste in repeatable, scalable evaluation systems.

Ankur Goyal argues that AI agents can outperform human engineers at rigorous benchmarking, such as testing different database indexes and execution engines to optimize slow queries, a task engineers often skip because it is tedious. He introduces the 'agent line' framework for deciding which decisions to hand off to an agent, emphasizing that AI maintains focus on hard problems where human attention decays. Evals are framed as a methodology for defining success: they encode 'what good looks like' so a model can figure out the 'how'. A key insight is that personal taste, like a designer's aesthetic, can be turned into a repeatable scoring function, which scales quality beyond one person. For AI-accelerated teams, fixing CI is the highest-leverage way to increase engineering velocity, because the primary job becomes building a feedback loop that turns real-world data into evals. When an agent fails, the strategy is to close the session, improve the evals, and retry from scratch.
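
To make the idea of a "repeatable scoring function" concrete, here is a minimal sketch of an eval in plain Python. It is not Braintrust's API or anything shown in the episode; the names `Case`, `exact_match`, `concise`, `run_eval`, and `toy_task` are all hypothetical, and the toy task stands in for whatever model or agent would actually be evaluated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One eval example: an input and what a good output looks like."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected text, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def concise(output: str, expected: str) -> float:
    """A stand-in for 'taste': score 1.0 when the output is at most 20 words."""
    return 1.0 if len(output.split()) <= 20 else 0.0


Scorer = Callable[[str, str], float]


def run_eval(task: Callable[[str], str], cases: List[Case], scorers: List[Scorer]) -> Dict[str, float]:
    """Run the task on every case and return the mean score per scorer."""
    totals = {scorer.__name__: 0.0 for scorer in scorers}
    for case in cases:
        output = task(case.input)
        for scorer in scorers:
            totals[scorer.__name__] += scorer(output, case.expected)
    return {name: total / len(cases) for name, total in totals.items()}


def toy_task(text: str) -> str:
    """Hypothetical placeholder for a model or agent call."""
    return text.upper()


CASES = [Case("hello", "HELLO"), Case("ci", "CI"), Case("eval", "evals")]
RESULTS = run_eval(toy_task, CASES, [exact_match, concise])
print(RESULTS)
```

Under these assumptions, the third case fails `exact_match`, so the run reports a score below 1.0 for that scorer. In the workflow described above, a failure like this would feed back into the eval set rather than into further prompting within the same agent session.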