scripod.com

How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

How I AI

Jun 15

Shownote

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs ...

Highlights

This podcast explores how AI agents are transforming the work of senior engineers and technical leaders, moving beyond simple code generation to tackle complex infrastructure and architecture problems. The conversation demystifies the concept of evals, presenting them as a modern, quantifiable version of a product requirements document that allows models to determine the 'how' while engineers focus on the 'what'. The discussion also provides a practical framework for deciding which tasks to delegate to AI agents and how to capture subjective human taste into repeatable, scalable evaluation systems.
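To make the "eval as a quantifiable requirements document" idea concrete, here is a minimal sketch in plain Python. It is not Braintrust's API or the setup shown in the episode: the names `Case`, `keyword_score`, `run_eval`, and `stub_docs_bot`, and the sample questions, are all hypothetical, and the keyword-matching scorer is a deliberately crude stand-in for whatever scoring function a team would actually write.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One requirement: an input plus what a good answer must contain."""
    question: str
    must_mention: list[str]


def keyword_score(answer: str, case: Case) -> float:
    """Return the fraction of required facts present in the answer (0.0 to 1.0)."""
    if not case.must_mention:
        return 1.0
    text = answer.lower()
    hits = sum(1 for fact in case.must_mention if fact.lower() in text)
    return hits / len(case.must_mention)


def run_eval(task: Callable[[str], str], cases: list[Case]) -> float:
    """Run the task on every case and return the mean score."""
    if not cases:
        raise ValueError("need at least one case")
    scores = [keyword_score(task(c.question), c) for c in cases]
    return sum(scores) / len(scores)


def stub_docs_bot(question: str) -> str:
    # Hypothetical stand-in for a model or agent call; always returns the same text.
    return "Set the API key in your environment, then call the client."


# Hypothetical cases: the "what" is written down, the "how" is left to the task.
CASES = [
    Case("How do I authenticate?", ["API key", "environment"]),
    Case("How do I log a trace?", ["trace", "span"]),
]

if __name__ == "__main__":
    print(f"mean score: {run_eval(stub_docs_bot, CASES):.2f}")
```

The cases state what a good answer looks like, and the score turns that into a number that can be tracked from one change to the next. Swapping `stub_docs_bot` for a real model call leaves the cases and the scorer untouched, which is what makes the check repeatable.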

00:00

AI agents can outperform human engineers

03:01

Using AI agents and evals to optimize slow database queries

06:13

AI agents enable safe, iterative testing of complex infrastructure changes

09:11

AI agents enable running rigorous benchmarks that even top engineers often skip

11:30

No excuse for lacking rigor or performance

14:00

The 'agent line' framework decides what tasks to delegate

20:14

Regaining flow state through focused coding with agents

20:32

Two camps of people with AI: those having fun and those feeling anxiety

23:06

Evals encode user stories in a quantifiable way

26:02

The agent runs safely within the playground, allowing experimentation without risk

30:21

Vibe checks lead to a whack-a-mole game

32:15

Turning expertise into a system applies it more broadly

33:13

AI produces higher quality outcomes than manual methods

33:45

Remove features that cause confusion

35:40

Improving CI is the highest-leverage way to accelerate engineering velocity with AI

37:32

Close the session, improve the evals, and retry from scratch

Chapters

Introduction to Ankur Goyal

00:00

Using AI agents for database optimization

03:00

Running exhaustive benchmarks with coding agents

06:10

Why staff engineers are wrong about AI limitations

09:03

The “agent line” framework for delegation

11:30

Ankur’s workflow: running 4 to 6 concurrent agents

14:00

Technical setup: foreground agents, background agents, and cloud environments

17:16

Spending time with AI tools

20:32

Demystifying evals

23:06

Live demo: Building an eval for documentation answers

26:02

The alternative to evals: vibe checks and whack-a-mole

30:20

Capturing designer taste in scoring functions

32:09

Quick recap

33:13

Managing velocity and throughput

33:44

Why CI/CD investment is critical for AI-accelerated teams

35:40

Ankur’s prompting strategy when agents fail

37:30

Closing thoughts and how to connect

39:10

Transcript

Claire Vo: And still, as I say, the year of our cloud, 2026, I still talk to engineers that say AI on our most complicated things cannot do a good job.

Ankur Goyal: I so viscerally disagree with it. There's no staff engineer who is running as many rigorou...