Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Show Notes

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, ...

Highlights

In this insightful conversation, Hamel Husain and Shreya Shankar break down the essential practice of AI evaluations, emphasizing their role as a foundational discipline for building effective AI-powered products. They move beyond buzzwords to reveal how structured, human-led evals enable teams to systematically understand model behavior, uncover hidden failure modes, and drive meaningful product improvements.
05:00
Evals help move beyond vibe checks by providing measurable feedback for AI applications
16:01
Looking at raw traces is crucial when analyzing an LLM application.
22:27
AI told a user about non-existent virtual tours, revealing a hallucination error
23:54
Manual open coding is crucial because LLMs lack context for accurate error assessment
25:23
One trusted person with domain expertise should lead AI evaluations to avoid overcomplication.
31:05
Theoretical saturation is reached when no new problem types are discovered during analysis.
31:39
Axial codes act as failure-mode labels to cluster and identify the most common AI errors.
44:39
17 conversational flow issues were identified using a pivot table analysis.
46:06
Dumb engineering errors in AI, like formatting mistakes, don't require full evals; they're obvious to fix.
51:05
LLM judges can reliably output pass or fail results for complex AI behaviors.
52:10
Use binary yes/no judgments instead of rating scales for reliable LLM evals
57:19
High agreement percentages between LLM and human judges can be misleading when errors are rare.
1:03:19
Experts can't anticipate all failure modes in LLM output validation
1:05:09
Fixing problems doesn't always require writing an eval.
1:07:41
Implementing LLM judges for systematic improvement is a skill set product managers can use to build profitable products.
1:09:57
Strong opinions against AI evals often ignore their widespread practical use in development.
1:17:48
A/B tests should be powered by actual error analysis, not hypotheticals
1:18:26
Evals are essentially data science for understanding AI product performance.
1:22:30
More people should adopt structured approaches to application-specific evals.
1:23:02
There's high demand for Hamel and Shreya's Maven course.
1:29:59
The goal of evals is to improve the product, not just catch bugs.
1:33:19
AI sending factually correct emails isn't good enough; product thinking is essential for real effectiveness.
1:36:30
Students get 10 months of free, unlimited access to all course-related AI content and resources
1:40:57
Hamel's life motto is 'Keep learning and think like a beginner'
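
The highlights at 51:05 and 52:10 describe forcing an LLM judge to return a binary pass/fail verdict for one specific failure mode. Here is a minimal sketch of that idea; the prompt, the failure mode ("hallucinated amenities"), and the `call_llm` client are all hypothetical stand-ins, not anything shown in the episode.

```python
# Minimal binary LLM-as-judge sketch. `call_llm` is a hypothetical
# stand-in for whatever model client you use; the prompt text and the
# failure mode are made up for illustration.

JUDGE_PROMPT = """You are evaluating an AI property-management assistant.
Failure mode: the assistant mentions amenities or services that do not
exist in the provided listing data.

Listing data:
{listing}

Assistant reply:
{reply}

Answer with exactly one word: PASS or FAIL."""


def judge(listing: str, reply: str, call_llm) -> bool:
    """Return True (pass) or False (fail); forces a binary verdict."""
    raw = call_llm(JUDGE_PROMPT.format(listing=listing, reply=reply))
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        # Refuse ambiguous output rather than guessing a score.
        raise ValueError(f"non-binary judge output: {raw!r}")
    return verdict == "PASS"


# Usage with a stubbed model client:
stub = lambda prompt: "FAIL"
print(judge("2BR, no pool", "Yes! Enjoy our rooftop pool.", stub))  # False
```

Rejecting anything other than PASS/FAIL is one way to act on the advice at 52:10: binary judgments are easier to validate against human labels than 1–5 rating scales.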

Chapters

Introduction to Hamel and Shreya
00:00
What are evals?
04:57
Demo: Examining real traces from a property management AI assistant
09:56
Writing notes on errors
16:51
Why LLMs can’t replace humans in the initial error analysis
23:54
The concept of a “benevolent dictator” in the eval process
25:16
Theoretical saturation: when to stop
28:07
Using axial codes to help categorize and synthesize error notes
31:39
The results
44:39
Building an LLM-as-judge to evaluate specific failure modes
46:06
The difference between code-based evals and LLM-as-judge
48:31
Example: LLM-as-judge
52:10
Testing your LLM judge against human judgment
54:45
Why evals are the new PRDs for AI products
1:00:51
How many evals you actually need
1:05:09
What comes after evals
1:07:41
The great evals debate
1:09:57
Why dogfooding isn’t enough for most AI products
1:15:15
OpenAI’s Statsig acquisition
1:18:23
Tips and tricks for implementing evals effectively
1:22:28
The Claude Code controversy and the importance of context
1:23:02
Common misconceptions around evals
1:24:13
The time investment
1:30:37
Overview of their comprehensive evals course
1:33:38
Lightning round and final thoughts
1:37:57
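
The chapter "Testing your LLM judge against human judgment" touches a statistical trap flagged at 57:19: when failures are rare, raw agreement between judge and human can look high even for a useless judge. A quick sketch with made-up numbers, using Cohen's kappa to correct for chance agreement:

```python
# Why raw agreement misleads when errors are rare (numbers are made up).
# A lazy judge that always says "pass" still agrees with the human 95%
# of the time if only 5% of traces actually fail.

def agreement_and_kappa(human, judge):
    """Raw agreement and Cohen's kappa for two binary label lists."""
    n = len(human)
    raw = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement expected from each rater's marginal "pass" rate
    p_h = sum(human) / n
    p_j = sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = (raw - expected) / (1 - expected) if expected < 1 else 0.0
    return raw, kappa


# 1,000 traces, 5% true failures (0 = fail, 1 = pass)
human = [0] * 50 + [1] * 950
lazy_judge = [1] * 1000  # always says "pass", never catches a failure

raw, kappa = agreement_and_kappa(human, lazy_judge)
print(f"raw agreement: {raw:.0%}")    # 95% -- looks great
print(f"Cohen's kappa: {kappa:.2f}")  # 0.00 -- no better than chance
```

This is why the guests stress checking judge quality on the rare failure cases specifically, not just overall agreement.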

Transcript

Lenny Rachitsky: To build great AI products, you need to be really good at building evals.

Hamel Husain: It's the highest ROI activity you can engage in. This process is a lot of fun. Everyone that does this immediately gets addicted to it. When you're bu...