Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Show Notes

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, ...

Highlights

In this insightful conversation, Hamel Husain and Shreya Shankar break down the essential practice of AI evaluations, emphasizing their role as a foundational discipline for building effective AI-powered products. They move beyond buzzwords to reveal how structured, human-led evals enable teams to systematically understand model behavior, uncover hidden failure modes, and drive meaningful product improvements.
05:00
Evals help move beyond vibe checks by providing measurable feedback for AI applications
16:01
Looking at raw traces is crucial when analyzing an LLM application.
22:27
AI told a user about non-existent virtual tours, revealing a hallucination error
23:54
Manual open coding is crucial because LLMs lack context for accurate error assessment
25:23
One trusted person with domain expertise should lead AI evaluations to avoid overcomplication.
31:05
Theoretical saturation is reached when no new problem types are discovered during analysis.
31:39
Axial codes act as failure-mode labels to cluster and identify the most common AI errors.
44:39
17 conversational flow issues were identified using a pivot table analysis.
46:06
Dumb engineering errors in AI, like formatting mistakes, don't require full evals; they're obvious to fix.
51:05
LLM judges can reliably output pass or fail results for complex AI behaviors.
52:10
Use binary yes/no judgments instead of rating scales for reliable LLM evals
57:19
High agreement percentages between LLM and human judges can be misleading when errors are rare.
1:03:19
Experts can't anticipate all failure modes in LLM output validation
1:05:09
Fixing problems doesn't always require writing an eval.
1:07:41
Implementing LLM judges for systematic improvement is a skill set product managers can use to build profitable products.
1:09:57
Strong opinions against AI evals often ignore their widespread practical use in development.
1:17:48
A/B tests should be powered by actual error analysis, not hypotheticals
1:18:26
Evals are essentially data science for understanding AI product performance.
1:22:30
More people should adopt structured approaches to application-specific evals.
1:23:02
There's high demand for Hamel and Shreya's Maven course.
1:29:59
The goal of evals is to improve the product, not just catch bugs.
1:33:19
AI sending factually correct emails isn't good enough; product thinking is essential for real effectiveness.
1:36:30
Students get 10 months of free, unlimited access to all course-related AI content and resources
1:40:57
Hamel's life motto is 'Keep learning and think like a beginner'
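
The highlights at 51:05 and 52:10 describe forcing an LLM judge to return a binary pass/fail verdict for one specific failure mode. Here is a minimal sketch of that idea; the prompt, the failure mode ("hallucinated amenities"), and the `call_llm` client are all hypothetical stand-ins, not anything shown in the episode.

```python
# Minimal binary LLM-as-judge sketch. `call_llm` is a hypothetical
# stand-in for whatever model client you use; the prompt text and the
# failure mode are made up for illustration.

JUDGE_PROMPT = """You are evaluating an AI property-management assistant.
Failure mode: the assistant mentions amenities or services that do not
exist in the provided listing data.

Listing data:
{listing}

Assistant reply:
{reply}

Answer with exactly one word: PASS or FAIL."""


def judge(listing: str, reply: str, call_llm) -> bool:
    """Return True (pass) or False (fail); forces a binary verdict."""
    raw = call_llm(JUDGE_PROMPT.format(listing=listing, reply=reply))
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        # Refuse ambiguous output rather than guessing a score.
        raise ValueError(f"non-binary judge output: {raw!r}")
    return verdict == "PASS"


# Usage with a stubbed model client:
stub = lambda prompt: "FAIL"
print(judge("2BR, no pool", "Yes! Enjoy our rooftop pool.", stub))  # False
```

Rejecting anything other than PASS/FAIL is one way to act on the advice at 52:10: binary judgments are easier to validate against human labels than 1–5 rating scales.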

Chapters

Introduction to Hamel and Shreya
00:00
What are evals?
04:57
Demo: Examining real traces from a property management AI assistant
09:56
Writing notes on errors
16:51
Why LLMs can’t replace humans in the initial error analysis
23:54
The concept of a “benevolent dictator” in the eval process
25:16
Theoretical saturation: when to stop
28:07
Using axial codes to help categorize and synthesize error notes
31:39
The results
44:39
Building an LLM-as-judge to evaluate specific failure modes
46:06
The difference between code-based evals and LLM-as-judge
48:31
Example: LLM-as-judge
52:10
Testing your LLM judge against human judgment
54:45
Why evals are the new PRDs for AI products
1:00:51
How many evals you actually need
1:05:09
What comes after evals
1:07:41
The great evals debate
1:09:57
Why dogfooding isn’t enough for most AI products
1:15:15
OpenAI’s Statsig acquisition
1:18:23
Tips and tricks for implementing evals effectively
1:22:28
The Claude Code controversy and the importance of context
1:23:02
Common misconceptions around evals
1:24:13
The time investment
1:30:37
Overview of their comprehensive evals course
1:33:38
Lightning round and final thoughts
1:37:57
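
The chapter "Testing your LLM judge against human judgment" touches a statistical trap flagged at 57:19: when failures are rare, raw agreement between judge and human can look high even for a useless judge. A quick sketch with made-up numbers, using Cohen's kappa to correct for chance agreement:

```python
# Why raw agreement misleads when errors are rare (numbers are made up).
# A lazy judge that always says "pass" still agrees with the human 95%
# of the time if only 5% of traces actually fail.

def agreement_and_kappa(human, judge):
    """Raw agreement and Cohen's kappa for two binary label lists."""
    n = len(human)
    raw = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement expected from each rater's marginal "pass" rate
    p_h = sum(human) / n
    p_j = sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = (raw - expected) / (1 - expected) if expected < 1 else 0.0
    return raw, kappa


# 1,000 traces, 5% true failures (0 = fail, 1 = pass)
human = [0] * 50 + [1] * 950
lazy_judge = [1] * 1000  # always says "pass", never catches a failure

raw, kappa = agreement_and_kappa(human, lazy_judge)
print(f"raw agreement: {raw:.0%}")    # 95% -- looks great
print(f"Cohen's kappa: {kappa:.2f}")  # 0.00 -- no better than chance
```

This is why the guests stress checking judge quality on the rare failure cases specifically, not just overall agreement.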

Transcript

Lenny Rachitsky: To build great AI products, you need to be really good at building evals.

Hamel Husain: It's the highest ROI activity you can engage in. This process is a lot of fun. Everyone that does this immediately gets addicted to it. When you're bu...