scripod.com

Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

Shownote

When a new AI model drops, it’s judged based on a static benchmark grid that doesn’t account for how long the model is allowed to think. How then should we measure a model’s true capability? OpenAI research scientist Noam Brown returns to talk with Sarah G...

Highlights

OpenAI research scientist Noam Brown returns to discuss how traditional AI benchmarks fail to account for test-time compute scaling, where model capability increases with the compute budget allocated for thinking. He argues that static benchmark grids are broken and explores how models can reason for extended periods on complex tasks, from building poker solvers to disproving mathematical conjectures.
00:00
Safety policies fail to account for test-time compute scaling
00:43
Broken AI evaluations and large-scale test-time compute
03:52
Evaluate models at a fixed compute budget, or plot performance against test-time compute
04:20
Project performance at larger budgets from results at smaller budgets
05:36
Longer thinking improves benchmark scores, but practical use favors flexibility
06:54
Benchmark-maxxing can be misleading
10:55
Models could complete complex tasks like a PhD thesis in one go
11:37
Safety evaluations fail to account for test-time compute scaling
14:41
Stronger AI models can operate over longer horizons
17:06
Latent capabilities can be unlocked with sufficient scaffolding and compute
21:06
Test-time compute has varying effects on AI performance
27:09
Multi-agent coordination at scale requires frontier models
29:11
Trust AI outputs more than human experts
31:51
Lack of consensus on using benchmarks with an x-axis
35:30
Benchmarks should be evaluated with a cost axis
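
As a rough illustration of the "plot performance against test-time compute" and "project from smaller budgets" ideas above, here is a minimal Python sketch. It is not from the episode. The log-linear trend (score rising linearly with the logarithm of the budget) is an assumption made only for illustration, the function names `fit_log_linear` and `project` are hypothetical, and the numbers are placeholder values, not measurements of any real model.

```python
import math


def fit_log_linear(budgets, scores):
    """Least-squares fit of score = a + b * log10(budget).

    Assumption: performance is roughly linear in log-budget over the
    measured range. Real curves may saturate or bend.
    """
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def project(a, b, budget):
    """Extrapolate the fitted trend to a budget that was not measured."""
    return a + b * math.log10(budget)


# Hypothetical placeholder data: budget in dollars, score as a fraction solved.
budgets = [1, 10, 100]
scores = [0.2, 0.3, 0.4]

a, b = fit_log_linear(budgets, scores)

# Report the whole curve (score versus budget) instead of a single grid cell.
for budget, score in zip(budgets, scores):
    print(f"budget=${budget}: measured score={score:.2f}")

print(f"budget=$1000: projected score={project(a, b, 1000):.2f}")
```

The point of the sketch is the reporting shape: a score is always paired with the budget it was obtained at, and any number beyond the measured budgets is labeled as a projection. A projection like this should be treated with caution far outside the measured range.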

Chapters

Cold Open
00:00
Noam Brown Introduction
00:43
Why Benchmarks Are Broken
01:23
Compute Budgets and Projections
04:19
How Long Should Models Think?
05:34
Benchmark-Maxxing
06:47
Using Poker Bots as Evals
08:34
Safety Evals When Model Capability Scales With Budget
11:26
Release Cycle vs. Agent Runtime
14:41
Latent Model Capability
17:06
Limits on Recursive Self-Improvement
20:59
Large-Scale Multi-Agent Coordination
27:09
Competition at the Frontier
29:11
Breaking the Benchmark Grid Equilibrium
31:51
Why Benchmarks Should Be Evaluated by Cost
33:29

Transcript

Noam Brown: With GPT-3, you couldn't scale test-time compute. Like, if you gave it a budget of $10 million and said, OK, well, let's see what GPT-3 can do, it really can't do that much. The precautionary frameworks and responsible scaling policies, they do...