Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown
Shownote
When a new AI model drops, it’s judged based on a static benchmark grid that doesn’t account for how long the model is allowed to think. How then should we measure a model’s true capability? OpenAI research scientist Noam Brown returns to talk with Sarah G...
Highlights
OpenAI research scientist Noam Brown returns to discuss how traditional AI benchmarks fail to account for test-time compute scaling, where model capability increases with the compute budget allocated for thinking. He argues that static benchmark grids are broken and explores how models can reason for extended periods on complex tasks, from building poker solvers to disproving mathematical conjectures.
Chapters
00:00 Cold Open
00:43 Noam Brown Introduction
01:23 Why Benchmarks Are Broken
04:19 Compute Budgets and Projections
05:34 How Long Should Models Think?
06:47 Benchmark-Maxxing
08:34 Using Poker Bots as Evals
11:26 Safety Evals When Model Capability Scales With Budget
14:41 Release Cycle vs. Agent Runtime
17:06 Latent Model Capability
20:59 Limits on Recursive Self-Improvement
27:09 Large-Scale Multi-Agent Coordination
29:11 Competition at the Frontier
31:51 Breaking the Benchmark Grid Equilibrium
33:29 Why Benchmarks Should be Evaluated by Cost
Transcript
Noam Brown: With GPT-3, you couldn't scale test-time compute. Like, if you gave it a budget of $10 million and said, OK, well, let's see what GPT-3 can do, it really can't do that much. The precautionary frameworks and responsible scaling policies, they do...
