scripod.com

Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

OpenAI research scientist Noam Brown returns to discuss how traditional AI benchmarks fail to account for test-time compute scaling, where model capability increases with the compute budget allocated for thinking. He argues that static benchmark grids are broken and explores how models can reason for extended periods on complex tasks, from building poker solvers to disproving mathematical conjectures.
Noam Brown explains that current evaluation methods are outdated because they ignore how long a model is allowed to think. Models like o1 can improve performance over weeks of reasoning, but benchmarks measure only a static snapshot at a single compute budget. This creates a safety gap, since dangerous capabilities may emerge only at higher compute budgets. Brown warns against benchmark-maxxing, where models game evaluations by spending more test-time compute. He notes that while models can optimize existing algorithms, they struggle to create novel ones, and that time bottlenecks make a rapid intelligence explosion unlikely. He also discusses the potential of multi-agent coordination and recommends using current models for high-stakes decisions, trusting their outputs more than those of human experts. Finally, he advocates reporting benchmark results with a cost axis, so that a model is compared fairly against models that think longer.
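To illustrate what a cost axis could look like in practice, here is a minimal Python sketch that compares two models at a matched budget instead of at a single fixed setting. Everything in it is an assumption made for illustration: the model names, the (cost, accuracy) numbers, and the choice to interpolate linearly in log-cost space are hypothetical and are not taken from the episode.

```python
import math

# Hypothetical (cost, accuracy) measurements; the numbers are invented for illustration.
# Costs within one curve are assumed to be distinct and positive.
CURVES = {
    "model_a": [(1, 0.40), (10, 0.55), (100, 0.70)],
    "model_b": [(1, 0.50), (10, 0.58), (100, 0.62)],
}


def accuracy_at_cost(curve, cost):
    """Interpolate accuracy linearly in log-cost space.

    Returns None when the cost lies outside the measured range,
    so the caller never sees an extrapolated score.
    """
    points = sorted(curve)
    if cost < points[0][0] or cost > points[-1][0]:
        return None
    for (c0, a0), (c1, a1) in zip(points, points[1:]):
        if c0 <= cost <= c1:
            t = (math.log(cost) - math.log(c0)) / (math.log(c1) - math.log(c0))
            return a0 + t * (a1 - a0)
    return points[0][1]  # single-point curve queried at its only cost


def best_at_budget(curves, cost):
    """Return the name of the model with the highest accuracy at this cost, or None."""
    scores = {name: accuracy_at_cost(curve, cost) for name, curve in curves.items()}
    scores = {name: acc for name, acc in scores.items() if acc is not None}
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for budget in (1, 10, 100):
        print(budget, best_at_budget(CURVES, budget))
```

With these made-up numbers the ranking flips as the budget grows, which is the kind of effect a single-point benchmark table would hide.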
00:00
Safety policies fail to account for test-time compute scaling
00:43
Broken AI evaluations and large-scale test-time compute.
03:52
Evaluate models with a budget or plot performance against test-time compute.
04:20
Project performance at larger budgets from results at smaller budgets.
05:36
Longer thinking improves benchmarks but practical use favors flexibility
06:54
Benchmark-maxxing can be misleading
10:55
Models could complete complex tasks like a PhD thesis in one go
11:37
Safety evaluations fail to account for test-time compute scaling
14:41
Stronger AI models can operate over longer horizons.
17:06
Latent capabilities can be unlocked with sufficient scaffolding and compute.
21:06
Test-time compute has varying effects on AI performance.
27:09
Multi-agent coordination at scale requires frontier models.
29:11
Trust AI outputs more than human experts
31:51
Lack of consensus on plotting benchmarks against a compute x-axis
35:30
Benchmarks should be evaluated with a cost axis.