Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

No Priors: Artificial Intelligence | Technology | Startups

1 DAY AGO

OpenAI research scientist Noam Brown returns to discuss how traditional AI benchmarks fail to account for test-time compute scaling, where model capability increases with the compute budget allocated for thinking. He argues that static benchmark grids are broken and explores how models can reason for extended periods on complex tasks, from building poker solvers to disproving mathematical conjectures.

Noam Brown explains that current evaluation methods are outdated because they ignore how long a model is allowed to think. Models like o1 can improve performance over weeks of reasoning, but benchmarks measure only static snapshots. This creates a safety gap, as dangerous capabilities may emerge only at higher compute budgets. Brown warns against benchmark-maxxing, where models game evaluations by using more test-time compute. He highlights that while models can optimize existing algorithms, they struggle to create novel ones, making a rapid intelligence explosion unlikely because of time bottlenecks. He also discusses the potential of multi-agent coordination and recommends using current models for high-stakes decisions, trusting their outputs more than those of human experts. Finally, he advocates reporting benchmark results with a cost axis, so that a model is compared fairly against models that are given longer to think.
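
The cost-axis idea can be made concrete with a small sketch. This is not Brown's method or any published evaluation harness. It is only an illustration, assuming each benchmark result is recorded as a (model, cost, score) triple. The names `Run`, `best_score_within_budget`, and `pareto_frontier` are hypothetical, and every number in `RUNS` is an invented placeholder, not a real benchmark result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str       # hypothetical model label
    cost_usd: float  # assumed cost per task at this thinking budget
    score: float     # benchmark accuracy in [0, 1]


def best_score_within_budget(runs, model, budget_usd):
    """Best score the model reaches without exceeding the budget, or None."""
    eligible = [r.score for r in runs
                if r.model == model and r.cost_usd <= budget_usd]
    return max(eligible) if eligible else None


def pareto_frontier(runs):
    """Runs that strictly improve on the score of every cheaper-or-equal run."""
    frontier = []
    best = float("-inf")
    # Sort by cost; at equal cost, consider the higher score first.
    for r in sorted(runs, key=lambda r: (r.cost_usd, -r.score)):
        if r.score > best:
            frontier.append(r)
            best = r.score
    return frontier


# Placeholder numbers, invented purely for illustration.
RUNS = [
    Run("model_a", 0.10, 0.52),
    Run("model_a", 1.00, 0.61),
    Run("model_a", 10.00, 0.74),
    Run("model_b", 0.10, 0.58),
    Run("model_b", 1.00, 0.63),
]

if __name__ == "__main__":
    for budget in (0.10, 1.00, 10.00):
        a = best_score_within_budget(RUNS, "model_a", budget)
        b = best_score_within_budget(RUNS, "model_b", budget)
        print(f"budget ${budget:.2f}: model_a={a}, model_b={b}")
    print([(r.model, r.cost_usd) for r in pareto_frontier(RUNS)])
```

With these placeholder values, a single-number leaderboard would rank model_a first on its most expensive run, while a comparison at a matched budget of $1.00 favors model_b. That gap between the two rankings is the kind of distortion a cost axis is meant to expose.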