scripod.com

Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

No Priors: Artificial Intelligence | Technology | Startups

1 DAY AGO

Shownote

When a new AI model drops, it’s judged based on a static benchmark grid that doesn’t account for how long the model is allowed to think. How then should we measure a model’s true capability? OpenAI research scientist Noam Brown returns to talk with Sarah G...

Highlights

OpenAI research scientist Noam Brown returns to discuss how traditional AI benchmarks fail to account for test-time compute scaling, where model capability increases with the compute budget allocated for thinking. He argues that static benchmark grids are broken and explores how models can reason for extended periods on complex tasks, from building poker solvers to disproving mathematical conjectures.

00:00 Safety policies fail to account for test-time compute scaling
00:43 Broken AI evaluations and large-scale test-time compute
03:52 Evaluate models with a budget or plot performance against test-time compute
04:20 Projecting performance from smaller budgets is suggested
05:36 Longer thinking improves benchmarks but practical use favors flexibility
06:54 Benchmark-maxxing can be misleading
10:55 Models could complete complex tasks like a PhD thesis in one go
11:37 Safety evaluations fail to account for test-time compute scaling
14:41 Stronger AI models can operate over longer horizons
17:06 Latent capabilities can be unlocked with sufficient scaffolding and compute
21:06 Test-time compute has varying effects on AI performance
27:09 Multi-agent coordination at scale requires frontier models
29:11 Trust AI outputs more than human experts
31:51 Lack of consensus on using benchmarks with an x-axis
35:30 Benchmarks should be evaluated with a cost axis
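The highlights on budgeted evaluation (03:52) and projecting from smaller budgets (04:20) can be made concrete with a small sketch. The Python below reports a score at several test-time compute budgets and extrapolates to a larger one. Everything in it is an assumption for illustration: the measurements are hypothetical placeholders, not results from the episode, and the log-linear trend (score rising linearly in log10 of budget) is one simple modeling choice, not a method the speakers describe.

```python
import math


def fit_log_linear(budgets, scores):
    """Least-squares fit of score = a + b * log10(budget).

    The log-linear form is an illustrative assumption, not a claim
    about how any real model scales.
    """
    xs = [math.log10(budget) for budget in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def project(a, b, budget):
    """Projected score at a budget, clamped to the valid range [0, 1]."""
    return max(0.0, min(1.0, a + b * math.log10(budget)))


# Hypothetical placeholder data: (test-time compute budget in dollars, score).
measurements = [(1, 0.20), (10, 0.30), (100, 0.40), (1000, 0.50)]

budgets = [budget for budget, _ in measurements]
scores = [score for _, score in measurements]
a, b = fit_log_linear(budgets, scores)

# Report the benchmark as a curve over budgets rather than a single number.
for budget, score in measurements:
    print(f"budget ${budget:>6}: measured score {score:.2f}")

# Extrapolate to a budget that was too expensive to run directly.
projected = project(a, b, 10_000)
print(f"budget $ 10000: projected score {projected:.2f}")
```

The point of the sketch is the reporting shape: a score paired with the budget that produced it, plus an explicit projection that is labeled as such. A real evaluation would need to check whether the fitted trend actually holds before trusting the extrapolation.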

Chapters

00:00 Cold Open
00:43 Noam Brown Introduction
01:23 Why Benchmarks Are Broken
04:19 Compute Budgets and Projections
05:34 How Long Should Models Think?
06:47 Benchmark-Maxxing
08:34 Using Poker Bots as Evals
11:26 Safety Evals When Model Capability Scales With Budget
14:41 Release Cycle vs. Agent Runtime
17:06 Latent Model Capability
20:59 Limits on Recursive Self-Improvement
27:09 Large-Scale Multi-Agent Coordination
29:11 Competition at the Frontier
31:51 Breaking the Benchmark Grid Equilibrium
33:29 Why Benchmarks Should Be Evaluated by Cost

Transcript

Noam Brown: With GPT-3, you couldn't scale test-time compute. Like, if you gave it a budget of $10 million and said, OK, well, let's see what GPT-3 can do, it really can't do that much. The precautionary frameworks and responsible scaling policies, they do...