
[AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect

Show Notes

In an otherwise heavy week packed with Microsoft Build, Google I/O, and OpenAI io, the worst-kept secret in biglab land was the launch of Claude 4, particularly the triumphant return of Opus, which many had been clamoring for. We will leave the specific Claude 4 recap to AINews; however, we think that both Gemini’s progress on Deep Think this week and Claude 4 represent the next frontier of progress on inference-time compute/reasoning (at least until GPT-5 ships this summer). Will Brown’s talk at AIE NYC and open source work on verifiers have made him one of the most prominent voices able to publicly discuss (aka without the vagueposting LoRA they put on you when you join a biglab) the current state of the art in reasoning models and where current SOTA research directions lead. We discussed his latest paper, Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment, and he previewed his AIEWF talk on Agentic RL for those with the temerity to power through bad meetup audio.

Highlights

This episode covers Claude 4 and the return of Opus, exploring reasoning capabilities, tool use, and safety measures in AI models, alongside insights from Will Brown’s work on verifiers and multi-turn reinforcement learning.
00:00
Introducing Will Brown, new research lead for Prime Intellect
02:04
Claude 4 shows off its suite of agent and tool-use capabilities
04:34
Anthropic views extended thinking as tool use in Claude environments.
07:05
Reported benchmarks show reduced reward hacking in Claude and Opus.
09:38
Setting a token budget can control usage and is becoming standard.
11:04
Thinking budgets and reasoning effort may be conceptually similar
13:32
Anthropic stress-tests its models like Claude to handle dilemmas between following user instructions and common norms.
16:06
Crafting environments helps understand model behavior and constraints.
18:35
Training models with unbounded text is complex.
21:05
Systems are highly sensitive to initial conditions, impacting predictability
23:36
Academia might be the best source for future model evaluations.
26:01
Many grad students lack research taste; focus on long-term bets.
28:32
Major updates to the verifiers repo extend the original GRPO demo with multi-turn RL and tool use.
31:08
Incorporating tool use into the model's reward system is crucial.
33:31
Models often box final answers to make verification easier.
36:07
Model-based RL using LLMs as judges is underexplored yet promising.
38:33
Alessio is collaborating with Kyle Corbitt on an agentic RL course
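Several of the highlights above (verifiable rewards, models boxing their final answers to make verification easier) rest on the same basic mechanism: a reward function that extracts a candidate answer from a completion and checks it. A minimal sketch of that idea in Python is below; the function name and signature are illustrative, not the actual API of the verifiers library:

```python
import re

def boxed_answer_reward(completion: str, target: str) -> float:
    """Toy verifiable reward: 1.0 if the last \\boxed{...} span in the
    completion matches the target answer, else 0.0.

    Hypothetical helper for illustration only; the real verifiers repo
    structures rewards and environments differently.
    """
    # Find every \boxed{...} span; the model's final answer is the last one.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no boxed answer -> nothing to verify
    return 1.0 if matches[-1].strip() == target.strip() else 0.0
```

Rewards of this shape are what make RL on reasoning traces tractable: the chain of thought can be arbitrary text, as long as the final boxed span is machine-checkable.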

Chapters

Introduction and Episode Overview
00:00
Discussion on Claude 4 and its Features
02:01
Reasoning and Tool Use in AI Models
04:31
Extended Thinking in Claude and Model Differences
07:01
Speculation on Claude's Extended Thinking
09:31
Challenges and Controversies in AI Model Training
11:01
Technical Highlights and Code Trustworthiness
13:31
Token Costs and Incentives in AI Models
16:01
Thinking Budgets and AI Effort
18:31
Safety and Ethics in AI Model Development
21:01
Anthropic's Approach to AI Safety
23:31
LLM Arena and Evaluation Challenges
26:01
Developing Taste and Direction in AI Research
28:31
Recent Research and Multi-Turn RL
31:01
Tools and Incentives in AI Model Development
33:31
Challenges in Evaluating AI Model Outputs
36:01
Model-Based Rewards and Future Directions
38:31

Transcript

Host: Hello, AI engineers. We're back with a quick reaction pod for Claude 4 with the new reasoning research lead for Prime Intellect, Will Brown. Will Brown's talk at AIEWF and open source work on verifiers have made him one of the most promine...