
[AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect

Show Notes

In an otherwise heavy week packed with Microsoft Build, Google I/O, and OpenAI io, the worst-kept secret in biglab land was the launch of Claude 4, particularly the triumphant return of Opus, which many had been clamoring for. We will leave the specific Claude 4 recap to AINews; however, we think that both Gemini’s progress on Deep Think this week and Claude 4 represent the next frontier of progress on inference-time compute/reasoning (at least until GPT-5 ships this summer). Will Brown’s talk at AIE NYC and open source work on verifiers have made him one of the most prominent voices able to publicly discuss (aka without the vagueposting LoRA they put on you when you join a biglab) the current state of the art in reasoning models and where current SOTA research directions lead. We discussed his latest paper, Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment, and he previewed his AIEWF talk on Agentic RL for those with the temerity to power through bad meetup audio.

Highlights

This episode covers Claude 4 and the return of Opus, exploring reasoning capabilities, tool use, and safety measures in AI models, alongside insights from Will Brown’s work on verifiers and multi-turn reinforcement learning.
00:00
Introducing Will Brown, new research lead for Prime Intellect
02:04
Claude 4 shows off its suite of agent and tool-use capabilities
04:34
Anthropic views extended thinking as tool use in Claude environments.
07:05
Reported benchmarks show reduced reward hacking in Claude and Opus.
09:38
Setting a token budget can control usage and is becoming standard.
11:04
Thinking budgets and reasoning effort may be conceptually similar
13:32
Anthropic stress-tests its models like Claude to handle dilemmas between following user instructions and common norms.
16:06
Crafting environments helps understand model behavior and constraints.
18:35
Training models with unbounded text is complex.
21:05
Systems are highly sensitive to initial conditions, impacting predictability
23:36
Academia might be the best source for future model evaluations.
26:01
Many grad students lack research taste; focus on long-term bets.
28:32
Major updates to the verifiers repo extend the original GRPO demo with multi-turn RL and tool use.
31:08
Incorporating tool use into the model's reward system is crucial.
33:31
Models often box final answers to make verification easier.
36:07
Model-based RL using LLMs as judges is underexplored yet promising.
38:33
Alessio is collaborating with Kyle Corbitt on an agentic RL course
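Several of the highlights above (verifiable rewards, models boxing their final answers to make verification easier) rest on the same basic mechanism: a reward function that extracts a candidate answer from a completion and checks it. A minimal sketch of that idea in Python is below; the function name and signature are illustrative, not the actual API of the verifiers library:

```python
import re

def boxed_answer_reward(completion: str, target: str) -> float:
    """Toy verifiable reward: 1.0 if the last \\boxed{...} span in the
    completion matches the target answer, else 0.0.

    Hypothetical helper for illustration only; the real verifiers repo
    structures rewards and environments differently.
    """
    # Find every \boxed{...} span; the model's final answer is the last one.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no boxed answer -> nothing to verify
    return 1.0 if matches[-1].strip() == target.strip() else 0.0
```

Rewards of this shape are what make RL on reasoning traces tractable: the chain of thought can be arbitrary text, as long as the final boxed span is machine-checkable.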

Chapters

Introduction and Episode Overview
00:00
Discussion on Claude 4 and its Features
02:01
Reasoning and Tool Use in AI Models
04:31
Extended Thinking in Claude and Model Differences
07:01
Speculation on Claude's Extended Thinking
09:31
Challenges and Controversies in AI Model Training
11:01
Technical Highlights and Code Trustworthiness
13:31
Token Costs and Incentives in AI Models
16:01
Thinking Budgets and AI Effort
18:31
Safety and Ethics in AI Model Development
21:01
Anthropic's Approach to AI Safety
23:31
LLM Arena and Evaluation Challenges
26:01
Developing Taste and Direction in AI Research
28:31
Recent Research and Multi-Turn RL
31:01
Tools and Incentives in AI Model Development
33:31
Challenges in Evaluating AI Model Outputs
36:01
Model-Based Rewards and Future Directions
38:31

Transcript

Host: Hello, AI engineers. We're back with a quick reaction pod for Claude 4 with the new reasoning research lead for Prime Intellect, Will Brown. Will Brown's talk at AIEWF and open source work on verifiers have made him one of the most promine...