[AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
Latent Space: The AI Engineer Podcast
2025/05/23

This episode covers the latest advancements in AI, focusing on Claude 4 and the return of Opus. It explores reasoning capabilities, tool use, and safety measures in AI models, alongside insights from Will Brown's work on verifiers and multi-turn reinforcement learning.
The discussion highlights the evolution of AI agents, emphasizing Claude 4's extended thinking mode and its implications for inference-time compute. Key points include managing token costs through thinking budgets, ensuring code trustworthiness, and addressing ethical considerations in model development. The speakers stress the importance of stress-testing to align AI behavior with societal norms. They also examine challenges in evaluating model outputs and integrating tools into reward systems, advocating for model-based rewards that offer greater flexibility. The conversation also touches on Anthropic's safety approach and the potential of academia as an unbiased evaluator of AI models. Finally, the hosts preview upcoming research directions and practical applications in agentic reinforcement learning.
00:00
Introducing Will Brown, new research lead for Prime Intellect
02:04
Claude is showing off its agent and tool-use suite
04:34
Anthropic views extended thinking as tool use in cloud environments.
07:05
Reported benchmarks show reduced reward hacking in Claude and Opus.
09:38
Setting a token budget can control usage and is becoming standard.
11:04
Thinking budgets and reasoning effort may be conceptually similar
13:32
Anthropic stress-tests its models like Claude to handle dilemmas between following user instructions and common norms.
16:06
Crafting environments helps understand model behavior and constraints.
18:35
Training models with unbounded text is complex.
21:05
Systems are highly sensitive to initial conditions, impacting predictability
23:36
Academia might be the best source for future model evaluations.
26:01
Many grad students lack research taste; focus on long-term bets.
28:32
Major updates to the verifiers repo extend the original GRPO demo with multi-turn RL and tool use.
31:08
Incorporating tool use into the model's reward system is crucial.
33:31
Models often box final answers to make verification easier.
36:07
Model-based RL using LLMs as judges is underexplored yet promising.
38:33
Alessio is collaborating with Kyle Corbitt on an agentic RL course