The RLVR Revolution — with Nathan Lambert (AI2, Interconnects.ai)
Latent Space: The AI Engineer Podcast
2025/07/31

In this episode of the Latent Space podcast, Nathan Lambert returns to discuss the evolution of reinforcement learning techniques in AI model training, particularly the shift from RLHF (reinforcement learning from human feedback) to RLVR (reinforcement learning with verifiable rewards). The conversation delves into the technical and strategic implications of these methods, as well as their applications in open-source AI development. Lambert also reflects on the broader challenges of training models to use tools effectively and the importance of reward design in preventing overoptimization.
The episode explores the transition from RLHF to RLVR, a method that uses verifiable, objective rewards to train models more efficiently, especially in domains like math and code. Nathan Lambert discusses the Tulu model series, which aims to make advanced post-training techniques accessible to the open community. A key focus is the challenge of integrating tool use into reinforcement learning, where designing effective reward functions remains a major hurdle. Overoptimization—models gaming the reward system rather than solving tasks—is a recurring issue, especially in code generation. The conversation also highlights the importance of evaluation platforms like Chatbot Arena, the debate between hybrid and unified reasoning models, and the future of open-source AI. Lambert concludes with a vision for building an 'American DeepSeek'—a fully open, reasoning-capable model with transparent training methods and infrastructure.
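Although the episode is conversational, the core mechanic behind RLVR is simple enough to sketch: the reward comes from an automatic check against a known answer rather than from a learned preference model. Below is a minimal illustrative sketch for math-style problems; the function names and answer format are hypothetical and not taken from Tulu or any AI2 codebase.

```python
# Minimal sketch of a "verifiable reward" in the RLVR sense: instead of a learned
# preference model (as in RLHF), the reward is a deterministic check against a
# ground-truth answer. Names here are illustrative only.

import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0

# This scalar reward would then feed a policy-gradient update (e.g. PPO or GRPO)
# in place of a reward-model score.
print(verifiable_reward("... so the result is \\boxed{42}", "42"))  # 1.0
```

A binary, checkable reward like this is also what makes the overoptimization discussion concrete: a model can only "game" it by producing the right surface form, e.g. by calling a tool to compute the answer rather than reasoning it out.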
00:03
Nathan wins best speaker for the reasoning track
03:40
Nathan Lambert discusses the origin of RLVR, aiming to reproduce industry achievements with different infrastructure
06:19
RLVR is highlighted as a promising, data-efficient method for refining model behavior post-training.
11:44
Preference data is task and model-specific but holds significant potential.
12:47
Leaderboards serve as a critical focusing function for the AI community.
15:42
Tulu 3's style of tool use is noteworthy
20:37
OpenAI's goal of dynamic token allocation based on question difficulty
22:11
LLMs are increasingly offered with search enabled by default, as with Gemini.
29:26
Models need to handle open-endedness and uncertainty in tool use
38:09
Training models to use plan tokens and think more efficiently is more tractable than far-out AI ideas.
49:39
Parallel agents may be more transformative than parallel compute for long-running tasks
54:34
Reward design in RLVR makes overoptimization in math harder, though models may try to cheat by using tools.
1:02:34
A model spec is considered more useful than a constitution for transparency and intentional behavior
1:11:45
Ear-worn AI devices are practical for real-time listening and note-taking.
1:13:05
Talent is cheaper than GPUs, and Meta might spend on top people as it did on VR.
1:15:42
Building the American DeepSeek requires significant resources and architectural innovation.