
The RLVR Revolution — with Nathan Lambert (AI2, Interconnects.ai)

Show Notes

We first had Nathan on to give us his RLHF deep dive when he was joining AI2, and now he's back to help us catch up on the evolution to RLVR (Reinforcement Learning with Verifiable Rewards), first proposed in his Tulu 3 paper. While RLHF remains foundational, RLVR has emerged as a powerful approach for training models on tasks with clear success criteria, using verifiable, objective functions as reward signals. It is particularly useful in domains like math, code correctness, and instruction following. Instead of relying solely on subjective human feedback, RLVR leverages deterministic signals to guide optimization, making it more scalable and potentially more reliable across many domains. However, Nathan notes that RLVR is still evolving rapidly, especially in how it handles tool use and multi-step reasoning.

We also discussed the Tulu model series, a family of instruction-tuned open models developed at AI2. Tulu is designed to be a reproducible, state-of-the-art post-training recipe for the open community. Unlike frontier labs like OpenAI or Anthropic, which rely on vast and often proprietary datasets, Tulu aims to distill and democratize best practices for instruction and preference tuning. We are impressed with how small eval suites, careful task selection, and transparent methodology can rival even the best proprietary models on specific benchmarks.

One of the most fascinating threads is the challenge of incorporating tool use into RL frameworks. Nathan highlights that while you can prompt a model to use tools like search or code execution, getting the model to reliably learn when and how to use them through RL is much harder. This is compounded by the difficulty of designing reward functions that avoid overoptimization, where models learn to "game" the reward signal rather than solve the underlying task. This is particularly problematic in code generation, where models might reward-hack unit tests by inserting pass statements instead of correct logic. As models become more agentic and are expected to plan, retrieve, and act across multiple tools, reward design becomes a critical bottleneck.
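To make the idea concrete, here is a minimal sketch of what RLVR-style verifiable rewards can look like. This is illustrative only, not the Tulu 3 implementation: the function names, the "Answer:" output convention, and the subprocess-based test runner are assumptions. A deterministic function scores each completion against a ground-truth answer or a unit-test suite and returns a binary signal for the RL optimizer to maximize; the second function also shows why weak unit tests invite the reward hacking discussed above.

```python
# Minimal, illustrative sketch of RLVR-style verifiable rewards.
# Not the Tulu 3 implementation: the "Answer:" convention, function names,
# and the subprocess-based test runner are assumptions for illustration.
import re
import subprocess
import sys
import tempfile


def math_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final stated answer matches ground truth."""
    # Assumes the model writes its result as "Answer: <value>".
    answers = re.findall(r"Answer:\s*(.+)", completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0


def code_reward(completion: str, test_code: str, timeout_s: float = 10.0) -> float:
    """Binary reward: 1.0 if the generated code passes the provided unit tests.

    Weak tests make this easy to over-optimize: a policy can learn to emit
    trivial stubs (bare `pass` bodies, hard-coded outputs) that satisfy the
    tests without solving the task, i.e. the reward hacking described above.
    """
    # Write the candidate solution and its tests to a temp file and run them.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

In an RL loop these checks would replace or complement a learned preference model, which is what makes the signal cheap to scale but sensitive to how the verifier is written.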
Other topics covered:
- The evolution from RLHF (Reinforcement Learning from Human Feedback) to RLVR (Reinforcement Learning with Verifiable Rewards)
- The goals and technical architecture of the Tulu models, including the motivation to open-source post-training recipes
- Challenges of tool use in RL: verifiability, reward design, and scaling across domains
- Evaluation frameworks and the role of platforms like Chatbot Arena and emerging "arena"-style benchmarks
- The strategic tension between hybrid reasoning models and unified reasoning models at the frontier
- Planning, abstraction, and calibration in reasoning agents, and why these concepts matter
- The future of open-source AI models, including DeepSeek, OLMo, and the potential for an "American DeepSeek"
- The importance of model personality, character tuning, and the model spec paradigm
- Overoptimization in RL settings and how it manifests in different domains (control tasks, code, math)
- Industry trends in inference-time scaling and model parallelism

Finally, the episode closes with a vision for the future of open-source AI. Nathan has now written up his ambition to build an "American DeepSeek": a fully open, end-to-end reasoning-capable model with transparent training data, tools, and infrastructure. He emphasizes that open-source AI is not just about weights; it's about releasing recipes, evaluations, and methods that lower the barrier for everyone to build and understand cutting-edge systems.

Highlights

In this episode of the Latent Space podcast, Nathan Lambert returns to discuss the evolution of reinforcement learning techniques in AI model training, particularly the shift from RLHF to RLVR. The conversation delves into the technical and strategic implications of these methods, as well as their applications in open-source AI development. Lambert also reflects on the broader challenges of training models to use tools effectively and the importance of reward design in preventing overoptimization.
00:03
Nathan wins best speaker for the reasoning track
03:40
Nathan Lambert discusses the origin of RLVR, aiming to reproduce industry achievements with different infrastructure
06:19
RLVR is highlighted as a promising, data-efficient method for refining model behavior post-training.
11:44
Preference data is task and model-specific but holds significant potential.
12:47
Leaderboards serve as a critical focusing function for the AI community.
15:42
Tulu 3's style of tool use is noteworthy
20:37
OpenAI's goal of dynamic token allocation based on question difficulty
22:11
LLMs are increasingly offered with search as a default service, like Gemini.
29:26
Models need openness and uncertainty handling in tool use
38:09
Training models to use plan tokens and think more efficiently is more tractable than far-out AI ideas.
49:39
Parallel agents may be more transformative than parallel compute for long-running tasks
54:34
Reward design in RLVR makes over-optimization in math harder, though models may try to cheat using tools.
1:02:34
Model spec is considered more useful than a constitution for transparency and intentional behavior
1:11:45
Ear-worn AI devices are practical for real-time listening and note-taking.
1:13:05
Talent is cheaper than GPUs, and Meta might spend on top people as it did on VR.
1:15:42
Building the American DeepSeek requires significant resources and architectural innovation.

Chapters

Welcome and Guest Introduction
00:00
Tulu, OVR, and the RLVR Journey
01:18
Industry Approaches to Post-Training and Preference Data
03:40
Understanding RLVR and Its Impact
06:08
Agents, Tool Use, and Training Environments
06:18
Open Data, Human Feedback, and Benchmarking
10:34
Chatbot Arena, Sycophancy, and Evaluation Platforms
12:44
RLHF vs RLVR: Books, Algorithms, and Future Directions
15:42
Frontier Models: Reasoning, Hybrid Models, and Data
17:54
Search, Retrieval, and Emerging Model Capabilities
22:11
Tool Use, Curriculum, and Model Training Challenges
29:23
Skills, Planning, and Abstraction in Agent Models
38:06
Parallelism, Verifiers, and Scaling Approaches
46:50
Overoptimization and Reward Design in RL
54:33
Open Models, Personalization, and the Model Spec
1:02:27
Open Model Ecosystem and Infrastructure
1:06:50
Meta, Hardware, and the Future of AI Competition
1:13:05
Building an Open DeepSeek and Closing Thoughts
1:15:42

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI.

swyx: Hello, hello, and we're excited to welcome back Nathan Lambert from AI2. Welcome.

Nathan Lambert: ...