Why Video Agent models are next — Ethan He, xAI Grok Imagine

Shownote

We’re announcing AIEWF speakers this week! Take the AI Engineering Survey! Today’s guest Ethan first joined us for the LS Paper Club as the lead on NVIDIA Cosmos World Model, but then joined xAI and built Grok Imagine in 3 months: He comes back on Latent...

Highlights

Ethan He, who built NVIDIA's Cosmos world model and later led the creation of Grok Imagine at xAI in just three months, shares his journey and insights on the current state and future of video generation. He argues that the most significant advances in video models are now coming from language models and agentic systems, not from improvements in diffusion technology itself.
00:05
Ethan He discusses his background with the Latent Space community.
01:25
Built Grok Imagine from scratch in three months.
03:26
Most model improvements come from fixing small bugs.
16:30
Image models are a cheaper foundation for video models.
20:53
Trade-off between temporal compression and per-frame compression.
22:10
User intention directly generates pixels.
35:20
Training costs are comparable to LLMs, though infrastructure is less efficient.
40:30
Combining approaches enables few-step generation.
45:29
A full world model must be recursive.
48:34
World models are real-time, interactive, long-horizon videos.
58:33
A new video model feature uses up to seven images as conditions for generation.
1:05:56
Built Grok Imagine in three months.
1:09:47
GAN training makes models increasingly realistic.
1:13:12
Video diffusion models are 'dumb' and take instructions literally.
1:27:32
Video agents will unlock production-grade video generation.
1:31:20
Physical AI and robotics will be solved by powerful LLMs.
1:32:48
Most progress now comes from language models.
1:38:09
Most people in his position would not have made that choice.
1:38:43
Core principles for training large models are similar across domains.

Chapters

Introduction
00:00
From NVIDIA Cosmos to xAI
01:25
Building Grok Imagine from Zero to One
03:24
How Image and Video Models Are Trained
10:07
Video Compression, VAEs, and Real-Time Tradeoffs
18:53
Generative UI, Flipbook, and Neural OS
22:10
The Cost of Training Large Video Models
32:10
Distillation, GANs, and Fast Video Inference
37:04
Audio-Video Generation and Grok Imagine 0.9
41:21
What Makes a World Model?
48:34
Reference Videos, Long Context, and Video Memory
55:51
xAI Culture, Research, and First-Principles Building
1:00:11
AI Safety, Watermarking, and Prompt Rewriting
1:09:45
Video Agents and AI-Assisted Creation
1:13:10
Why Language Models Unlock Better Video
1:27:32
Robotics, Physical AI, and Embodied World Models
1:31:15
Why Ethan Left xAI
1:32:38
Self-Managed Context and the Future of LLMs
1:34:16
Ethan’s Career Path and Closing Thoughts
1:38:43

Transcript

swyx: Okay, we're here in the studio with Ethan He, most recently of xAI. Welcome.
Ethan He: Yes, thank you. Glad being here.
swyx: We're also here with Vibhu. You were first coming to us or joining the Latent Space world because you were working on Cosm...