Why Video Agent models are next — Ethan He, xAI Grok Imagine
Latent Space: The AI Engineer Podcast
2 DAYS AGO
Ethan He, who built NVIDIA's Cosmos world model and later led the creation of Grok Imagine at xAI in just three months, shares his journey and insights on the current state and future of video generation. He argues that the most significant advances in video models are now coming from language models and agentic systems, not from improvements in diffusion technology itself.
He details the process of building frontier image and video systems, from synthetic data generation and VAE compression to the hidden costs of storage and I/O. He emphasizes that iteration speed and fixing small bugs in data pipelines often yield larger gains than new algorithms. A central thesis is that video models derive their intelligence primarily from LLMs, with prompt rewriting and agentic orchestration driving quality improvements. He predicts that the next frontier is not a better video model but a video agent: a system that can plan, generate, edit, and iterate across a creative task using language models as the core reasoning engine. He also explores the future of generative UI, where diffusion models could replace traditional frontends, and defines world models as real-time, interactive, long-horizon video systems. Ethan left xAI to focus on LLM research, believing the bottleneck for video models is now the language and agent component.
00:05
Ethan He discusses his background with the Latent Space community.
01:25
Built Grok Imagine from scratch in three months
03:26
Most model improvements come from fixing small bugs.
16:30
Image models are a cheaper foundation for video models.
20:53
Trade-off between temporal compression and per-frame compression.
22:10
User intention directly generates pixels
35:20
Training costs are comparable to LLMs, though infrastructure is less efficient.
40:30
Combining approaches enables few-step generation.
45:29
A full world model must be recursive.
48:34
World models are real-time, interactive, long-horizon videos.
58:33
A new video model feature uses up to seven images as conditions for generation.
1:05:56
Built Grok Imagine in three months
1:09:47
GAN training makes models increasingly realistic
1:13:12
Video diffusion models are 'dumb' and take instructions literally
1:27:32
Video agents will unlock production-grade video generation
1:31:20
Physical AI and robotics will be solved by powerful LLMs.
1:32:48
Most progress now comes from language models
1:38:09
Most people in his position would not have made that choice
1:38:43
Core principles for training large models are similar across domains