Why Video Agent models are next — Ethan He, xAI Grok Imagine
Latent Space: The AI Engineer Podcast
2 DAYS AGO

Ethan He, a former lead on NVIDIA's Cosmos world model and the builder of xAI's Grok Imagine, shares his journey from scaling video models to focusing on language models. He argues that the next frontier for video generation is not better diffusion models, but video agents that leverage LLMs for planning, editing, and orchestration, much like the evolution of AI coding.
Ethan details building Grok Imagine from scratch in three months, emphasizing that fast iteration and fixing small data pipeline bugs drove more gains than new algorithms. He explains that video models are bootstrapped from image models and that their intelligence primarily comes from language models acting as prompt rewriters. The high cost of training video models is driven by massive storage and I/O needs, comparable to LLMs. He defines a world model as a real-time, interactive, long-horizon video, and sees video agents—using LLMs to call generative and traditional tools—as the next major trend. Ethan left xAI to focus on LLM research, predicting that language models will soon manage their own context, and that physical AI may be solved by powerful video models controlling robots as tools.
00:05
Ethan He discusses his work on NVIDIA Cosmos and the Latent Space Paper Club
01:25
Moved from NVIDIA to xAI for more compute.
03:26
Many improvements come from fixing small bugs in data and training pipelines
16:30
Video models are bootstrapped from image models.
20:53
Per-frame compression enables real-time interactivity.
22:10
Generative UI replaces traditional coding.
32:13
Storing a billion videos costs hundreds of thousands of dollars per month.
40:30
GANs use a discriminator to judge image realism.
41:21
Modality alignment is the main difficulty.
48:34
World models are real-time, interactive, long-horizon videos.
58:33
Reference-to-video is an intermediate solution to the long-context problem
1:05:56
Building Grok Imagine in three months
1:11:47
AI content is harder to detect by eye
1:13:12
Video diffusion models are 'dumb' and take instructions literally
1:27:32
Video agents will be a major trend
1:31:20
Physical AI will be solved by powerful video models.
1:32:48
Language models now drive the most impactful advances.
1:34:23
Models will self-modify their harnesses at test time.
1:38:43
Switching ML subfields is easier than perceived