Why Video Agent models are next — Ethan He, xAI Grok Imagine

Show notes

We’re announcing AIEWF speakers this week! Take the AI Engineering Survey! Today’s guest Ethan first joined us for the LS Paper Club as the lead on NVIDIA Cosmos World Model, but then joined xAI and built Grok Imagine in 3 months: He comes back on Latent...

Highlights

Ethan He, a former lead on NVIDIA's Cosmos world model and the builder of xAI's Grok Imagine, shares his journey from scaling video models to focusing on language models. He argues that the next frontier for video generation is not better diffusion models, but video agents that leverage LLMs for planning, editing, and orchestration, much like the evolution of AI coding.
00:05
Ethan He discusses his work on NVIDIA Cosmos and the Latent Space Paper Club.
01:25
Ethan moved from NVIDIA to xAI for access to more compute.
03:26
Many improvements come from fixing small bugs in data and training pipelines.
16:30
Video models are bootstrapped from image models.
20:53
Per-frame compression enables real-time interactivity.
22:10
Generative UI replaces traditional coding.
32:13
Storing a billion videos costs hundreds of thousands of dollars per month.
40:30
GANs use a discriminator to judge image realism.
41:21
Modality alignment is the main difficulty.
48:34
World models are real-time, interactive, long-horizon videos.
58:33
Reference-to-video is an intermediate solution to the long-context problem.
1:05:56
Building Grok Imagine in three months
1:11:47
AI-generated content is becoming harder to detect by eye.
1:13:12
Video diffusion models are 'dumb' and take instructions literally.
1:27:32
Video agents will be a major trend.
1:31:20
Physical AI will be solved by powerful video models.
1:32:48
Language models now drive the most impactful advances.
1:34:23
Models will self-modify their harnesses at test time.
1:38:43
Switching ML subfields is easier than perceived.

Chapters

Introduction
00:00
From NVIDIA Cosmos to xAI
01:25
Building Grok Imagine from Zero to One
03:24
How Image and Video Models Are Trained
10:07
Video Compression, VAEs, and Real-Time Tradeoffs
18:53
Generative UI, Flipbook, and Neural OS
22:10
The Cost of Training Large Video Models
32:10
Distillation, GANs, and Fast Video Inference
37:04
Audio-Video Generation and Grok Imagine 0.9
41:21
What Makes a World Model?
48:34
Reference Videos, Long Context, and Video Memory
55:51
xAI Culture, Research, and First-Principles Building
1:00:11
AI Safety, Watermarking, and Prompt Rewriting
1:09:45
Video Agents and AI-Assisted Creation
1:13:10
Why Language Models Unlock Better Video
1:27:32
Robotics, Physical AI, and Embodied World Models
1:31:15
Why Ethan Left xAI
1:32:38
Self-Managed Context and the Future of LLMs
1:34:16
Ethan’s Career Path and Closing Thoughts
1:38:43

Transcript

swyx: Okay, we're here in the studio with Ethan He, most recently of xAI. Welcome.
Ethan He: Yes, thank you. Glad being here.
swyx: We're also here with Vibhu. You were first coming to us or joining the latent space world because you were working on Cosm...