Why Video Agent models are next — Ethan He, xAI Grok Imagine

Latent Space: The AI Engineer Podcast

Jun 01


Ethan He, a former lead on NVIDIA's Cosmos world model and the builder of xAI's Grok Imagine, shares his journey from scaling video models to focusing on language models. He argues that the next frontier for video generation is not better diffusion models, but video agents that leverage LLMs for planning, editing, and orchestration, much like the evolution of AI coding.

Ethan details building Grok Imagine from scratch in three months, emphasizing that fast iteration and fixing small data pipeline bugs drove more gains than new algorithms. He explains that video models are bootstrapped from image models and that their intelligence primarily comes from language models acting as prompt rewriters. The high cost of training video models is driven by massive storage and I/O needs, comparable to LLMs. He defines a world model as a real-time, interactive, long-horizon video, and sees video agents, which use LLMs to call generative and traditional tools, as the next major trend. Ethan left xAI to focus on LLM research, predicting that language models will soon manage their own context and that physical AI may be solved by powerful video models controlling robots as tools.