What Comes After ChatGPT? The Mother of ImageNet Predicts The Future

The a16z Show

2025/12/05


In this conversation, two leading AI researchers explore spatial intelligence in artificial systems: how machines can move beyond language to understand and generate physical worlds. Their work challenges conventional approaches in deep learning and opens new pathways for modeling reality.

Fei-Fei Li and Justin Johnson discuss the limitations of current AI architectures, particularly language models' inability to grasp spatial and physical reasoning. They argue that true intelligence requires understanding 3D environments through interaction, not just text-based prediction. Their new model, Marble, generates explorable 3D worlds from text or images, leveraging Gaussian splats for real-time rendering and precise camera control. They emphasize that spatial intelligence—shaped by eons of evolution—is fundamentally different from linguistic intelligence and cannot be reduced to sequences. Transformers, they note, are better understood as set models, with sequence processing being an add-on. This insight informs their vision for world models that simulate physics and eliminate impossible configurations through structured reasoning, paving the way for applications in robotics, design, and simulation.

22:40

Pixels may be a less lossy representation than tokenized text in AI models.

30:33

Marble generates 3D worlds from text or images and allows interactive editing in real time.

52:58

Newtonian laws are unlikely to emerge from LLMs because they operate at a different abstraction level than astrophysical data.

55:43

Transformers are natively models of sets; sequence order comes from positional embeddings.
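
The set-model point can be checked numerically. The following is a minimal sketch in plain Python, not code from the episode or from Marble: a single attention head with random weights, where every name (`self_attention`, `permute`, `pos`, the sizes) is a hypothetical chosen for illustration, and `pos` is a random stand-in for learned positional embeddings. Without positional information, shuffling the input tokens only shuffles the outputs the same way; once position-dependent vectors are added, the order starts to matter.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def permute(rows, perm):
    """Reorder the rows (tokens) of a matrix."""
    return [rows[i] for i in perm]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[p + r for p, r in zip(ra, rb)] for ra, rb in zip(a, b)]


def close(a, b, tol=1e-9):
    """True if two matrices match element-wise within a tolerance."""
    return all(abs(p - r) <= tol for ra, rb in zip(a, b) for p, r in zip(ra, rb))


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


n_tokens, d_model = 5, 4                 # illustrative sizes
x = rand_matrix(n_tokens, d_model)       # token embeddings, no position info
w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))
perm = [1, 2, 3, 4, 0]                   # a fixed, non-identity reordering

# Without positions: permuting the inputs just permutes the outputs.
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(permute(x, perm), w_q, w_k, w_v)
equivariant = close(permute(out, perm), out_perm)

# With positions: the vector added depends on the slot, not the token,
# so the same reordering now changes the result.
pos = rand_matrix(n_tokens, d_model)     # stand-in for positional embeddings
out_pos = self_attention(add(x, pos), w_q, w_k, w_v)
out_perm_pos = self_attention(add(permute(x, perm), pos), w_q, w_k, w_v)
order_sensitive = not close(permute(out_pos, perm), out_perm_pos)

print(equivariant, order_sensitive)
```

Both flags should come out `True`: the attention layer alone treats its input as an unordered set, and sequence order enters only through the added position vectors.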