Reiner Pope – The math behind how LLMs are trained and served

Dwarkesh Podcast

Shownotes

Did a very different format with Reiner Pope - a blackboard lecture where he walks through how frontier LLMs are trained and served. It’s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and so...

Highlights

In this episode, Reiner Pope delivers an insightful blackboard-style lecture unpacking the hardware-aware realities of training and serving frontier large language models—using first-principles reasoning, public pricing data, and fundamental constraints of modern GPU architecture.
15:16
The FLOPS-to-memory-bandwidth ratio (~300) is a stable hardware invariant that directly determines minimum effective batch size.
32:09
The MoE layer uses a router to dynamically assign tokens to sparse MLP experts, with expert parallelism distributing them across GPUs.
1:00:14
Pipeline parallelism reduces memory per rack but offers diminishing returns with modern hardware.
1:10:08
Pipeline parallelism reduces weight memory but not activation memory; KV savings are offset by in-flight sequences.
1:28:37
Current pre-training token count is about 100 times larger than the Chinchilla-optimal count.
1:47:30
Cache hits are 10x cheaper than cache writes.
2:10:41
Reversible transformer layers save memory via activation rematerialization.
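The first highlight's claim can be sketched with back-of-the-envelope arithmetic. This is an illustrative calculation, not taken from the episode: the chip specs below are hypothetical round numbers, and the argument assumes weight-bound, bf16 decoding with one multiply-add per weight per token.

```python
# Arithmetic-intensity sketch: why the FLOPS-to-bandwidth ratio sets a
# minimum batch size for LLM decoding (hypothetical chip numbers).
#
# During decoding, each generated token streams all model weights from HBM,
# doing ~2 FLOPs per weight (one multiply, one add). If the chip can do R
# times more FLOPs per second than it can load bytes per second, you need
# roughly R tokens in flight (the batch) to keep compute busy rather than
# waiting on memory.

flops_per_s = 1.0e15       # hypothetical accelerator: 1000 TFLOP/s
hbm_bytes_per_s = 3.3e12   # hypothetical HBM bandwidth: 3.3 TB/s

# FLOPS-to-bandwidth ratio -- the ~300 "hardware invariant" from the talk.
ratio = flops_per_s / hbm_bytes_per_s
print(f"ratio ~ {ratio:.0f}")

# With bf16 weights (2 bytes each) and 2 FLOPs per weight per token, the
# minimum batch size at which compute and memory time break even is:
bytes_per_weight = 2
flops_per_weight_per_token = 2
min_batch = ratio * bytes_per_weight / flops_per_weight_per_token
print(f"minimum effective batch size ~ {min_batch:.0f}")
```

Below this batch size, token cost per request stays roughly constant (memory-bound, so batching is free); above it, serving becomes compute-bound.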

Chapters

How batch size affects token cost and speed
00:00
How MoE models are laid out across GPU racks
32:09
How pipeline parallelism spreads model layers across racks
47:12
Why Ilya said, “As we now know, pipelining is not wise.”
1:03:37
Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
1:18:59
Deducing long context memory costs from API pricing
1:33:02
Convergent evolution between neural nets and cryptography
2:04:02

Transcript

Dwarkesh Patel: Today, I'm interviewing Reiner Pope, who is CEO of MatX, which is a new chip startup. Previously, he was doing TPU architecture and many other things at Google. This is a very different format from my usual interviews. This is going to be a...