scripod.com

Reiner Pope – Chip design from the bottom up

Dwarkesh Podcast
This podcast features a detailed technical discussion on how computer chips, from basic logic gates to advanced AI accelerators, are designed and how they function. The conversation explores the fundamental trade-offs in chip architecture, focusing on the balance between computation and data movement, and compares different processor types including CPUs, GPUs, TPUs, and FPGAs.
The discussion begins with the multiply-accumulate (MAC) operation, the core primitive of AI chips, which is built from logic gates such as AND gates and full adders. A key insight is that moving data costs far more than computing on it, which motivates architectures such as systolic arrays that reduce communication overhead by keeping data stationary for longer. The podcast contrasts memory models: CPUs use caches with non-deterministic latency, while TPUs use software-controlled scratchpads with deterministic access. CPU cores are large because of complex features such as branch prediction, which are needed to sustain high clock speeds, whereas GPUs strip these out to make room for more compute units. FPGAs offer reprogrammability but are about 10x slower than ASICs. The brain is also compared to chips: it has unstructured sparsity and co-located memory, but operates at much slower speeds. Finally, GPUs are described as many small, identical compute units tiled across a chip, while TPUs have fewer, larger matrix units; the trade-off is data movement bandwidth versus amortizing register costs.
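To make the MAC description concrete, here is a minimal Python sketch of an unsigned multiply-accumulate built only from AND gates and full adders. The bit widths, the function names, and the naive ripple-carry structure are assumptions chosen for readability; they are not taken from the podcast, and real hardware typically uses faster adder arrangements.

```python
def full_adder(a, b, cin):
    """One-bit full adder expressed with XOR, AND, and OR gates."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def to_bits(x, width):
    """Little-endian list of bits for an unsigned integer."""
    return [(x >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add(x_bits, y_bits):
    """Add two equal-width bit vectors with a chain of full adders.

    The final carry is dropped, so the result wraps at the vector width.
    """
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out


def mac(acc, a, b, in_width=4, acc_width=16):
    """Compute acc + a * b for unsigned inputs (widths are illustrative)."""
    a_bits = to_bits(a, in_width)
    b_bits = to_bits(b, in_width)
    acc_bits = to_bits(acc, acc_width)
    for i, b_bit in enumerate(b_bits):
        # Partial product row: AND every bit of a with one bit of b,
        # shifted left by i positions.
        row = [0] * acc_width
        for j, a_bit in enumerate(a_bits):
            row[i + j] = a_bit & b_bit
        # Accumulate the row into the running sum with full adders.
        acc_bits = ripple_add(acc_bits, row)
    return from_bits(acc_bits)
```

For example, `mac(10, 3, 5)` accumulates the product 3 × 5 into a running total of 10. The AND gates form the partial products and the full adders sum them, which is the gate-level structure the summary refers to.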
09:54
Multiply-accumulate is the core primitive in AI chips.
22:39
Register file costs motivated the shift to tensor cores.
26:10
Quadratic compute growth with only linear communication costs.
48:23
Adding pipeline registers increases clock speed but consumes area.
1:01:25
FPGAs are 10x slower than ASICs.
1:03:32
Deterministic latency is possible in CPUs but avoided for market reasons.
1:10:47
Branch prediction enables high clock speeds.
1:12:05
Slowing a chip to MHz speeds reduces energy linearly, but not by the full 1000x, because idle circuits still draw power.
1:18:54
GPUs have higher data movement bandwidth than TPUs.
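The 26:10 highlight about quadratic compute growth with linear communication cost can be illustrated with a small back-of-the-envelope calculation. The model below is an assumption made for illustration, not a figure from the podcast: an n × n weight-stationary systolic array in which each cycle n activations enter along one edge and n partial sums leave along another, while every cell performs one MAC.

```python
def systolic_array_costs(n):
    """Per-cycle compute and edge traffic for an n x n array.

    Assumed model (hypothetical): weights stay fixed in the cells,
    n activations flow in and n partial sums flow out each cycle.
    """
    macs_per_cycle = n * n            # one MAC per cell: grows quadratically
    edge_values_per_cycle = 2 * n     # inputs plus outputs: grows linearly
    ratio = macs_per_cycle / edge_values_per_cycle
    return macs_per_cycle, edge_values_per_cycle, ratio


if __name__ == "__main__":
    for n in (8, 32, 128):
        macs, traffic, ratio = systolic_array_costs(n)
        print(f"n={n:4d}  MACs/cycle={macs:6d}  edge values/cycle={traffic:4d}  "
              f"MACs per value moved={ratio:.1f}")
```

Under this model the number of MACs per value moved is n / 2, so doubling the array side doubles the compute obtained for each unit of data movement.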