Eric Jang – Building AlphaGo from scratch
Dwarkesh Podcast
22 HOURS AGO
In this episode, Eric Jang revisits AlphaGo not as a historical artifact, but as a pedagogical and architectural blueprint for understanding intelligence—particularly how search, learning from experience, and self-play interact to solve problems with vast combinatorial spaces.
Jang walks through building AlphaGo from scratch using modern tools, emphasizing Monte Carlo Tree Search (MCTS) as a solution to Go’s exponential complexity—guiding exploration via neural policy and value networks while sidestepping the credit assignment problem that plagues naive RL. Unlike LLMs trained with high-variance policy gradients over long token sequences, AlphaGo’s MCTS provides precise, move-level training targets, enabling efficient distillation of search into neural weights. He contrasts on-policy self-play with off-policy methods, noting AlphaGo Zero’s replay buffer strategically samples near-optimal states to avoid compounding errors. The discussion extends to why MCTS doesn’t translate directly to language modeling—due to lack of well-defined value estimation and deterministic outcomes—and highlights how supervised pretraining with soft targets, not raw RL, underpins stable early learning. Finally, Jang reflects on automating AI research: while LLMs now handle implementation and hyperparameter tuning, selecting high-leverage questions and escaping dead ends remains uniquely human—for now.
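The "precise, move-level training targets" mentioned above come from the root visit counts of the search: in AlphaGo Zero, the network's policy head is distilled toward a soft target proportional to N(s, a)^(1/τ). A minimal sketch of that conversion (function name and example counts are illustrative, not from the episode):

```python
def mcts_policy_target(visit_counts, temperature=1.0):
    """Convert root visit counts N(s, a) into a soft policy target,
    pi(a) proportional to N(s, a)^(1/temperature), AlphaGo Zero-style.
    Lower temperature sharpens the target toward the most-visited move."""
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]

# Example: the search visited three legal moves 80, 15, and 5 times.
target = mcts_policy_target([80, 15, 5])  # -> [0.8, 0.15, 0.05]
```

Training the policy with cross-entropy against this target distills the search's deliberation into the network weights, which is why the variance is so much lower than a policy gradient over a long token sequence.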
02:43
Players can intentionally let opponents capture stones to gain greater advantage elsewhere on the board
11:14
AlphaGo's breakthrough was using neural nets to make the search problem tractable
54:07
MCTS recursively improves its own neural predictions by updating node values and visit counts through backup
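The backup step this chapter describes can be sketched in a few lines: a leaf evaluation is propagated up the path from leaf to root, incrementing visit counts and updating each node's running mean value, with the sign flipped at each ply because players alternate in a zero-sum game (the dict-based node representation here is a simplification for illustration):

```python
def backup(path, leaf_value):
    """Propagate a leaf evaluation up the search path root -> ... -> leaf.
    Each node keeps visit count N, total value W, and mean value Q = W / N.
    The value's sign flips each ply (two-player zero-sum game)."""
    value = leaf_value
    for node in reversed(path):
        node["N"] += 1
        node["W"] += value
        node["Q"] = node["W"] / node["N"]
        value = -value  # the parent sees this outcome from the opponent's side

# Simulated path root -> child -> leaf, all nodes initially unvisited.
path = [{"N": 0, "W": 0.0, "Q": 0.0} for _ in range(3)]
backup(path, leaf_value=1.0)  # the leaf's player is winning
```

After this single backup the leaf's Q is +1.0, its parent's is -1.0, and the root's is +1.0; as visits accumulate, these node statistics steer future selection toward stronger moves, which is the sense in which the search improves on the raw neural predictions.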
1:17:32
Neural networks amortize computation to solve NP-hard problems, challenging traditional hardness assumptions
1:42:24
MCTS and Q-learning share a recursive dynamic programming property that enables value estimation without explicit search.
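The shared recursion is the Bellman backup: a state-action value is bootstrapped from the estimated value of its successor, just as an MCTS parent's value is refreshed from its children. A minimal tabular Q-learning step illustrating that recursive update (the names and hyperparameters here are illustrative defaults, not from the episode):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: bootstrap Q(s, a) from the best
    estimated action value at the next state -- the same recursive
    dynamic-programming structure as an MCTS backup, but without
    an explicit search tree."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (td_target - old)

Q = {}
actions = ["a", "b"]
# A reward of 1.0 moves Q(s0, a) a step of size alpha toward the target.
q_learning_update(Q, s="s0", a="a", r=1.0, s_next="s1", actions=actions)
```

The difference is where the recursion runs: Q-learning amortizes it into a table (or network) across many experienced transitions, while MCTS unrolls it explicitly at decision time.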
2:00:02
Strong initialization against KataGo reduces the need for architectural tricks and auxiliary supervision objectives
2:07:23
MCTS relabeling replaces target network computation and has a stabilizing effect while better saturating the GPU
2:21:33
In local minima where learning signals go flat, the win rate of the MCTS-guided policy against the raw network provides a clean supervision signal
2:25:22
Mythos-class models and Go-inspired RL environments offer promising paths toward verifiable AI self-improvement