
The Utility of Interpretability — Emmanuel Amiesen

Show Notes

Emmanuel Amiesen is lead author of “Circuit Tracing: Revealing Computational Graphs in Language Models” (https://transformer-circuits.pub/2025/attribution-graphs/methods.html), part of a duo of mechanistic interpretability papers that Anthropic published in March (alongside https://transformer-circuits.pub/2025/attribution-graphs/biology.html). We recorded the initial conversation a month ago, but held off publishing until the open-source tooling for the graph generation discussed in this work was released last week: https://www.anthropic.com/research/open-source-circuit-tracing

This is a two-part episode: an intro covering the open-source release, then a deeper dive into the paper, with guest host Vibhu Sapra (https://x.com/vibhuuuus) and Mochi the MechInterp Pomsky (https://x.com/mochipomsky). Thanks to Vibhu for making this episode happen!

While the original blog post contained some fantastic guided visualizations (which we discuss at the end of this pod!), with the notebook and Neuronpedia visualization (https://www.neuronpedia.org/gemma-2-2b/graph) released this week, you can now explore on your own with Neuronpedia, as we show you in the video version of this pod.

Highlights

This podcast episode delves into the work of Emmanuel Amiesen, lead author of Anthropic's recent papers on circuit tracing and mechanistic interpretability. The discussion is split into two parts: an introduction to the open-source release of the circuit-tracing tools, and a deeper exploration of the research behind them. With guest host Vibhu Sapra and Mochi the MechInterp Pomsky, the episode highlights how these tools let users explore and understand the inner workings of language models like Gemma 2 2B.
01:04
A recent release allows anyone to explore model computation in open-source models.
08:39
Base models are trained for next-token prediction, not chat.
13:01
Notebooks can be run on Google Colab without an expensive GPU.
19:14
Using a model to analyze words related to pomskies and trace model outputs.
24:19
Emmanuel Amiesen shares his personal journey into MechInterp
28:23
Interpretability is easier to transition into as it doesn't require large-scale compute.
34:30
Language models use superposition to pack more information than vision models.
37:05
Features are represented as directions in multi-dimensional space.
42:02
Golden Gate Claude was created by amplifying a specific feature's activation in the model.
55:17
Induction heads are pairs of attention heads that enable text repetition and smart copying in language models.
59:25
Changing an intermediate feature proves reasoning over memorization.
1:17:40
Models plan well in advance and use backward planning to influence sentence structure.
1:33:01
Parallel circuits in models can lead to conflicting interpretations.
1:40:17
Publishing interpretability research involves balancing benefits and risks.
1:51:37
There are many ideas to try on smaller models in mechanistic interpretability.
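The "features are directions" idea from the highlights above can be sketched in a few lines: a feature corresponds to a unit vector in the model's activation space, and how strongly it fires on an input is the projection of that input's activation vector onto the direction. This is a toy illustration with made-up vectors, not Anthropic's actual tooling:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy hidden dimension

# A hypothetical "feature" is a unit-norm direction in activation space.
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)

# A made-up model activation (e.g. one token's residual-stream vector),
# constructed to mostly point along the feature direction plus some noise.
activation = 2.5 * feature_dir + 0.1 * rng.normal(size=d_model)

# The feature's activation strength is the dot product with its direction.
strength = float(activation @ feature_dir)
print(strength)  # close to 2.5, since we built the activation that way
```

Superposition, discussed around 34:30, is the observation that models pack more such directions into the space than it has dimensions, so the directions cannot all be orthogonal and features interfere slightly with each other.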

Chapters

Intro & Guest Introductions
00:00
Anthropic's Circuit Tracing Release
01:00
Exploring Circuit Tracing Tools & Demos
06:11
Model Behaviors and User Experiments
13:01
Behind the Research: Team and Community
17:02
Main Episode Start: Mech Interp Backgrounds
24:19
Getting Into Mech Interp Research
25:56
History and Foundations of Mech Interp
31:52
Core Concepts: Superposition & Features
37:05
Applications & Interventions in Models
39:54
Challenges & Open Questions in Interpretability
45:59
Understanding Model Mechanisms: Circuits & Reasoning
57:15
Model Planning, Reasoning, and Attribution Graphs
1:04:24
Faithfulness, Deception, and Parallel Circuits
1:30:52
Publishing Risks, Open Research, and Visualization
1:40:16
Barriers, Vision, and Call to Action
1:49:33

Transcript

swyx: All right, we are actually going to record this as an intro to the main episode. But here we have my trusty co-host, guest host, I guess, Vibhu, as well as Emmanuel from Anthropic. We're going to talk about the circuit tracing stuff and all the interp...