scripod.com

The Utility of Interpretability — Emmanuel Ameisen

This podcast episode delves into the work of Emmanuel Ameisen, lead author of Anthropic's recent papers on circuit tracing and mechanistic interpretability. The discussion is split into two parts: an introduction to the open-source release of circuit tracing tools, and a deeper exploration of the research behind them. With guest host Vibhu Sapra and Mochi the MechInterp Pomsky, the episode highlights how these tools let users explore and understand the inner workings of language models like Gemma 2 2B.
The podcast explores the significance of circuit tracing for model interpretability, focusing on Anthropic's open-source release, which lets users experiment with pre-computed graphs and extend the methods to other models. It emphasizes practical applications such as multi-hop reasoning and interventions on model features, exemplified by the Golden Gate Bridge feature. Challenges include understanding superposition, where models pack more features than they have dimensions, while advances in visualization tools make these concepts more accessible. The conversation also addresses the importance of collaboration and community contributions in advancing mechanistic interpretability, encouraging participation from researchers outside major labs. Finally, it discusses the balance between transparency and risk in publishing such research, highlighting the potential to improve model behavior and reduce biases through a deeper understanding of internal mechanisms.
01:04
A recent release allows anyone to explore model computation in open-source models.
08:39
Base models are trained for next-token prediction, not chat.
13:01
Notebooks can be run on Google Colab without an expensive GPU.
19:14
Using the model to analyze words related to pomskies and tracing how its outputs are produced.
24:19
Emmanuel Ameisen shares his personal journey into MechInterp.
28:23
Interpretability is easier to transition into as it doesn't require large-scale compute.
34:30
Language models use superposition to pack more information than vision models.
37:05
Features are represented as directions in multi-dimensional space.
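The two chapters above describe features as directions in activation space, with superposition packing more features than the space has dimensions. A minimal numpy sketch of the idea (toy sizes and random directions, not Gemma's actual features): because the directions cannot all be orthogonal, reading one feature back picks up small interference from the others.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 4, 12  # more features than dimensions

# Random unit directions: with 12 features in 4 dimensions they can
# only be *nearly* orthogonal -- this is superposition.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 3 is active.
x = np.zeros(n_features)
x[3] = 1.0
activation = x @ W  # the d_model-dimensional activation vector

# Read each feature back by dot product: feature 3 dominates, the
# rest show small interference from the non-orthogonal directions.
readout = W @ activation
print(readout.round(2))
assert readout.argmax() == 3
```

Sparsity is what makes this workable: as long as few features are active at once, the interference terms stay small relative to the true signal.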
42:02
Golden Gate Claude was created by amplifying a feature's activations in the model, without retraining its weights.
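A minimal sketch of the activation-steering idea behind Golden Gate Claude (toy numbers, not Claude or its actual feature): at inference time, a scaled feature direction is added to a layer's activations, while the frozen weights are never touched.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))           # a frozen toy "layer"
feature_dir = rng.normal(size=d)      # stand-in for a learned feature
feature_dir /= np.linalg.norm(feature_dir)

def layer(x, steer_scale=0.0):
    acts = x @ W                      # normal forward pass
    # Steering: push activations along the feature direction.
    return acts + steer_scale * feature_dir

x = rng.normal(size=d)
delta = layer(x, steer_scale=10.0) - layer(x)
print(np.allclose(delta, 10.0 * feature_dir))  # True: only the feature moved
```

The intervention is purely additive in activation space, which is why it can be dialed up or down per request without producing a new model.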
55:17
Induction heads are a pair of attention heads that enable text repetition and smart copying in NLP models.
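The algorithm that induction heads implement can be written down directly (pure Python, not attention): when the current token has appeared earlier in the sequence, predict the token that followed that earlier occurrence.

```python
def induction_predict(tokens):
    """Toy induction rule: find the last earlier occurrence of the
    current token and copy whatever came right after it."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards for a match
        if tokens[i] == last:
            return tokens[i + 1]              # copy the token that followed
    return None                               # no earlier occurrence

seq = ["Mochi", "the", "Pomsky", "barked", ";", "Mochi", "the"]
print(induction_predict(seq))  # -> "Pomsky"
```

In a transformer this takes two heads working together: one attends to the previous token, the other uses that signal to attend from the current token to the position just after the earlier match.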
59:25
Changing an intermediate feature provides evidence of multi-hop reasoning rather than memorization.
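The logic of this intervention can be sketched with a toy analogue (hypothetical lookup tables, not the model's actual circuit): a two-hop question is answered through an explicit intermediate variable, and overriding that variable changes the answer in the way genuine reasoning would, whereas a memorized question-to-answer mapping would have no intermediate to intervene on.

```python
# Hop 1: city -> state; hop 2: state -> capital (toy data).
state_of = {"Dallas": "Texas", "Oakland": "California"}
capital_of = {"Texas": "Austin", "California": "Sacramento"}

def capital_of_state_containing(city, override_state=None):
    state = state_of[city]            # intermediate "state" representation
    if override_state is not None:
        state = override_state        # intervene on the intermediate step
    return capital_of[state]

print(capital_of_state_containing("Dallas"))                               # Austin
print(capital_of_state_containing("Dallas", override_state="California"))  # Sacramento
```

The second call mirrors the experiment discussed in the episode: swapping the intermediate representation redirects the downstream answer, showing the output flows through the intermediate step.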
1:17:40
Models plan well in advance and use backward planning to influence sentence structure.
1:33:01
Parallel circuits in models can lead to conflicting interpretations.
1:40:17
Publishing interpretability research involves balancing benefits and risks.
1:51:37
There are many ideas to try on smaller models in mechanistic interpretability.