Intro to Anthropic Research on “Circuit Tracing: Revealing Computational Graphs in Language Models”
Emmanuel Ameisen et al., Circuit Tracing: Revealing Computational Graphs in Language Models, available at https://transformer-circuits.pub/2025/attribution-graphs/methods.html.
This is a synthetic introductory class (no expertise assumed!) generated using enTalkenator (selecting Claude 3.7 Sonnet).
Abstract: We introduce a method to uncover mechanisms underlying behaviors of language models. We produce graph descriptions of the model’s computation on prompts of interest by tracing individual computational steps in a “replacement model”. This replacement model substitutes a more interpretable component (here, a “cross-layer transcoder”) for parts of the underlying model (here, the multi-layer perceptrons) that it is trained to approximate. We develop a suite of visualization and validation tools we use to investigate these “attribution graphs” supporting simple behaviors of an 18-layer language model, and lay the groundwork for a companion paper applying these methods to a frontier model, Claude 3.5 Haiku.
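To make the abstract's core idea concrete, here is a minimal toy sketch of the "replacement model" substitution: a small MLP block stands in for the underlying model's component, and a ReLU-feature "transcoder" is fit to approximate its output, then swapped in. All names, sizes, and the fitting procedure here are illustrative assumptions; the paper's cross-layer transcoders are trained jointly across layers with a sparsity penalty, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat = 8, 32  # toy residual-stream width and transcoder feature count (assumptions)

# Stand-in for one MLP block of the underlying model (random toy weights).
W_in = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
W_out = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)

def mlp(x):
    return np.maximum(x @ W_in, 0.0) @ W_out

# Toy "transcoder": ReLU features whose decoder is fit by least squares to
# reproduce the MLP's output. (Simplified: the encoder is random here, and
# there is no sparsity penalty, unlike the trained transcoders in the paper.)
W_enc = rng.normal(size=(d, n_feat)) / np.sqrt(d)
X = rng.normal(size=(4096, d))          # sample activations to fit on
F = np.maximum(X @ W_enc, 0.0)          # feature activations
W_dec, *_ = np.linalg.lstsq(F, mlp(X), rcond=None)

def transcoder(x):
    return np.maximum(x @ W_enc, 0.0) @ W_dec

# The replacement model calls transcoder() wherever the original called mlp();
# computational steps are then traced through these feature activations.
x = rng.normal(size=(1, d))
err = np.linalg.norm(mlp(x) - transcoder(x)) / np.linalg.norm(mlp(x))
print(f"relative approximation error: {err:.3f}")
```

The point of the substitution is that the transcoder's features are the units through which attribution graphs are traced, at the cost of the approximation error measured above.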