Academic Workshop on Anthropic Research “Circuit Tracing”

Emmanuel Ameisen et al. (many authors), Circuit Tracing: Revealing Computational Graphs in Language Models, available at https://transformer-circuits.pub/2025/attribution-graphs/methods.html.

This is a synthetic academic workshop generated with enTalkenator (using Claude 3.7 Sonnet).

Abstract: We introduce a method to uncover mechanisms underlying behaviors of language models. We produce graph descriptions of the model’s computation on prompts of interest by tracing individual computational steps in a “replacement model”. This replacement model substitutes a more interpretable component (here, a “cross-layer transcoder”) for parts of the underlying model (here, the multi-layer perceptrons) that it is trained to approximate. We develop a suite of visualization and validation tools we use to investigate these “attribution graphs” supporting simple behaviors of an 18-layer language model, and lay the groundwork for a companion paper applying these methods to a frontier model, Claude 3.5 Haiku.
