Mechanistic interpretability
Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of neural networks by identifying and analyzing the mechanisms underlying their computations. The approach seeks to analyze neural networks much as binary computer programs can be reverse-engineered to understand their functions.[1]
History
The term mechanistic interpretability was coined by Chris Olah.[2] Early work combined techniques such as feature visualization, dimensionality reduction, and attribution with human-computer interaction methods to analyze vision models such as Inception v1.[3] Later developments include the 2020 paper Zoom In: An Introduction to Circuits, which proposed an analogy between neural network components and biological neural circuits.[4]
In recent years, mechanistic interpretability has gained prominence with the study of large language models (LLMs) and transformer architectures. The field has expanded rapidly, with dedicated venues such as the ICML 2024 Mechanistic Interpretability Workshop.[5]
Key concepts
Mechanistic interpretability aims to identify structures, circuits or algorithms encoded in the weights of machine learning models.[6] This contrasts with earlier interpretability methods that focused primarily on input-output explanations.[7]
Multiple definitions of the term exist, from narrow technical definitions (the study of causal mechanisms inside neural networks) to broader cultural definitions encompassing various AI interpretability research.[2]
Linear representation hypothesis
This hypothesis suggests that high-level concepts are represented as linear directions in the activation space of neural networks. Empirical evidence from word embeddings and more recent studies of large language models supports this view, although it does not hold universally.[8][9]
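A minimal sketch in Python (NumPy) illustrates the idea, using synthetic activations in place of real model hidden states: a concept direction is estimated as a difference of mean activations, and new activations are scored by projection onto it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical concept direction the model is assumed to encode linearly.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

def fake_activations(has_concept, n):
    """Synthetic stand-ins for hidden states: noise, plus the direction when the concept is present."""
    noise = rng.normal(size=(n, d_model))
    return noise + (2.0 * true_direction if has_concept else 0.0)

pos = fake_activations(True, 200)
neg = fake_activations(False, 200)

# Difference-of-means estimate of the concept's linear direction.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Under the hypothesis, projecting activations onto this direction separates
# examples with and without the concept.
print("mean projection, concept present:", float((pos @ direction).mean()))
print("mean projection, concept absent: ", float((neg @ direction).mean()))
```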
Superposition
Superposition describes how neural networks may represent many unrelated features within the same neurons or subspaces, leading to densely packed and overlapping feature representations.[10]
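A toy sketch in Python (NumPy), under the simplifying assumption that feature directions are random unit vectors, shows how more features than dimensions can coexist with only small interference when features are sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 32, 128  # more features than dimensions

# Random unit vectors in a high-dimensional space are nearly orthogonal.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Encode a sparse feature vector (only a few features active at once).
features = np.zeros(n_features)
active = [3, 57, 101]
features[active] = 1.0
hidden = features @ W  # shape (d_model,): the superposed representation

# Decode by projecting the hidden state back onto each feature direction.
recovered = W @ hidden
print("active features recover ~1:", recovered[active].round(2))
print("max interference on inactive features:",
      round(float(np.abs(np.delete(recovered, active)).max()), 3))
```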
Methods
Probing
Probing involves training simple classifiers on a neural network's internal activations to test whether particular features are encoded in them.[1]
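A minimal probing sketch in Python (scikit-learn), with synthetic activations and labels standing in for hidden states collected from a real network, trains a linear probe and reports its accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 64, 1000

# Placeholder "activations": noise, plus one direction that carries the label.
signal = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model)) + np.outer(labels, signal)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0)

# A simple linear probe: high accuracy shows the feature is linearly decodable
# from the activations, though not that the model actually uses it.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```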
Causal interventions
Mechanistic interpretability employs causal methods to understand how internal model components influence outputs, often using formal tools from causality theory.[11] A common technique is activation patching, in which intermediate activations from one forward pass are substituted into another to measure their causal effect on the output.
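A minimal activation-patching sketch in Python (PyTorch): the two-layer model and random inputs are toy placeholders; in practice the hook would target a specific layer of a trained network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

clean_input = torch.randn(1, 16)
corrupted_input = torch.randn(1, 16)

# 1. Cache the hidden activation from the "clean" run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

layer = model[1]  # intervene on the ReLU output
handle = layer.register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, but patch in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["hidden"]  # returning a value replaces the layer's output

handle = layer.register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
handle.remove()

corrupted_logits = model(corrupted_input)

# If patching this activation moves the output back toward the clean run,
# the patched component is causally implicated in the behaviour under study.
print("clean:    ", clean_logits)
print("corrupted:", corrupted_logits)
print("patched:  ", patched_logits)
```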
Sparse decomposition
Methods such as sparse dictionary learning and sparse autoencoders help disentangle complex overlapping features by learning interpretable, sparse representations.[12]
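A minimal sparse-autoencoder sketch in Python (PyTorch): an overcomplete ReLU autoencoder trained with an L1 sparsity penalty on its hidden codes. The "activations" here are random placeholders for hidden states that would normally be collected from a real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict, l1_coeff = 64, 256, 1e-3  # dictionary wider than the input

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

activations = torch.randn(4096, d_model)  # placeholder for collected activations

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (128,))]
    codes = torch.relu(encoder(batch))   # sparse, non-negative feature codes
    recon = decoder(codes)               # reconstruction of the activations
    loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each decoder column is a candidate "feature direction"; each code measures
# how strongly that feature fires on a given activation.
print("final loss:", loss.item())
```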
Applications and significance
Mechanistic interpretability plays an important role in AI safety, where it is used to understand and verify the behavior of increasingly complex AI systems, helping to identify potential risks and improve transparency.[13]
References
- ^ a b Bereska, Leonard (2024). "Mechanistic Interpretability for AI Safety -- A Review". TMLR. arXiv:2404.14082.
- ^ a b Saphra, Naomi; Wiegreffe, Sarah (2024). Mechanistic?. BlackboxNLP workshop. arXiv:2410.09087.
- ^ Olah, Chris; et al. (2018). "The Building Blocks of Interpretability". Distill. 3 (3). doi:10.23915/distill.00010.
- ^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020). "Zoom In: An Introduction to Circuits". Distill. 5 (3). doi:10.23915/distill.00024.001.
- ^ "ICML 2024 Mechanistic Interpretability Workshop". icml2024mi.pages.dev. Retrieved 2025-05-12.
- ^ "Towards automated circuit discovery for mechanistic interpretability". NeurIPS: 16318–16352. 2023.
- ^ Kästner, Lena; Crook, Barnaby (2024). "Explaining AI through mechanistic interpretability". European Journal for Philosophy of Science. 14 (4): 52. doi:10.1007/s13194-024-00614-4.
- ^ "Linguistic Regularities in Continuous Space Word Representations". NAACL: 746–751. 2013.
- ^ Park, Kiho (2024). "The Linear Representation Hypothesis and the Geometry of Large Language Models". ICML. 235: 39643–39666.
- ^ Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Schiefer, Nicholas; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; Olah, Christopher (2022). "Toy Models of Superposition". arXiv:2209.10652 [cs.LG].
- ^ "Investigating Gender Bias in Language Models Using Causal Mediation Analysis". NeurIPS: 12388–12401. 2020. ISBN 978-1-7138-2954-6.
- ^ Cunningham, Hoagy (2024). "Sparse Autoencoders Find Highly Interpretable Features in Language Models".
- ^ Sullivan, Mark (2025-04-22). "This startup wants to reprogram the mind of AI—and just got $50 million to do it". Fast Company. Retrieved 2025-05-12.
Further reading
- Nanda, Neel (2023). "Emergent Linear Representations in World Models of Self-Supervised Sequence Models". BlackboxNLP Workshop: 16–30. doi:10.18653/v1/2023.blackboxnlp-1.2.
- Transformer Circuits Thread: a series of articles from Anthropic on mechanistic interpretability in transformers