This memo focuses on a specific, alignment-adjacent question: how do we rigorously evaluate whether a mechanistic explanation of a transformer behavior is actually correct? The IOI (indirect object identification) circuit work in GPT-2 small is a rare case where this can be tested directly against quantitative criteria.
Three high-signal insights
- The IOI work shows that we can extract a concrete circuit (26 heads across 7 classes) and then test it with causal interventions, turning interpretability into an evaluable hypothesis rather than a narrative.
- Faithfulness, completeness, and minimality reveal different failure modes: a circuit can be faithful but still incomplete, or complete but not minimal, which matters if you want mechanistic explanations that scale to safety analysis.
- The broader transformer-circuits framework motivates this style of decomposition and gives a formal lens for reasoning about how heads compose through the residual stream.
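The three criteria can be made concrete with a toy sketch. Below, each "head" contributes additively to the IOI logit difference; the head names echo the IOI paper's classes, but every name and number here is an illustrative assumption, not a real GPT-2 small measurement. Faithfulness compares F(M) to F(C); completeness checks that knocking out any subset K looks the same in the circuit and the full model; minimality checks that every head earns its place.

```python
import itertools

# Toy additive model of per-head contributions to the IOI logit difference.
# Head names and effect sizes are illustrative assumptions only.
HEAD_EFFECT = {
    "name_mover_9.6": 1.2,
    "name_mover_9.9": 1.0,
    "s_inhibition_8.6": 0.8,
    "backup_10.2": 0.3,
    "outside_head": 0.05,  # a head NOT in the hypothesized circuit
}

def logit_diff(heads):
    """F(.): logit difference with only `heads` active (rest mean-ablated)."""
    return sum(HEAD_EFFECT[h] for h in heads)

model = set(HEAD_EFFECT)
circuit = model - {"outside_head"}

# Faithfulness: |F(M) - F(C)| should be small.
faithfulness_gap = abs(logit_diff(model) - logit_diff(circuit))  # ~0.05

# Completeness: for every subset K of the circuit,
# |F(C \ K) - F(M \ K)| should also be small.
completeness_gap = max(
    abs(logit_diff(circuit - set(K)) - logit_diff(model - set(K)))
    for r in range(len(circuit) + 1)
    for K in itertools.combinations(circuit, r)
)  # ~0.05 in this additive toy

# Minimality: each head v should have a large effect for some subset K,
# i.e. max over K of |F((C \ K) | {v}) - F(C \ K)| is non-negligible.
def minimality_score(v):
    rest = circuit - {v}
    return max(
        abs(logit_diff((rest - set(K)) | {v}) - logit_diff(rest - set(K)))
        for r in range(len(rest) + 1)
        for K in itertools.combinations(rest, r)
    )

minimality = min(minimality_score(v) for v in circuit)  # ~0.3
```

Because the toy is purely additive, the completeness gap is constant across subsets; in a real transformer, backup heads make F(C \ K) and F(M \ K) diverge in subset-dependent ways, which is exactly why the subset-quantified definition matters.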
Open question
Can we design evaluation criteria that remain reliable when the circuit hypothesis includes both attention and MLP components, or when the behavior is diffuse (e.g., in-context learning mechanisms linked to induction heads)?
Why this is useful for alignment
Alignment needs explanations that are more than plausible stories. IOI offers a template for converting interpretability into a falsifiable claim: propose a circuit, measure its causal footprint, and explicitly account for what is missing.
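The "measure its causal footprint" step can be sketched as activation patching: cache a component's activation on a clean input, splice it into a run on a corrupted input, and read off how much of the output it restores. The two-head linear "model" below is a hypothetical stand-in, not GPT-2; it only illustrates the intervention pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a transformer: two "head" outputs sum into a
# residual stream, followed by a scalar readout (a toy "logit difference").
W_head = [rng.normal(size=(4, 4)) for _ in range(2)]
w_readout = rng.normal(size=4)

def forward(x, patch=None):
    """Run the toy model; `patch=(head_index, cached_output)` optionally
    overrides that head's output with a cached activation."""
    resid = np.zeros(4)
    for i, W in enumerate(W_head):
        out = W @ x
        if patch is not None and patch[0] == i:
            out = patch[1]  # the activation-patching intervention
        resid = resid + out
    return float(w_readout @ resid)

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

# Cache head 0 on the clean run, patch it into the corrupted run; the change
# in the readout is head 0's causal footprint on this input pair.
cached = W_head[0] @ x_clean
baseline = forward(x_corrupt)
patched = forward(x_corrupt, patch=(0, cached))
effect = patched - baseline
```

A large `effect` supports including the component in the circuit; a near-zero effect across a distribution of input pairs is the kind of explicit negative evidence that makes the circuit hypothesis falsifiable.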