This memo focuses on a specific, alignment-adjacent question: how do we rigorously evaluate whether a mechanistic explanation of a transformer behavior is actually correct? The IOI (indirect object identification) circuit work in GPT-2 small is a rare case where this can be tested directly against quantitative criteria.
Three high-signal insights
- The IOI work shows that we can extract a concrete circuit (26 heads across 7 classes) and then test it with causal interventions, turning interpretability into an evaluable hypothesis rather than a narrative.
- Faithfulness, completeness, and minimality reveal different failure modes: a circuit can be faithful but still incomplete, or complete but not minimal, which matters if you want mechanistic explanations that scale to safety analysis.
- The broader transformer-circuits framework motivates this style of decomposition and gives a formal lens for reasoning about how heads compose through the residual stream.
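The three criteria can be made concrete with a toy sketch. Below, each "head" contributes additively to the IOI logit difference; the head names echo the IOI paper's classes, but every name and number here is an illustrative assumption, not a real GPT-2 small measurement. Faithfulness compares F(M) to F(C); completeness checks that knocking out any subset K looks the same in the circuit and the full model; minimality checks that every head earns its place.

```python
import itertools

# Toy additive model of per-head contributions to the IOI logit difference.
# Head names and effect sizes are illustrative assumptions only.
HEAD_EFFECT = {
    "name_mover_9.6": 1.2,
    "name_mover_9.9": 1.0,
    "s_inhibition_8.6": 0.8,
    "backup_10.2": 0.3,
    "outside_head": 0.05,  # a head NOT in the hypothesized circuit
}

def logit_diff(heads):
    """F(.): logit difference with only `heads` active (rest mean-ablated)."""
    return sum(HEAD_EFFECT[h] for h in heads)

model = set(HEAD_EFFECT)
circuit = model - {"outside_head"}

# Faithfulness: |F(M) - F(C)| should be small.
faithfulness_gap = abs(logit_diff(model) - logit_diff(circuit))  # ~0.05

# Completeness: for every subset K of the circuit,
# |F(C \ K) - F(M \ K)| should also be small.
completeness_gap = max(
    abs(logit_diff(circuit - set(K)) - logit_diff(model - set(K)))
    for r in range(len(circuit) + 1)
    for K in itertools.combinations(circuit, r)
)  # ~0.05 in this additive toy

# Minimality: each head v should have a large effect for some subset K,
# i.e. max over K of |F((C \ K) | {v}) - F(C \ K)| is non-negligible.
def minimality_score(v):
    rest = circuit - {v}
    return max(
        abs(logit_diff((rest - set(K)) | {v}) - logit_diff(rest - set(K)))
        for r in range(len(rest) + 1)
        for K in itertools.combinations(rest, r)
    )

minimality = min(minimality_score(v) for v in circuit)  # ~0.3
```

Because the toy is purely additive, the completeness gap is constant across subsets; in a real transformer, backup heads make F(C \ K) and F(M \ K) diverge in subset-dependent ways, which is exactly why the subset-quantified definition matters.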
Open question
Can we design evaluation criteria that remain reliable when the circuit hypothesis includes both attention and MLP components, or when the behavior is diffuse (e.g., in-context learning mechanisms linked to induction heads)?
Why this is useful for alignment
Alignment needs explanations that are more than plausible stories. IOI offers a template for converting interpretability into a falsifiable claim: propose a circuit, measure its causal footprint, and explicitly account for what is missing.
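The "measure its causal footprint" step can be sketched as activation patching: cache a component's activation on a clean input, splice it into a run on a corrupted input, and read off how much of the output it restores. The two-head linear "model" below is a hypothetical stand-in, not GPT-2; it only illustrates the intervention pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a transformer: two "head" outputs sum into a
# residual stream, followed by a scalar readout (a toy "logit difference").
W_head = [rng.normal(size=(4, 4)) for _ in range(2)]
w_readout = rng.normal(size=4)

def forward(x, patch=None):
    """Run the toy model; `patch=(head_index, cached_output)` optionally
    overrides that head's output with a cached activation."""
    resid = np.zeros(4)
    for i, W in enumerate(W_head):
        out = W @ x
        if patch is not None and patch[0] == i:
            out = patch[1]  # the activation-patching intervention
        resid = resid + out
    return float(w_readout @ resid)

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

# Cache head 0 on the clean run, patch it into the corrupted run; the change
# in the readout is head 0's causal footprint on this input pair.
cached = W_head[0] @ x_clean
baseline = forward(x_corrupt)
patched = forward(x_corrupt, patch=(0, cached))
effect = patched - baseline
```

A large `effect` supports including the component in the circuit; a near-zero effect across a distribution of input pairs is the kind of explicit negative evidence that makes the circuit hypothesis falsifiable.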