Tibo Vanleke, AI safety researcher

Mechanistic interpretability memo

Evaluating IOI circuits: faithfulness, completeness, minimality

A short memo on what it means for an interpretability explanation to be not just plausible, but causally credible.

February 4, 2026 · 4 min read · IOI circuits

This memo addresses one narrow, alignment-adjacent question: how do we rigorously evaluate whether a mechanistic explanation of a transformer behavior is actually correct? The indirect object identification (IOI) circuit work on GPT-2 small is a rare case where this can be tested directly against quantitative criteria.

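Summary: In IOI, a proposed circuit is not just a story. It is evaluated for faithfulness (does the circuit on its own reproduce the full model's performance on the task), completeness (does it contain every component the model uses for the task), and minimality (does every component in it actually contribute).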
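
To make the three criteria concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the node names, their contributions, and the `score` function are hypothetical stand-ins for the model's average logit difference, and removing a node from a set stands in for knocking it out. The gap formulas follow the general shape of the IOI criteria (compare circuit and model, then compare them again after knocking out subsets). They are not a reproduction of the original procedure.

```python
from itertools import combinations

# Hypothetical toy model: each node adds a fixed amount to the task score.
CONTRIB = {"mover": 2.0, "inhibit": 1.0, "dup": 0.5, "idle": 0.0}
BACKUP_BONUS = 1.5  # hypothetical: "backup" only helps once "mover" is knocked out
MODEL = frozenset(CONTRIB) | {"backup"}


def score(active):
    """Toy stand-in for F: task performance with only `active` nodes intact."""
    total = sum(CONTRIB[n] for n in active if n in CONTRIB)
    if "backup" in active and "mover" not in active:
        total += BACKUP_BONUS
    return total


def subsets(nodes):
    """Yield every subset of `nodes` (exponential; only viable for toy sizes)."""
    nodes = sorted(nodes)
    for r in range(len(nodes) + 1):
        for combo in combinations(nodes, r):
            yield frozenset(combo)


def faithfulness_gap(circuit):
    """|F(M) - F(C)|: small means the circuit alone matches the full model."""
    return abs(score(MODEL) - score(frozenset(circuit)))


def incompleteness(circuit):
    """max over K in C of |F(C \\ K) - F(M \\ K)|: small means nothing is missing."""
    circuit = frozenset(circuit)
    return max(abs(score(circuit - k) - score(MODEL - k)) for k in subsets(circuit))


def minimality_score(circuit, v):
    """max over K in C \\ {v} of |F(C \\ (K + v)) - F(C \\ K)|: large means v matters."""
    circuit = frozenset(circuit)
    rest = circuit - {v}
    return max(abs(score(circuit - k - {v}) - score(circuit - k)) for k in subsets(rest))


circuit = {"mover", "inhibit", "dup"}
print(faithfulness_gap(circuit))                     # 0.0
print(incompleteness(circuit))                       # 1.5
print(minimality_score(circuit | {"idle"}, "idle"))  # 0.0
```

In this toy, the circuit `{mover, inhibit, dup}` has a faithfulness gap of 0 yet an incompleteness of 1.5, because the hypothetical `backup` node only activates once `mover` is knocked out. Adding `idle` leaves both numbers unchanged but gives that node a minimality score of 0, so the enlarged circuit is no longer minimal. The exhaustive subset search is exponential, so for a real circuit it would have to be replaced by sampled or searched subsets.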

Three high-signal insights

  • The IOI work shows that we can extract a concrete circuit (26 attention heads grouped into 7 classes) and then test it with causal interventions, which turns an interpretability story into a hypothesis that can be evaluated.
  • Faithfulness, completeness, and minimality reveal different failure modes: a circuit can be faithful but still incomplete, or complete but not minimal, which matters if you want mechanistic explanations that scale to safety analysis.
  • The broader transformer-circuits framework motivates this style of decomposition and gives a formal lens for reasoning about how heads compose through the residual stream.
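
As a rough illustration of what a causal intervention on heads looks like when their outputs add into the residual stream, here is a minimal pure-Python sketch of mean ablation. All of it is assumed for the example: the head names, the 2-dimensional residual stream, the `READOUT` direction standing in for the logit difference between the indirect object and the subject, and the reference batch used to compute the means. A real experiment would hook the model's activations instead of using hand-written vectors.

```python
from statistics import fmean

# Hypothetical per-head writes to a 2-d residual stream: OUTPUTS[head][example].
OUTPUTS = {
    "h0": [[1.0, 0.0], [3.0, 0.0]],
    "h1": [[0.0, 2.0], [0.0, 2.0]],
}
# Hypothetical reference batch (prompts without the task structure).
REFERENCE = {
    "h0": [[0.0, 0.0], [1.0, 0.0]],
    "h1": [[0.0, 1.0], [0.0, 3.0]],
}
READOUT = [1.0, 0.5]  # hypothetical direction for logit(IO) - logit(S)


def mean_ablate(outputs, reference, ablated):
    """Replace each ablated head's output with its mean over the reference batch."""
    patched = {}
    for head, vecs in outputs.items():
        if head in ablated:
            mean_vec = [fmean(col) for col in zip(*reference[head])]
            patched[head] = [list(mean_vec) for _ in vecs]
        else:
            patched[head] = vecs
    return patched


def logit_diffs(outputs):
    """Sum head outputs into the residual stream, then read out one score per example."""
    n_examples = len(next(iter(outputs.values())))
    diffs = []
    for i in range(n_examples):
        resid = [sum(parts) for parts in zip(*(outputs[h][i] for h in outputs))]
        diffs.append(sum(r * w for r, w in zip(resid, READOUT)))
    return diffs


clean = logit_diffs(OUTPUTS)
ablated = logit_diffs(mean_ablate(OUTPUTS, REFERENCE, {"h0"}))
print(clean)    # [2.0, 4.0]
print(ablated)  # [1.5, 1.5]
```

Because the toy readout is linear in the summed head outputs, the drop from the clean scores to the ablated ones is exactly the contribution that `h0` carried about the specific example. In a real transformer, later heads and MLPs also read from the residual stream, so the effect of an ablation can propagate nonlinearly.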

Open question

Can we design evaluation criteria that remain reliable when the circuit hypothesis includes both attention and MLP components, or when the behavior is diffuse, such as in-context learning mechanisms linked to induction heads?

Why this is useful for alignment

Alignment needs explanations that are more than plausible stories. IOI offers a template for converting interpretability into a falsifiable claim: propose a circuit, measure its causal footprint, and explicitly account for what is missing.

Continue the thread

Interpretability gets more useful when critique is invited early.

If you are thinking about causal evaluation, circuit quality, or how these standards scale beyond toy settings, I would be happy to hear your angle.