
Evaluating IOI circuits: faithfulness, completeness, minimality

February 4, 2026 · 4 min read

This memo zooms in on a specific alignment-adjacent question: how do we rigorously evaluate whether a mechanistic explanation of a transformer behavior is actually correct? The indirect object identification (IOI) circuit in GPT-2 small is a rare case where we can test this directly with quantitative criteria.

Summary: In IOI, a proposed circuit is not just a story. It is evaluated with faithfulness (does the circuit alone reproduce the model's behavior?), completeness (does it contain every component the model actually uses for the task?), and minimality (does every component in it pull its weight?).
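The three criteria can be made concrete with a toy sketch. The snippet below treats a model as a set of components that contribute additively to the IOI logit difference; the component names and numbers are illustrative assumptions, not the actual GPT-2 small heads, and the real evaluation uses mean ablation over a reference distribution rather than this additive stand-in.

```python
# Toy stand-in for a model: each "component" (e.g., an attention head)
# contributes additively to the logit difference between the indirect
# object token and the subject token. Values are hypothetical.
contributions = {
    "name_mover_1": 2.0,
    "name_mover_2": 1.5,
    "s_inhibition": 0.8,
    "noise_head_a": 0.05,
    "noise_head_b": -0.03,
}

def logit_diff(components):
    """Logit difference when only `components` are left unablated."""
    return sum(contributions[c] for c in components)

circuit = {"name_mover_1", "name_mover_2", "s_inhibition"}
full = set(contributions)

# Faithfulness: the circuit alone should recover the full model's
# behavior (ratio close to 1.0 in this toy setup).
faithfulness = logit_diff(circuit) / logit_diff(full)

# Completeness: ablating the circuit should destroy the behavior;
# whatever residual remains is what the hypothesis failed to capture.
residual = logit_diff(full - circuit)

# Minimality: removing any single circuit component should cause a
# measurable drop, otherwise that component is padding.
drops = {c: logit_diff(circuit) - logit_diff(circuit - {c}) for c in circuit}
```

In this toy setting faithfulness comes out near 1 and the residual near 0, which is the signature of a hypothesis that is both faithful and complete; a component whose entry in `drops` is near zero would fail the minimality check.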

Three high-signal insights

- Faithfulness is a causal test, not a descriptive one: ablate everything outside the proposed circuit and check that the logit difference on IOI prompts is preserved.
- Completeness catches sins of omission: if knocking out the circuit still leaves the model partially solving the task, the hypothesis is missing components.
- Minimality catches sins of commission: every node in the circuit should measurably contribute when removed, or it is padding that inflates the explanation.

Open question

Can we design evaluation criteria that remain reliable when the circuit hypothesis includes both attention and MLP components, or when the behavior is diffuse (e.g., in-context learning mechanisms linked to induction heads)?

Why this is useful for alignment

Alignment needs explanations that are more than plausible stories. IOI offers a template for converting interpretability into a falsifiable claim: propose a circuit, measure its causal footprint, and explicitly account for what is missing.