Sparsity as a control knob for superposition

February 5, 2026 · 9 min read

Superposition is one of those interpretability concepts that is easy to recite but hard to operationalize. We say that neurons are “polysemantic” because multiple features share the same direction in activation space, and that a model “stores” more features than it has neurons. But what actually controls the transition from clean, monosemantic features to a dense superposed representation?

Research question: In toy models of superposition, how does a sparsity penalty change the representational regime, and what does that imply for interpretability in real models?

Abstract

This memo digs into the toy-model analysis of superposition and focuses on a single control knob: sparsity. In the Anthropic “Toy Models of Superposition” work, a sparsity-penalized, autoencoder-like objective produces phase transitions between (1) a dense, interference-prone representation in which many features share directions, and (2) a near-monosemantic regime in which features map cleanly onto individual coordinates. The key mechanism is geometric: sparsity changes the cost of activating extra features, which changes the optimal packing of feature directions. I summarize the setup, the mechanism, and the implications, then outline what the toy models do and do not tell us about larger transformer representations.

Related Work

The superposition story sits on a longer arc of interpretability research about feature geometry and linear structure in neural networks. The toy model work formalizes superposition in a controlled setting, while “Towards Monosemanticity” attempts to recover cleaner features in real models via dictionary learning. The transformer-circuits framework provides the mathematical context for treating residual streams as linear spaces with sparse features.

Method/Mechanism

The toy models use a simplified, controlled network that learns to represent a set of independent features with limited representational capacity. The model sees sparse feature vectors (only a small subset of features active at once) and must reconstruct them through a low-dimensional hidden bottleneck. The training objective includes a sparsity penalty (think L1 on hidden activations) that discourages spreading mass across many units.
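
A minimal sketch of this kind of objective, assuming a tied encoder/decoder and an explicit L1 term on the hidden activations. The shapes, names, and the exact loss form are my simplification, not the paper's precise setup:

```python
import numpy as np

def toy_loss(W, b, x, lam):
    """Reconstruction loss plus an L1 sparsity penalty on hidden activations.

    W: (n_hidden, n_features) tied encoder/decoder weights (assumed tied here).
    """
    h = W @ x                              # encode into n_hidden < n_features dims
    x_hat = np.maximum(W.T @ h + b, 0.0)   # decode with a ReLU
    recon = np.sum((x - x_hat) ** 2)       # how much of x survives the bottleneck
    sparsity = lam * np.sum(np.abs(h))     # the control knob discussed above
    return recon + sparsity

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 2
W = rng.normal(size=(n_hidden, n_features))
b = np.zeros(n_features)
x = np.zeros(n_features)
x[1] = 1.0                                 # a sparse input: one active feature
print(toy_loss(W, b, x, lam=0.1))
```

Raising `lam` makes every nonzero hidden activation more expensive, which is the pressure the rest of the memo analyzes.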

The mechanism is geometric packing. Without strong sparsity pressure, the model gains reconstruction accuracy by letting different features share directions in the hidden layer. Features become “superposed” because the model can tolerate some interference in exchange for extra capacity. When the sparsity penalty is increased, those shared directions become more expensive: each activation now carries a higher cost, so it becomes cheaper to keep features isolated. The result is a phase transition in which the optimal solution shifts from shared to separated directions, giving rise to monosemantic features.
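
One way to make "interference" concrete is to read it off the Gram matrix of feature directions: the off-diagonal entries of W.T @ W measure how much activating one feature leaks into another. This metric construction is mine, used purely for illustration:

```python
import numpy as np

def interference(W):
    """Worst-case leakage between any two feature directions (columns of W)."""
    G = W.T @ W                      # Gram matrix of feature directions
    off = G - np.diag(np.diag(G))    # keep only cross-feature terms
    return np.abs(off).max()

# Separated: each feature gets its own hidden direction -> no leakage.
W_separated = np.array([[1.0, 0.0],
                        [0.0, 1.0]])

# Shared: two features packed onto one hidden direction -> leakage.
W_shared = np.array([[1.0, 1.0]]) / np.sqrt(2)

print(interference(W_separated))   # 0.0
print(interference(W_shared))      # 0.5
```

Sparsity pressure effectively taxes the shared solution, since every use of the packed direction activates both features' penalty terms.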

Key Findings: Concrete Examples

Example 1: Two features, one neuron. In the toy model, suppose there are two independent features but only one hidden unit. With low sparsity, the optimal solution is to align the hidden unit along the sum of the two feature directions, so both can be partially reconstructed. The two features are entangled with each other, but the model still captures both. As the sparsity penalty increases, the cost of activating the shared unit grows; the model can no longer justify the overlap, and it effectively chooses a single feature to represent cleanly. This is the simplest illustration of superposition giving way to monosemanticity.
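
This story can be caricatured numerically. The setup below is my own cartoon, not the paper's experiment: a linear encode/decode through one hidden unit, a hand-picked input distribution in which the two features occasionally co-occur, and a coarse grid search over nonnegative weights (nonnegativity is justified by symmetry):

```python
import numpy as np
from itertools import product

# Inputs: feature 1 alone, feature 2 alone, or (rarely) both at once.
# The probabilities are assumptions chosen for illustration.
INPUTS = [(np.array([1.0, 0.0]), 0.45),
          (np.array([0.0, 1.0]), 0.45),
          (np.array([1.0, 1.0]), 0.10)]

def expected_loss(w, lam):
    """Expected reconstruction error + L1 activation penalty for weights w."""
    total = 0.0
    for x, p in INPUTS:
        h = w @ x                 # scalar hidden activation
        x_hat = w * h             # linear decode back to feature space
        total += p * (np.sum((x - x_hat) ** 2) + lam * abs(h))
    return total

def best_w(lam, grid=np.linspace(0.0, 1.0, 21)):
    """Brute-force the optimal (w1, w2) on a coarse grid."""
    return min(product(grid, repeat=2),
               key=lambda w: expected_loss(np.array(w), lam))

print(best_w(0.0))   # low sparsity: both coordinates equal (shared direction)
print(best_w(0.8))   # high sparsity: one coordinate driven to zero (dedicated)
```

The grid argmin shifts from a symmetric, shared direction at zero penalty to a single-feature direction at high penalty, which is the entanglement-to-dedication shift described above.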

Example 2: Four features, two neurons. With two neurons and four features, low sparsity yields a “packed” solution where each neuron carries combinations of multiple features. Increasing sparsity creates a bifurcation: the model reorganizes so each neuron is more dedicated, and certain features are dropped or reconstructed poorly. The transition is abrupt, reflecting the phase change in the optimization landscape.
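
A similarly hedged cartoon for this case: rather than training, compare two hand-written candidate weight matrices at unit scale, a "packed" one (each neuron carries two features antipodally) and a "dedicated" one (one feature per neuron, two features dropped). The ReLU decode and the input distribution, including occasional co-occurring pairs, are my assumptions:

```python
import numpy as np

PACKED = np.array([[1.0, -1.0, 0.0, 0.0],    # neuron 1 carries features 1, 2 antipodally
                   [0.0, 0.0, 1.0, -1.0]])   # neuron 2 carries features 3, 4
DEDICATED = np.array([[1.0, 0.0, 0.0, 0.0],  # one feature per neuron;
                      [0.0, 1.0, 0.0, 0.0]]) # features 3 and 4 are dropped

# Each feature alone (p = 0.2 each), or a packed pair together (p = 0.1 each).
E = np.eye(4)
INPUTS = [(E[i], 0.2) for i in range(4)] + [(E[0] + E[1], 0.1), (E[2] + E[3], 0.1)]

def expected_loss(W, lam):
    total = 0.0
    for x, p in INPUTS:
        h = W @ x
        x_hat = np.maximum(W.T @ h, 0.0)     # ReLU decode filters antipodal interference
        total += p * (np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(h)))
    return total

for lam in (0.5, 1.5):
    winner = ("packed" if expected_loss(PACKED, lam) < expected_loss(DEDICATED, lam)
              else "dedicated")
    print(f"lambda={lam}: {winner} wins")
```

In this cartoon the winner flips as the penalty crosses a threshold: packing all four features is optimal at low penalty, while dedicating neurons (and dropping features) is optimal at high penalty, mirroring the bifurcation described above.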

Limitations

The toy models abstract away many critical details: real transformers are deep, nonlinear, and operate on highly structured data rather than independent features. The superposition argument relies on assumptions about feature independence and sparsity that may not hold in language. Also, the phase transition in toy settings may smooth out in large models with billions of parameters. Finally, the sparsity penalty is a clean knob in the toy objective, but in practice sparsity arises indirectly from architecture and data statistics, not from an explicit L1 term.

Future Directions

If sparsity is indeed a control knob, we need to identify the real-world equivalents: what architectural or training choices act like implicit sparsity penalties? Structured sparsity (group L1, activation gating, or sparsity-inducing regularizers) could be experimental levers for testing the toy predictions. Another direction is to measure the geometry of learned features across layers in existing models and see whether the toy model phase transition appears as a shift in feature overlap or polysemanticity.
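
For the measurement direction, one candidate overlap statistic is sketched below, under the assumption that feature directions are available as columns of some weight or dictionary matrix; the metric construction is mine, not from the paper:

```python
import numpy as np

def max_overlap(W):
    """Per-feature worst-case |cosine| with any other feature direction.

    Columns of W are treated as feature directions. Values near 0 mean
    well-separated (monosemantic-friendly); values near 1 mean heavy overlap.
    """
    cols = W / np.linalg.norm(W, axis=0, keepdims=True)  # unit directions
    G = np.abs(cols.T @ cols)
    np.fill_diagonal(G, 0.0)                             # ignore self-similarity
    return G.max(axis=0)

W_mono = np.eye(3)                        # three features in three dimensions
rng = np.random.default_rng(0)
W_super = rng.normal(size=(2, 6))         # six features crammed into two dimensions

print(max_overlap(W_mono))                # all zeros: perfectly separated
print(max_overlap(W_super))               # large overlaps: unavoidable in 2 dims
```

Tracking a statistic like this across layers, or across training runs with different implicit sparsity pressure, is one way to look for the predicted shift in feature overlap.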

Open question: Is there a measurable “sparsity threshold” in real LLMs where the representation shifts from superposition-dominated to monosemantic-dominated features, or does scale smooth the transition so much that the threshold becomes meaningless?

Summary

The toy models suggest that superposition is not just a descriptive label; it is a regime that depends on a tunable sparsity penalty. When sparsity is low, feature sharing is optimal and polysemantic neurons are inevitable. When sparsity is high, the model reorganizes toward monosemantic features, trading capacity for clarity. The practical takeaway is that interpretability may depend as much on shaping representation geometry as on probing it.