Why attention sinks stabilize long-context decoding

June 17, 2026 · 18 min read

Abstract

One of the strangest facts about modern autoregressive transformers is that a handful of semantically unimportant early tokens can become operationally indispensable. Remove them from the KV cache during long-context decoding and perplexity spikes; keep them and the model remains stable far beyond its nominal context window. Xiao et al. (2023) named these tokens attention sinks. The narrow question in this memo is not whether sinks are useful in practice, but why they appear at all. My reading of the literature is that attention sinks are a structural byproduct of softmax attention in residual architectures: every head must allocate probability mass somewhere, even when no past token is semantically worth attending to, so models learn special positions that absorb this surplus mass with minimal disruption to the value stream. That makes sinks look less like memory slots and more like normalization valves. The deeper consequence is conceptual. Long-context behavior is partly stabilized not by better semantic retention, but by a learned mechanism for safely dumping unavoidable attention.

Related Work

The central source is Xiao et al. (2023), which introduced StreamingLLM. Their headline result was practical: sliding-window decoding breaks once the sequence length exceeds the cache size and the earliest tokens are evicted, but performance is largely restored by keeping a few initial tokens alongside the most recent ones. The paper argues that those early tokens become universal attention recipients even when they contain little semantic content, which is why evicting them destabilizes decoding.

Later work made the mechanism more explicit. Bondarenko et al. (2023) studied activation outliers in transformers and argued that some heads effectively try to perform a no-op residual update; under softmax, an exact no-op is awkward because attention weights must still sum to one, so the network drives certain logits to extreme values. Wan et al. (2025) then investigated when attention sinks emerge during pretraining and found that they appear after effective optimization on sufficient data, can shift to other positions when the data distribution changes, and weaken when softmax is replaced by non-normalized alternatives. Zhai et al. (2023), while not about sinks directly, provide a useful adjacent lens: sharp, low-entropy attention distributions are a recurrent optimization tendency in transformers rather than an isolated artifact.

Taken together, these papers suggest that the sink is not a quirky prompt hack. It is a learned answer to a persistent optimization constraint created by normalized attention.

Method/Mechanism

The mechanism starts from an asymmetry in the attention computation. Softmax enforces that each query distributes one unit of probability mass across previous keys. But many decoding steps do not truly need external information from all heads. Some heads, at some layers, would ideally contribute almost nothing. In a residual architecture, the cleanest behavior for such a head is to leave the residual stream mostly unchanged.

Softmax makes that awkward. A head cannot literally attend to nobody. It must put weight somewhere. The model therefore benefits from creating token positions whose values are relatively harmless and whose keys reliably attract leftover attention. Early sequence positions are convenient for that role: they are globally visible, stable across examples, and often weak in semantic content compared with the task-relevant middle context. Over training, those positions accumulate large attention scores and can develop unusually large hidden-state norms or key biases, making them persistent recipients of otherwise wasted mass.
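To make the constraint concrete, here is a minimal NumPy sketch of a single head for a single query. Everything in it is a toy assumption of mine (the dimension, the logit of 6.0, the near-zero sink value, the helper names `softmax` and `head_output`); it is not taken from any of the cited papers. It only illustrates the argument above: the weights always sum to one, and a high-logit, low-norm position lets the head's update shrink toward a no-op.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of attention logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def head_output(logits, values):
    """Return (attention weights, weighted sum of value vectors) for one query."""
    weights = softmax(logits)
    return weights, weights @ values

# Toy setup: all numbers are illustrative assumptions, not measurements.
d_model = 8
content_values = np.eye(5, d_model)   # five "content" tokens with unit-norm values
content_logits = np.zeros(5)          # the head has no real preference among them

# Without a sink, the full unit of mass is spread over content tokens.
w_plain, out_plain = head_output(content_logits, content_values)

# With a sink: one extra position with a high logit and a near-zero value.
sink_logit = np.array([6.0])
sink_value = np.full((1, d_model), 1e-3)
w_sink, out_sink = head_output(
    np.concatenate([sink_logit, content_logits]),
    np.vstack([sink_value, content_values]),
)

print("mass on sink:", w_sink[0])
print("update norm without sink:", np.linalg.norm(out_plain))
print("update norm with sink:   ", np.linalg.norm(out_sink))
```

In this toy, the head cannot reduce its total attention, but it can choose where the attention lands. Routing it to a position with a tiny value vector is the closest it can get to leaving the residual stream alone.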

This also explains why the sink matters specifically for streaming. In ordinary full-context decoding, the sink token remains present, so the head can continue dumping excess mass there. In naive windowed decoding, that token gets evicted. The head is then forced to redirect its mass onto recent tokens that actually carry content, causing unwanted mixing and degraded predictions. Preserving a few initial tokens keeps the learned pressure-release path intact.
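The difference between the two cache policies fits in a few lines. The sketch below only computes which token positions survive in the cache. The function names and the default of four sink tokens are my own illustrative choices, and it deliberately ignores how positional encodings are handled for the retained entries.

```python
def naive_window(seq_len, cache_size):
    """Positions kept by a plain sliding window: only the most recent tokens."""
    start = max(0, seq_len - cache_size)
    return list(range(start, seq_len))

def sink_plus_window(seq_len, cache_size, n_sink=4):
    """Positions kept when the first n_sink tokens are pinned in the cache.

    n_sink=4 is an assumed default for illustration; the right number
    depends on the model.
    """
    if n_sink >= cache_size:
        raise ValueError("cache_size must leave room for recent tokens")
    if seq_len <= cache_size:
        # Nothing needs to be evicted yet.
        return list(range(seq_len))
    n_recent = cache_size - n_sink
    return list(range(n_sink)) + list(range(seq_len - n_recent, seq_len))

# After 10 tokens with room for 6 cache entries:
print(naive_window(10, 6))               # the initial tokens are gone
print(sink_plus_window(10, 6, n_sink=2)) # the initial tokens are still present
```

Both policies hold the same number of entries. The only change is that a couple of slots are spent on the earliest positions instead of slightly older recent ones.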

Key Findings

Two case studies make the story concrete:

Four crisp insights follow:

A fifth insight matters for alignment-adjacent reasoning: if some internal machinery exists mainly to route around architectural constraints, then interpreting attention maps naively is risky. High attention on a token does not always mean high semantic relevance. Sometimes it means "this is where the model safely dumps unavoidable probability."

Limitations

The current evidence is persuasive but not fully unified. StreamingLLM established the phenomenon functionally, while later papers isolate candidate causes such as key bias, hidden-state outliers, and softmax normalization. These stories are compatible, but not yet reduced to one compact theorem that covers realistic LLMs. There is still a gap between simple mechanistic explanations and the exact behavior of large production architectures with RoPE, grouped-query attention, and finetuning.

There is also a measurement issue. Attention weight concentration, activation outliers, and sink usefulness are related but not identical notions. A model may display one strongly and another weakly. Finally, most sink analyses focus on language modeling loss or perplexity recovery. We know less about how sink preservation affects downstream reasoning quality, truthfulness, or safety behavior under very long adversarial contexts.

Future Directions

One obvious direction is architectural: can we design attention mechanisms that let a head represent "attend to nobody" directly, instead of forcing it to learn a sink token? That would test whether sinks are an optimization crutch or a genuinely useful inductive bias. Non-normalized or rectified attention variants are one route, but they need to preserve the robustness that softmax still provides.
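As one hypothetical illustration of what "attend to nobody" could look like, the sketch below adds an implicit extra slot with logit zero and no value vector to the softmax denominator, so the weights on real tokens can sum to less than one. This is my own toy variant (the name `softmax_with_null` is made up here), not the method of any paper cited above, and it says nothing about whether such a change trains well.

```python
import numpy as np

def softmax_with_null(logits):
    """Softmax with an implicit null slot of logit 0 that carries no value.

    Computes exp(x_i) / (1 + sum_j exp(x_j)) in a numerically stable way.
    The returned weights cover real tokens only and may sum to less than 1.
    """
    logits = np.asarray(logits, dtype=float)
    m = max(float(np.max(logits)), 0.0)   # include the null logit in the max
    exps = np.exp(logits - m)
    return exps / (exps.sum() + np.exp(-m))

# A head with nothing to attend to: all logits strongly negative.
idle = softmax_with_null([-10.0, -10.0, -10.0])
# A head with a clear target: one logit strongly positive.
busy = softmax_with_null([10.0, 0.0, 0.0])

print("total mass when idle:", idle.sum())
print("total mass when busy:", busy.sum())
```

With ordinary softmax the idle head would still hand out a full unit of mass, split evenly across the three tokens. Here it can let almost all of it fall into the null slot, which is the behavior a learned sink token approximates.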

A second direction is interpretability. It should be possible to separate semantically meaningful heads from sink-dependent heads using causal interventions rather than raw attention maps. That would sharpen mechanistic analyses of long-context reasoning and reduce overinterpretation of visually salient early tokens.
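A very simple intervention of this kind is to zero one token's value vector while holding the attention weights fixed, then measure how much the head output moves. The sketch below does that on toy numbers that I chose for illustration. The function name `value_ablation_effect` is hypothetical, and a real analysis would run on actual model activations and would probably use richer interventions than zeroing.

```python
import numpy as np

def value_ablation_effect(weights, values):
    """Change in head output when each token's value is zeroed in turn.

    Attention weights are held fixed, so the effect for position i is the
    norm of that position's weighted contribution to the output.
    """
    weights = np.asarray(weights, dtype=float)
    values = np.asarray(values, dtype=float)
    baseline = weights @ values
    effects = np.empty(len(weights))
    for i in range(len(weights)):
        ablated = values.copy()
        ablated[i] = 0.0                      # intervene on one value vector
        effects[i] = np.linalg.norm(baseline - weights @ ablated)
    return effects

# Toy head: position 0 behaves like a sink (high weight, tiny value).
weights = np.array([0.90, 0.05, 0.05])
values = np.array([
    [0.01, 0.0],   # sink-like token
    [1.0, 0.0],    # content token
    [0.0, 1.0],    # content token
])
effects = value_ablation_effect(weights, values)

for i, (w, e) in enumerate(zip(weights, effects)):
    print(f"position {i}: attention={w:.2f}, ablation effect={e:.3f}")
```

In this toy, the position that dominates the attention map is the one whose value matters least to the output, which is exactly the mismatch a raw attention heatmap would hide.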

For evaluation, I would also like to see long-context benchmarks report whether performance depends on preserving designated sink tokens. If so, "effective context length" is partly a property of cache policy, not just the pretrained model.

Open question: can transformers be given an explicit low-distortion no-op pathway that removes the need for attention sinks without sacrificing the optimization stability that made them emerge in the first place?

Summary

Attention sinks look accidental at first, but the literature suggests they are an adaptive response to a real constraint of softmax attention. StreamingLLM showed that keeping a few early sink tokens can rescue long-context decoding; Bondarenko et al. linked related outliers to heads trying to do nothing; Wan et al. showed sink behavior emerges through training and can move when the data distribution changes. The common lesson is that some transformer internals are best understood as control systems for the architecture itself, not as clean representations of meaning. In long-context LLMs, stable generation sometimes depends on preserving that internal plumbing.

References