Why attention sinks stabilize long-context decoding

Abstract

One of the strangest facts about modern autoregressive transformers is that a handful of semantically unimportant early tokens can become operationally indispensable. Remove them from the KV cache during long-context decoding and perplexity spikes; keep them and the model remains stable far beyond its nominal context window. Xiao et al. (2023) named these tokens attention sinks. The narrow question in this memo is not whether sinks are useful in practice, but why they appear at all. My reading of the literature is that attention sinks are a structural byproduct of softmax attention in residual architectures: every head must allocate probability mass somewhere, even when no past token is semantically worth attending to, so models learn special positions that absorb this surplus mass with minimal disruption to the value stream. That makes sinks look less like memory slots and more like normalization valves. The deeper consequence is conceptual. Long-context behavior is partly stabilized not by better semantic retention, but by a learned mechanism for safely dumping unavoidable attention.

Related Work

The central source is Xiao et al. (2023), which introduced StreamingLLM. Their headline result was practical: sliding-window decoding breaks once the sequence length exceeds the cache size, but performance is largely restored by keeping a few initial tokens in the cache alongside the most recent ones. The paper argues that those early tokens become universal attention recipients even when they contain little semantic content, which is why evicting them destabilizes decoding.

Later work made the mechanism more explicit. Bondarenko et al. (2023) studied activation outliers in transformers and argued that some heads effectively try to perform a no-op residual update; under softmax, achieving an exact no-op is awkward because attention weights must still sum to one, so the network drives certain logits to extreme values. Gu et al. (2025) then investigated when attention sinks emerge during pretraining and found that they appear after effective optimization on sufficient data, can shift to other positions when the data distribution changes, and weaken when softmax is replaced by non-normalized alternatives. Zhai et al. (2023), while not about sinks directly, provide a useful adjacent lens: sharp, low-entropy attention distributions are a recurrent optimization tendency in transformers rather than an isolated artifact.

Taken together, these papers suggest that the sink is not a quirky prompt hack. It is a learned answer to a persistent optimization constraint created by normalized attention.

Method/Mechanism

The mechanism starts from an asymmetry in the attention computation. Softmax enforces that each query distributes one unit of probability mass across previous keys. But many decoding steps do not truly need external information from all heads. Some heads, at some layers, would ideally contribute almost nothing. In a residual architecture, the cleanest behavior for such a head is to leave the residual stream mostly unchanged.

Softmax makes that awkward. A head cannot literally attend to nobody. It must put weight somewhere. The model therefore benefits from creating token positions whose values are relatively harmless and whose keys reliably attract leftover attention. Early sequence positions are convenient for that role: they are globally visible, stable across examples, and often weak in semantic content compared with the task-relevant middle context. Over training, those positions accumulate large attention scores and can develop unusually large hidden-state norms or key biases, making them persistent recipients of otherwise wasted mass.
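The argument above can be checked on a toy example. The following is a minimal sketch, assuming a single query, made-up logits, and made-up value vectors; the names `head_output`, `sink_logit`, and `sink_value` are illustrative and do not come from any of the cited papers. It shows that softmax weights always sum to one, and that a position with a high key logit and a near-zero value lets the head emit an almost-empty update.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def head_output(logits, values):
    # Single-query attention: weights sum to one, output is the weighted value average.
    w = softmax(logits)
    return w, w @ values

rng = np.random.default_rng(0)
content_values = rng.normal(size=(5, 4))   # 5 content tokens with 4-dim values (made up)
content_logits = np.zeros(5)               # the head has no preference among them

# Without a sink, the head is forced to average the content values.
w_no_sink, out_no_sink = head_output(content_logits, content_values)

# With a sink: one extra position with a high key logit and a zero value vector.
sink_logit = 8.0                           # hypothetical; large relative to content logits
sink_value = np.zeros((1, 4))
logits = np.concatenate([[sink_logit], content_logits])
values = np.vstack([sink_value, content_values])
w_sink, out_sink = head_output(logits, values)

print("weights sum to:", w_sink.sum())
print("mass on sink:", w_sink[0])
print("output norm without sink:", np.linalg.norm(out_no_sink))
print("output norm with sink:", np.linalg.norm(out_sink))
```

In this toy setting the sink does not store anything; it only changes how much of the content values leaks into the residual stream.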

This also explains why the sink matters specifically for streaming. In ordinary full-context decoding, the sink token remains present, so the head can continue dumping excess mass there. In naive windowed decoding, that token gets evicted. The head is then forced to redirect its mass onto recent tokens that actually carry content, causing unwanted mixing and degraded predictions. Preserving a few initial tokens keeps the learned pressure-release path intact.
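A rough sketch of the two cache policies makes the eviction effect visible. It assumes a single head whose logits are zero everywhere except at position 0; the helper names `kept_positions` and `attention_over`, the window size, and the logit value are hypothetical. It does not model how positions are re-indexed inside the cache, only which tokens remain available to receive attention.

```python
import numpy as np

def kept_positions(seq_len, window, n_sink=0):
    # Positions retained in the KV cache: optional initial tokens plus the most recent `window` tokens.
    recent_start = max(seq_len - window, 0)
    sinks = [p for p in range(min(n_sink, seq_len)) if p < recent_start]
    return sinks + list(range(recent_start, seq_len))

def attention_over(logits, positions):
    # Softmax restricted to the cached positions.
    sel = logits[positions]
    e = np.exp(sel - sel.max())
    return e / e.sum()

seq_len, window = 20, 8
logits = np.zeros(seq_len)
logits[0] = 6.0   # hypothetical: position 0 has learned to attract surplus attention

naive = kept_positions(seq_len, window)                  # sliding window only
with_sinks = kept_positions(seq_len, window, n_sink=4)   # 4 initial tokens plus the recent window

w_naive = attention_over(logits, naive)
w_sinks = attention_over(logits, with_sinks)

# Attention mass that lands on recent, content-bearing tokens under each policy.
recent_mass_naive = w_naive.sum()       # everything, because nothing else is cached
recent_mass_sinks = w_sinks[4:].sum()   # what is left after the initial tokens take their share

print("recent-token mass, window only:", recent_mass_naive)
print("recent-token mass, sinks kept: ", recent_mass_sinks)
```

With the window-only policy, the same head that previously placed almost all of its weight on position 0 now spreads a full unit of attention over recent tokens, which is the unwanted mixing described above.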

Key Findings

Two case studies make the story concrete:

Case study 1: retain 4 initial tokens, recover long-context decoding. In StreamingLLM, standard sliding-window decoding degrades badly once the sequence exceeds the cache length. But retaining just the first few tokens alongside the recent window largely restores perplexity and enables stable decoding into the millions of tokens. That is hard to explain unless those initial tokens are functionally special.
Case study 2: the sink can move if training changes the privileged position. Gu et al. show that if a fixed token is placed consistently in position two or three during training, the sink shifts there rather than remaining on the literal first token. This is strong evidence against the idea that the sink is only a BOS-token quirk. The model is learning a role, not merely inheriting a fixed token identity.

Four crisp insights follow:

Attention sinks are more like normalization infrastructure than semantic memory. Their job is often to absorb excess attention mass, not store task content.
The phenomenon is induced by the combination of the training objective and softmax attention, not just by prompt format. It emerges during successful pretraining and weakens when attention normalization is changed.
Long-context failure under window eviction is partly self-inflicted. Removing sinks does not just remove old information; it breaks a learned way of avoiding harmful mixing.
The sink is positional but not intrinsically first-token-specific. Training can relocate it, which means evaluations should distinguish role from location.

A fifth insight matters for alignment-adjacent reasoning: if some internal machinery exists mainly to route around architectural constraints, then interpreting attention maps naively is risky. High attention on a token does not always mean high semantic relevance. Sometimes it means "this is where the model safely dumps unavoidable probability."

Limitations

The current evidence is persuasive but not fully unified. StreamingLLM established the phenomenon functionally, while later papers isolate candidate causes such as key bias, hidden-state outliers, and softmax normalization. These stories are compatible, but not yet reduced to one compact theorem that covers realistic LLMs. There is still a gap between simple mechanistic explanations and the exact behavior of large production architectures with RoPE, grouped-query attention, and finetuning.

There is also a measurement issue. Attention weight concentration, activation outliers, and sink usefulness are related but not identical notions. A model may display one strongly and another weakly. Finally, most sink analyses focus on language modeling loss or perplexity recovery. We know less about how sink preservation affects downstream reasoning quality, truthfulness, or safety behavior under very long adversarial contexts.
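To make the first of these notions concrete, here is one possible way to quantify attention weight concentration on a candidate position. It is a sketch under stated assumptions: the function names, the tensor layout (heads, queries, keys), the threshold of 0.3, and the synthetic attention maps are all illustrative choices rather than a metric taken from the cited papers. It says nothing about activation outliers or about whether keeping the position actually helps decoding.

```python
import numpy as np

def causal_softmax(logits):
    # logits: (queries, keys). Mask out future keys, then normalize each row.
    T = logits.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(mask, logits, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sink_mass_per_head(attn, position=0):
    # attn: (heads, queries, keys) causal attention weights.
    # Average weight on `position`, over the queries that can see it.
    return attn[:, position:, position].mean(axis=1)

def sink_rate(attn, position=0, threshold=0.3):
    # Fraction of heads whose average mass on `position` exceeds `threshold`.
    return float((sink_mass_per_head(attn, position) > threshold).mean())

T = 64
uniform_logits = np.zeros((T, T))   # head 0: no preferred key
sinky_logits = np.zeros((T, T))
sinky_logits[:, 0] = 5.0            # head 1: hypothetical strong pull toward key 0

attn = np.stack([causal_softmax(uniform_logits), causal_softmax(sinky_logits)])
masses = sink_mass_per_head(attn)

print("mass on position 0 per head:", masses)
print("sink rate:", sink_rate(attn))
```

A concentration score like this is cheap to compute, but it should be reported alongside separate measurements of activation norms and of decoding quality with and without the position in the cache.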

Future Directions

One obvious direction is architectural: can we design attention mechanisms that let a head represent "attend to nobody" directly, instead of forcing it to learn a sink token? That would test whether sinks are an optimization crutch or a genuinely useful inductive bias. Non-normalized or rectified attention variants are one route, but they need to preserve the robustness that softmax still provides.
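As one illustration of what "attend to nobody" could look like, the sketch below adds an implicit extra key with logit zero and a zero value vector, so the weights over real keys may sum to less than one. This is an assumption-laden toy and is not presented as the method of any paper cited here; the function name `softmax_with_null` and the example logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Standard softmax: weights always sum to one.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def softmax_with_null(logits):
    # Softmax with an implicit extra key of logit 0 whose value is the zero vector.
    # The returned weights cover only the real keys, so they can sum to less than one.
    m = max(float(logits.max()), 0.0)
    e = np.exp(logits - m)
    return e / (e.sum() + np.exp(-m))

# Hypothetical case: the head finds no key worth attending to.
quiet_logits = np.full(6, -8.0)
w_std = softmax(quiet_logits)
w_null = softmax_with_null(quiet_logits)

# Hypothetical case: one key is clearly relevant.
peaked_logits = np.array([10.0, 0.0, 0.0, 0.0, 0.0, 0.0])
w_peaked = softmax_with_null(peaked_logits)

print("standard softmax total mass:", w_std.sum())
print("null-key softmax total mass (quiet head):", w_null.sum())
print("null-key softmax total mass (peaked head):", w_peaked.sum())
```

When some key has a large logit the variant behaves almost like ordinary softmax, and when all logits are low the head's output shrinks toward zero without needing a dedicated token. Whether such a change keeps training stable at scale is exactly the open question raised here.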

A second direction is interpretability. It should be possible to separate semantically meaningful heads from sink-dependent heads using causal interventions rather than raw attention maps. That would sharpen mechanistic analyses of long-context reasoning and reduce overinterpretation of visually salient early tokens.
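A minimal sketch of such an intervention, assuming access to one head's attention weights and value vectors for a single query: zero the value at the candidate position while holding the weights fixed, and measure how much the head output moves. The function name `value_ablation_effect` and all numbers are hypothetical. Two heads with identical attention maps can give very different answers.

```python
import numpy as np

def value_ablation_effect(weights, values, position=0):
    # Norm of the change in head output when the value at `position` is zeroed
    # and the attention weights are held fixed.
    ablated = values.copy()
    ablated[position] = 0.0
    return float(np.linalg.norm(weights @ values - weights @ ablated))

# Both heads put 90% of their attention on position 0.
weights = np.array([0.9, 0.04, 0.03, 0.03])

rng = np.random.default_rng(1)
content = rng.normal(size=(3, 4))   # made-up values for the other tokens

# Sink-like head: position 0 carries almost nothing in its value vector.
sink_values = np.vstack([np.full((1, 4), 0.01), content])
# Semantic head: position 0 carries a substantial value vector.
semantic_values = np.vstack([np.full((1, 4), 1.0), content])

effect_sink = value_ablation_effect(weights, sink_values)
effect_semantic = value_ablation_effect(weights, semantic_values)

print("output change, sink-like head:", effect_sink)
print("output change, semantic head: ", effect_semantic)
```

The attention map alone cannot tell these two heads apart; the intervention can, because it measures what the attended token actually contributes to the output.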

For evaluation, I would also like to see long-context benchmarks report whether performance depends on preserving designated sink tokens. If so, "effective context length" is partly a property of cache policy, not just the pretrained model.

Open question: can transformers be given an explicit low-distortion no-op pathway that removes the need for attention sinks without sacrificing the optimization stability that made them emerge in the first place?

Summary

Attention sinks look accidental at first, but the literature suggests they are an adaptive response to a real constraint of softmax attention. StreamingLLM showed that keeping a few early sink tokens can rescue long-context decoding; Bondarenko et al. linked related outliers to heads trying to do nothing; Gu et al. showed sink behavior emerges through training and can move when the data distribution changes. The common lesson is that some transformer internals are best understood as control systems for the architecture itself, not as clean representations of meaning. In long-context LLMs, stable generation sometimes depends on preserving that internal plumbing.

References

Primary: Xiao et al. "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. https://arxiv.org/abs/2309.17453
Auxiliary: Bondarenko et al. "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing." NeurIPS 2023. https://arxiv.org/abs/2306.12929
Auxiliary: Gu et al. "When Attention Sink Emerges in Language Models: An Empirical View." ICLR 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/f1b04face60081b689ba740d39ea8f37-Paper-Conference.pdf
Auxiliary: Zhai et al. "Stabilizing Transformer Training by Preventing Attention Entropy Collapse." ICML 2023. https://arxiv.org/abs/2303.06296