
Why open-ended decoding needs tail truncation

June 26, 2026 · 20 min read

Abstract

A narrow but important question in language modeling is why decoding strategies that maximize model probability often produce bad text in open-ended settings. Holtzman et al. made the puzzle vivid: beam search and greedy decoding can turn strong language models into machines for repetition, blandness, and local dead ends, while stochastic truncation methods such as nucleus sampling produce more natural continuations. The best explanation is not that search is weak. It is that next-token distributions in open-ended generation contain an unreliable tail: many individually small token probabilities become dangerous in aggregate, while the globally highest-probability continuations are often too safe, too repetitive, and too unsurprising to match human text. Later work sharpens that picture. Typical decoding reframes the issue in terms of local surprisal, and more recent theory connects the bad tail to systematic modeling error, including limits imposed by the output softmax. The result is a useful lesson about LLMs: generation quality depends not just on what a model knows, but on how inference navigates uncertainty at each step.

Related Work

The central source is Holtzman et al. (2020), "The Curious Case of Neural Text Degeneration." That paper showed that open-ended text generation behaves very differently from tasks like translation, where beam search is often helpful. In story continuation and long-form generation, maximizing likelihood produced text that was too probable, too repetitive, and too narrow compared with human continuations. Their proposed fix was nucleus sampling: at each step, truncate the distribution to the smallest set of tokens whose cumulative probability reaches a threshold p, renormalize, and sample within that set.

Meister et al. (2023) pushed the question from "which tokens are likely?" to "which tokens have the right amount of information?" Their typical decoding work argues that the surprisal of each token in human text tends to stay near the conditional entropy of the model's next-token distribution, rather than always sitting at the low-surprisal, top-probability end. Finlayson et al. (2024) then supplied the missing theory for why truncation works at all. Their account shows that threshold-based truncation can be understood as a way to avoid sampling tokens that may have zero true probability, and that the softmax bottleneck is one plausible source of these tail errors. Welleck et al. (2020) provide a useful complement on the training side: if repetitive continuations receive too much probability mass, one can modify the objective itself with unlikelihood training rather than only repairing the decoder afterward.
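The threshold idea can be made concrete with a small sketch. It assumes a deliberately simplified additive error bound: every model probability is within epsilon of the true probability. Under that assumption, any token the model scores above epsilon must have nonzero true probability. The bound, the function name `threshold_filter`, and the toy distributions are illustrative assumptions, not the exact formulation in Finlayson et al.

```python
import numpy as np

def threshold_filter(probs, epsilon=1e-3):
    """Drop tokens with probability at or below epsilon, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = probs > epsilon
    if not keep.any():
        # Assumed fallback: never return an empty support, keep the argmax.
        keep = probs == probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Hypothetical toy case: two tokens are impossible under the true
# distribution, but the model leaks a little mass onto them.
TRUE_PROBS = np.array([0.6, 0.4, 0.0, 0.0])
MODEL_PROBS = np.array([0.597, 0.399, 0.002, 0.002])

# Every model estimate is within 0.005 of the truth, so a 0.005 threshold
# removes the leaked tokens while keeping both genuinely possible ones.
filtered = threshold_filter(MODEL_PROBS, epsilon=0.005)
print(filtered)
```

The threshold here is fixed, unlike the adaptive cutoff in nucleus sampling. A fixed cutoff is easier to connect to an error bound but ignores how flat or peaked the distribution is.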

Method/Mechanism

The core mechanism starts with a mismatch between sequence likelihood and human quality. In open-ended writing there are many acceptable next moves. Human continuations therefore wander through moderately probable but informative tokens. Beam search does the opposite: it compounds locally high-probability choices into globally dull trajectories. Holtzman et al. showed that this makes generated text less surprising than human text and more likely to fall into repetition loops.
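A toy example gives some intuition for how locally greedy choices end in loops. The sketch below uses a hypothetical four-token bigram model, with a transition table invented for illustration. In a first-order model like this, greedy decoding is a deterministic map on a finite set, so it must eventually cycle. Real language models condition on the whole prefix, so this is an intuition pump rather than evidence about any particular model.

```python
import numpy as np

def greedy_decode(transitions, start, steps):
    """Repeatedly pick the most probable next token under a bigram table."""
    tokens = [start]
    for _ in range(steps):
        tokens.append(int(np.argmax(transitions[tokens[-1]])))
    return tokens

# Hypothetical bigram model: row i holds P(next token | current token = i).
TRANSITIONS = np.array([
    [0.10, 0.50, 0.30, 0.10],
    [0.20, 0.10, 0.45, 0.25],
    [0.15, 0.45, 0.10, 0.30],
    [0.40, 0.20, 0.30, 0.10],
])

print(greedy_decode(TRANSITIONS, start=0, steps=6))  # [0, 1, 2, 1, 2, 1, 2]
```

No single step here is overwhelmingly confident: the winning probabilities are only 0.45 to 0.50. The loop comes from always taking the top choice, not from the model being certain.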

Full ancestral sampling fails in the opposite direction. If one samples from the entire predicted distribution, the tail injects too many low-quality continuations. The problem is not any single token in isolation. It is the aggregate effect of thousands of low-probability candidates whose estimates are noisy. Nucleus sampling addresses exactly that point: keep the high-mass "nucleus" that the model seems confident about, discard the long tail, and preserve randomness only inside the surviving set. Because the cutoff adapts to context, the candidate set stays small in peaked contexts and broadens when many continuations are genuinely plausible.
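The truncation step described above can be sketched in a few lines of NumPy. This is a minimal illustration operating on an explicit probability vector, not the reference implementation from Holtzman et al. The function names and the `top_p` default are assumptions chosen for the example. The helper `tail_hit_probability` additionally assumes the same tail mass at every step and independent draws, which is a simplification.

```python
import numpy as np

def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose mass reaches top_p; renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)                 # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Include the first token that pushes the running mass up to top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def sample_nucleus(probs, top_p=0.9, rng=None):
    """Draw one token index from the truncated, renormalized distribution."""
    rng = np.random.default_rng() if rng is None else rng
    filtered = nucleus_filter(probs, top_p)
    return int(rng.choice(len(filtered), p=filtered))

def tail_hit_probability(tail_mass, steps):
    """Chance of drawing at least one tail token in `steps` untruncated samples."""
    return 1.0 - (1.0 - tail_mass) ** steps

peaked = [0.97, 0.01, 0.01, 0.005, 0.005]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]
print(np.count_nonzero(nucleus_filter(peaked, top_p=0.85)))  # 1 token survives
print(np.count_nonzero(nucleus_filter(flat, top_p=0.85)))    # all 5 survive
print(tail_hit_probability(tail_mass=0.1, steps=50))
```

The last line shows the aggregation effect: with an assumed 10% of mass in the tail at each step, a 50-token sample drawn without truncation is very likely to contain at least one tail token, even though each individual tail token is rare.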

Typical decoding offers a more refined intuition. It says that human-like text is not merely high probability; it is locally typical relative to the model's own entropy. Tokens that are too predictable make text collapse into generic loops, while tokens that are too surprising push it into incoherence. Later theory from Finlayson et al. makes the tail story more mechanistic by showing how truncation can be seen as support recovery under bounded error assumptions. Their basis-aware analysis goes further: if output-layer structure causes some low-probability estimates to be wrong for geometric reasons, then smarter truncation could keep good rare tokens while rejecting bad ones, instead of relying on a blunt threshold alone.
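The local-typicality criterion can also be sketched directly. The version below ranks tokens by how far their surprisal sits from the entropy of the distribution and keeps the closest ones until a target mass is covered. It is a simplified reading of the idea in Meister et al., with the function name and the `mass` parameter chosen here for illustration.

```python
import numpy as np

def typical_filter(probs, mass=0.9):
    """Keep tokens whose surprisal is closest to the entropy, up to `mass`."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs > 0
    surprisal = np.full(probs.shape, np.inf)   # zero-probability tokens rank last
    surprisal[nonzero] = -np.log(probs[nonzero])
    entropy = float(np.sum(probs[nonzero] * surprisal[nonzero]))
    deviation = np.abs(surprisal - entropy)
    order = np.argsort(deviation)              # most "typical" tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, mass)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# One fairly likely token plus four moderately likely alternatives.
probs = [0.4, 0.15, 0.15, 0.15, 0.15]
print(typical_filter(probs, mass=0.5))
```

In this example the entropy is about 1.50 nats, the top token has a surprisal of about 0.92, and the alternatives sit at about 1.90. The alternatives are closer to the entropy, so with `mass=0.5` the filter keeps them and drops the most probable token. Nucleus sampling would never do that, which is the main behavioral difference between the two rules.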

Key Findings

Two concrete case studies make the literature easier to trust:

Four crisp insights follow:

One alignment-adjacent implication is that refusal quality, honesty, and deliberative behavior may be partly decoder-sensitive. If bad tails contain sycophantic, evasive, or repetitive continuations, then evaluation of aligned behavior should not treat decoding as a neutral wrapper around model competence.

Limitations

This line of work has clear limits. First, much of the classic evidence uses GPT-2-era models. The phenomenon still matters, but the exact failure mix changes as base models improve. Second, open-ended generation is subjective: quality, diversity, and coherence trade off against each other, and no single automatic metric settles the issue. Third, truncation methods remain heuristic. Nucleus sampling often works well, but it can still over-prune in low-entropy contexts or admit the wrong rare tokens in high-entropy ones. Finlayson et al.'s theory is promising here, but their more precise support-aware methods are not yet the default practical decoder.

There is also a conceptual limit. Tail truncation explains how to avoid some decoding failures, but it does not by itself explain why the model assigns too much probability to degenerate loops or generic continuations. That part reaches back into training data, objective choice, and architectural constraints in the model itself.

Future Directions

The most interesting next step is to replace hand-tuned truncation heuristics with decoders that reason directly about calibrated support, local information content, and downstream utility. Another direction is joint training-decoding design: if unlikelihood-style objectives and truncation methods solve complementary parts of the same problem, they should probably be co-designed rather than studied in isolation.

A third direction matters for alignment evaluation. Many safety benchmarks are effectively open-ended generation tasks, so we should ask how much measured harmlessness or truthfulness depends on tail control rather than on internal representation alone. Decoder choice may be part of the policy, not just part of the interface.

Open question: can we build a decoding rule that preserves nucleus sampling's robustness to tail errors while selectively recovering low-probability tokens that are rare for good reasons rather than bad ones?

Summary

Open-ended language generation breaks naive likelihood maximization because good text is not usually the most probable path through a token-level model. Holtzman et al. showed that maximization collapses into repetition and genericness, while full sampling exposes an unreliable tail. Nucleus sampling works because it keeps randomness where the model is confident and cuts it where estimation error aggregates. Typical decoding and later theory refine that story, but they preserve its core lesson: the decoder is an essential part of language-model behavior, not a disposable afterthought.

References