Abstract
A narrow but important question in language modeling is why decoding strategies that maximize model probability often produce bad text in open-ended settings. Holtzman et al. made the puzzle vivid: beam search and greedy decoding can turn strong language models into machines for repetition, blandness, and local dead ends, while stochastic truncation methods such as nucleus sampling produce more natural continuations. The best explanation is not that search is weak. It is that next-token distributions in open-ended generation contain an unreliable tail: many individually small token probabilities become dangerous in aggregate, while the globally highest-probability continuations are often too safe, too repetitive, and too unsurprising to match human text. Later work sharpens that picture. Typical decoding reframes the issue in terms of local surprisal, and more recent theory connects the bad tail to systematic modeling error, including limits imposed by the output softmax. The result is a useful lesson about LLMs: generation quality depends not just on what a model knows, but on how inference navigates uncertainty at each step.
Related Work
The central source is Holtzman et al. (2020), "The Curious Case of Neural Text Degeneration." That paper showed that open-ended text generation behaves very differently from tasks like translation, where beam search is often helpful. In story continuation and long-form generation, maximizing likelihood produced text that was too probable, too repetitive, and too narrow compared with human continuations. Their proposed fix was nucleus (top-p) sampling: at each step, truncate the distribution to the smallest set of tokens whose cumulative probability reaches a threshold p, renormalize, and sample within that set.
Meister et al. (2023) pushed the question from "which tokens are likely?" to "which tokens have the right amount of information?" Their typical decoding work argues that human text tends to use tokens whose information content (negative log-probability) stays close to the model's conditional entropy at that step, rather than always choosing the top-probability token. Finlayson et al. (2024) then supplied the missing theory for why truncation works at all. Their account shows that threshold-based truncation can be understood as a way to avoid sampling tokens that may have zero true probability, and that the softmax bottleneck is one plausible source of these tail errors. Welleck et al. (2020) provide a useful complement on the training side: if repetitive continuations receive too much probability mass, one can modify the objective itself with unlikelihood training rather than only repairing the decoder afterward.
Method/Mechanism
The core mechanism starts with a mismatch between sequence likelihood and human quality. In open-ended writing there are many acceptable next moves. Human continuations therefore wander through moderately probable but informative tokens. Beam search does the opposite: it compounds locally high-probability choices into globally dull trajectories. Holtzman et al. showed that this makes generated text less surprising than human text and more likely to fall into repetition loops.
Full ancestral sampling fails in the opposite direction. If one samples from the entire predicted distribution, the tail injects too many low-quality continuations. The problem is not any single token in isolation. It is the aggregate effect of thousands of low-probability candidates whose estimates are noisy. Nucleus sampling addresses exactly that point: keep the high-mass "nucleus" that the model seems confident about, discard the long tail, and preserve randomness only inside the surviving set. Because the cutoff adapts to context, the candidate set stays small in peaked contexts and broadens when many continuations are genuinely plausible.
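The truncation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `nucleus_filter` and `nucleus_sample`, the default `top_p=0.9`, and the two toy distributions are assumptions chosen for the example.

```python
import random

def nucleus_filter(probs, top_p=0.9):
    """Return {token_index: renormalized_prob} for the smallest top-p set."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:  # stop once the nucleus holds enough mass
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distributions (hypothetical values, not model outputs).
peaked = [0.92, 0.03, 0.02, 0.01, 0.01, 0.01]
flat = [0.30, 0.25, 0.20, 0.12, 0.08, 0.05]
print(len(nucleus_filter(peaked)), len(nucleus_filter(flat)))
```

With these toy inputs the nucleus contains a single token in the peaked context and five tokens in the flat one, which is the context-adaptive behavior the paragraph describes.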
Typical decoding offers a more refined intuition. It says that human-like text is not merely high-probability; it is locally typical, meaning each token's surprisal stays close to the model's conditional entropy at that step. Tokens that are too predictable make text collapse into generic loops, while tokens that are too surprising push it into incoherence. Later theory from Finlayson et al. makes the tail story more mechanistic by showing how truncation can be seen as support recovery under bounded error assumptions. Their basis-aware analysis goes further: if output-layer structure causes some low-probability estimates to be wrong for geometric reasons, then smarter truncation could keep good rare tokens while rejecting bad ones, instead of relying on a blunt threshold alone.
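For comparison with probability-ranked truncation, a locally typical filter can be sketched as follows. This is an illustrative reading of the typical decoding idea, assuming tokens are ranked by the distance between their surprisal and the step's entropy and kept until a mass threshold is reached; the name `typical_filter`, the parameter `tau`, and the toy distribution are assumptions.

```python
import math

def typical_filter(probs, tau=0.9):
    """Return {token_index: renormalized_prob} for a locally typical set."""
    # Conditional entropy of the next-token distribution, in nats.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Only tokens with nonzero probability have finite surprisal.
    candidates = [i for i, p in enumerate(probs) if p > 0]
    # Rank by how far each token's surprisal (-log p) is from the entropy.
    ranked = sorted(
        candidates, key=lambda i: abs(-math.log(probs[i]) - entropy)
    )
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= tau:  # stop once the typical set holds enough mass
            break
    return {i: probs[i] / mass for i in kept}

# Toy distribution (hypothetical values): token 0 is the argmax.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(sorted(typical_filter(probs, tau=0.3)))
```

In this toy case with a small `tau`, the kept set is tokens 1 and 2: the most probable token is left out because its surprisal is further from the entropy than theirs. A nucleus-style filter would always keep the argmax first, so the two rules can disagree about which tokens survive.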
Key Findings
Two concrete case studies ground these claims:
- Case study 1: GPT-2 story continuation under beam search. Holtzman et al. show that even with strong context and a capable model, beam search drifts into repeated phrases and overconfident continuations. The key empirical clue is that beam-generated text has lower perplexity than human text under the model, which means the decoder is not failing to find the model's favorite continuation. It is finding it too well.
- Case study 2: unlikelihood training reduces repetition at the model level. Welleck et al. report large drops in repeated n-grams after fine-tuning against repeated tokens and sequences. That matters because it separates two problems: bad decoding can expose distributional flaws, but training objectives can also create or suppress the flaws that decoding later has to manage.
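The token-level idea behind the second case study can be sketched for a single decoding step. This is a simplified sketch, assuming the negative candidates are the previously seen tokens other than the current target; the function name, the weight `alpha`, and the toy probabilities are assumptions, and a real training setup would compute this over batches of model outputs.

```python
import math

def token_unlikelihood_loss(probs, target, previous_tokens, alpha=1.0):
    """Likelihood loss plus a penalty on probability given to repeats.

    Assumes probs[c] < 1 for every negative candidate c.
    """
    # Negative candidates: tokens already in the context, minus the target.
    negatives = set(previous_tokens) - {target}
    # Standard maximum-likelihood term for the correct next token.
    likelihood_term = -math.log(probs[target])
    # Penalty grows as the model puts more mass on already-seen tokens.
    unlikelihood_term = -sum(math.log(1.0 - probs[c]) for c in negatives)
    return likelihood_term + alpha * unlikelihood_term

# Toy step (hypothetical values): the context already contains tokens 2 and 0.
previous = [2, 2, 0]
repeat_heavy = [0.05, 0.30, 0.60, 0.05]  # most mass on the seen token 2
repeat_light = [0.05, 0.30, 0.05, 0.60]  # most mass on the unseen token 3
print(token_unlikelihood_loss(repeat_heavy, 1, previous))
print(token_unlikelihood_loss(repeat_light, 1, previous))
```

Both toy distributions give the target token the same probability, so the ordinary likelihood term is identical. Only the penalty term separates them, and it is larger when the model favors a token it has already produced.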
Four crisp insights follow:
- Likelihood is not the right objective for open-ended decoding. Human text is not usually the argmax continuation, because interesting writing repeatedly chooses informative but non-maximal tokens.
- The dangerous part of the distribution is the tail in aggregate. Many small probability errors become harmful when sampled over long horizons.
- Dynamic truncation works because uncertainty is context-dependent. A fixed top-k cutoff ignores whether the model is currently certain or genuinely undecided.
- Decoder fixes and training fixes target different layers of the same problem. Nucleus sampling repairs inference locally, while unlikelihood training changes which continuations get probability mass in the first place.
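The second insight, that the tail is dangerous in aggregate, can be made concrete with a small calculation. This is a back-of-the-envelope sketch under a loudly simplifying assumption: every step carries the same unreliable tail mass and steps are treated as independent, which real generation does not satisfy. The function name and the 1% figure are hypothetical.

```python
def prob_any_tail_token(tail_mass, steps):
    """Chance of drawing at least one tail token in `steps` draws.

    Assumes a constant tail mass per step and independent draws.
    """
    if not 0.0 <= tail_mass <= 1.0:
        raise ValueError("tail_mass must be in [0, 1]")
    return 1.0 - (1.0 - tail_mass) ** steps

# Hypothetical tail mass of 1% per step, over increasing lengths.
for steps in (1, 50, 200):
    print(steps, round(prob_any_tail_token(0.01, steps), 3))
```

Under these assumptions, a tail that is sampled only 1% of the time at any single step is sampled at least once in roughly 87% of 200-token continuations, which is why truncating it matters more as outputs get longer.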
One alignment-adjacent implication is that refusal quality, honesty, and deliberative behavior may be partly decoder-sensitive. If bad tails contain sycophantic, evasive, or repetitive continuations, then evaluation of aligned behavior should not treat decoding as a neutral wrapper around model competence.
Limitations
This line of work has clear limits. First, much of the classic evidence uses GPT-2-era models. The phenomenon still matters, but the exact failure mix changes as base models improve. Second, open-ended generation is subjective: quality, diversity, and coherence trade off against each other, and no single automatic metric settles the issue. Third, truncation methods remain heuristic. Nucleus sampling often works well, but it can still over-prune in low-entropy contexts or admit the wrong rare tokens in high-entropy ones. Finlayson et al.'s theory is promising here, but their more precise support-aware methods are not yet the default practical decoder.
There is also a conceptual limit. Tail truncation explains how to avoid some decoding failures, but it does not by itself explain why the model assigns too much probability to degenerate loops or generic continuations. That part reaches back into training data, objective choice, and architectural constraints in the model itself.
Future Directions
The most interesting next step is to replace hand-tuned truncation heuristics with decoders that reason directly about calibrated support, local information content, and downstream utility. Another direction is joint training-decoding design: if unlikelihood-style objectives and truncation methods solve complementary parts of the same problem, they should probably be co-designed rather than studied in isolation.
A third direction matters for alignment evaluation. Many safety benchmarks are effectively open-ended generation tasks, so we should ask how much measured harmlessness or truthfulness depends on tail control rather than on internal representation alone. Decoder choice may be part of the policy, not just part of the interface.
Open question: can we build a decoding rule that preserves nucleus sampling's robustness to tail errors while selectively recovering low-probability tokens that are rare for good reasons rather than bad ones?
Summary
Open-ended language generation breaks naive likelihood maximization because good text is not usually the most probable path through a token-level model. Holtzman et al. showed that maximization collapses into repetition and genericness, while full sampling exposes an unreliable tail. Nucleus sampling works because it keeps randomness where the model is confident and cuts it where estimation error aggregates. Typical decoding and later theory refine that story, but they preserve its core lesson: the decoder is an essential part of language-model behavior, not a disposable afterthought.
References
- Primary: Holtzman et al. "The Curious Case of Neural Text Degeneration." ICLR 2020. https://arxiv.org/abs/1904.09751
- Auxiliary: Welleck et al. "Neural Text Generation with Unlikelihood Training." ICLR 2020. https://arxiv.org/abs/1908.04319
- Auxiliary: Meister et al. "Locally Typical Sampling." TACL 2023. https://arxiv.org/abs/2202.00666
- Auxiliary: Finlayson et al. "Closing the Curious Case of Neural Text Degeneration." ICLR 2024. https://arxiv.org/abs/2310.01693