Why next-token prediction creates the reversal curse

Abstract

This memo examines a narrow but revealing failure mode in language models: if a model learns a fact in the form "A is B," why does it often fail to answer the reversed query "Who is B?" Berglund et al. (2023) named this the reversal curse. The phenomenon matters because it separates surface fluency from a more basic kind of relational generalization. A model that truly stored a reversible association between an entity and a description should often recover either side when prompted from the other. Yet autoregressive LLMs frequently do not. My reading of the literature is that this is not just a quirky benchmark artifact. It exposes a structural bias of next-token prediction: the training objective strongly rewards modeling p(B|A) when text is written in that direction, but gives much weaker pressure to model p(A|B) unless the reverse pattern also appears in data or context. The reversal curse therefore says something fundamental about what a causal LM learns from text: it learns directional conditionals extremely well, but it does not automatically convert them into symmetric, query-robust knowledge.

Related Work

The primary source is Berglund et al. (2023), which demonstrated the effect both in controlled finetuning experiments with fictitious facts and in prompting experiments on real-world celebrity relations. Their key finding was stark: models can answer "Who is Tom Cruise's mother?" much more reliably than "Who is Mary Lee Pfeiffer's son?" even though both names are individually known and the relation is conceptually simple.

Two follow-up lines matter for interpretation. First, Grosse et al. (2023) studied generalization with influence functions and found that training-example influence drops sharply when key phrase order is flipped, which independently supports the idea that LLM knowledge is often stored in an order-sensitive way rather than as an abstract bidirectional rule. Second, Zhang et al. (2024) argued that the training objective itself is central: standard next-token prediction amplifies the curse, while more bidirectional objectives such as autoregressive blank infilling can reduce it. Golovneva et al. (2024) then pushed the mitigation idea further with reverse training, explicitly exposing the model to both forward and reversed strings while preserving entity boundaries.

Together, these papers shift the question from "Do models sometimes fail on reverse lookup?" to "What kind of representation is induced by causal language modeling, and why is it so asymmetric?"

Method/Mechanism

The core mechanism is easiest to see probabilistically. If pretraining repeatedly contains sentences such as "Valentina Tereshkova was the first woman to travel to space," then next-token prediction receives many gradient updates that improve continuations of the prefix "Valentina Tereshkova was...". Continuations of a description-first prefix such as "the first woman to..." improve only insofar as that phrasing itself appears often enough in the data. The objective never directly says: from the fact that one string predicts the other, infer the reverse conditional by logical closure.
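
A compact way to write this down, as a sketch in standard notation (the symbols x_t, theta, A, and B are labels chosen here, not taken from the cited papers):

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),
\qquad
p(A \mid B) = \frac{p(B \mid A)\, p(A)}{\sum_{A'} p(B \mid A')\, p(A')}
```

For a sentence in which the tokens of A precede the tokens of B, the loss on the left contains terms that supervise p(B|A), but no term of the form log p(A|B). The identity on the right shows that the reverse conditional is in principle determined by the forward one, but only through a sum over every alternative A', and nothing in a single left-to-right prediction step computes or enforces that sum.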

In other words, causal LMs are excellent at sequence continuation, not guaranteed to build a relation-centric database. If entity A usually appears before descriptor B in text, then training concentrates probability mass on forward retrieval. Reverse retrieval requires either explicit evidence, a separate learned schema for inverse relations, or enough in-context structure for the model to reason on the fly. Berglund et al. showed the last case can happen: when "A is B" is placed directly in context, models often can answer "Who is B?" correctly. That distinction is important. The failure is not pure incapacity; it is a failure of what gets stored and made directly queryable by pretraining.
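
The contrast between sequence continuation and a relation-centric store can be illustrated with a deliberately crude toy. The sketch below assumes a count-based prefix table in place of a neural model, joins multi-word names with underscores to keep tokenization trivial, and adds a second, made-up fact (Mira_Solane and Glass_Harbor are hypothetical). A real LM generalizes across prefixes in ways a lookup table cannot, so this shows only what the objective directly supervises.

```python
from collections import Counter, defaultdict


def train_prefix_counts(sentences):
    """Count next-token frequencies for every left-to-right prefix."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens)):
            table[tuple(tokens[:i])][tokens[i]] += 1
    return table


def next_token_prob(table, prefix, token):
    """Relative frequency of `token` after `prefix`; 0.0 for unseen prefixes."""
    counts = table.get(tuple(prefix.split()))
    if not counts:
        return 0.0
    return counts[token] / sum(counts.values())


# Forward-only corpus: every fact is written name first, description second.
corpus = [
    "Uriah_Hawthorne is the composer of Abyssal_Melodies",
    "Mira_Solane is the composer of Glass_Harbor",  # hypothetical second fact
]
table = train_prefix_counts(corpus)

# Trained direction: the prefix was seen, so the continuation is supervised.
forward = next_token_prob(
    table, "Uriah_Hawthorne is the composer of", "Abyssal_Melodies"
)
# Reverse direction: this prefix never occurred, so nothing was learned for it.
reverse = next_token_prob(
    table, "The composer of Abyssal_Melodies is", "Uriah_Hawthorne"
)

# A relation-centric store, by contrast, is queryable from either side.
facts = [("Uriah_Hawthorne", "composer_of", "Abyssal_Melodies")]
by_object = {(rel, obj): subj for subj, rel, obj in facts}

print(forward, reverse, by_object[("composer_of", "Abyssal_Melodies")])
```

The same fact is present in both structures, but only the relation store exposes it to a query that starts from the description.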

This also explains why the curse is stronger for rare relations and long-tail entities. If the web rarely phrases a fact in both directions, then the model has little incentive to compress both sides into one abstract reversible representation. The learned knowledge remains tied to textual direction, lexical surface form, and retrieval cueing.

Key Findings

Two case studies make the mechanism concrete:

Case study 1: fictitious fact finetuning. Berglund et al. fine-tuned models on synthetic statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and then tested both the trained direction and the reversed question. Models learned the trained direction but were often near chance on the reverse query. This is strong evidence because it strips away confounds from pretraining knowledge and world familiarity.
Case study 2: objective-level mitigation. Zhang et al. report that GLM models trained with autoregressive blank infilling, where masked tokens can use both left and right context, are much more robust on reversed retrieval than causal next-token models trained on the same type of data. That result points away from "LLMs just do not understand inverse relations" and toward "the training objective determines which directions become directly accessible."
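
To make the second case study more concrete, here is a simplified sketch of how a blank-infilling training example could be built from a forward-written sentence. The function name, the mask string, and the whitespace tokenization are assumptions for illustration; this is not the exact input format used by GLM.

```python
def make_infilling_example(tokens, span_start, span_end, mask="[MASK]"):
    """Return (corrupted_input, target_span) with one contiguous span masked."""
    if not 0 <= span_start < span_end <= len(tokens):
        raise ValueError("span must be a non-empty range inside the sequence")
    corrupted = tokens[:span_start] + [mask] + tokens[span_end:]
    target = tokens[span_start:span_end]
    return corrupted, target


tokens = "Uriah Hawthorne is the composer of Abyssal Melodies".split()

# Masking the name: the target must be predicted from the description
# that sits to its RIGHT in the original sentence.
name_masked = make_infilling_example(tokens, 0, 2)

# Masking the title: the usual left-to-right direction.
title_masked = make_infilling_example(tokens, 6, 8)

print(name_masked)
print(title_masked)
```

When the name span is masked, a forward-written sentence yields direct supervision for predicting A from B, which a purely causal objective never receives from that sentence.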

Five crisp insights follow from the literature:

Knowledge in LLMs is often conditional, not relational. The model may know how to continue a familiar string without storing an explicit reversible fact object.
Autoregressive training bakes in directional asymmetry. If text usually presents facts one way, learning pressure is concentrated in that same direction.
Reverse success in context does not imply reverse knowledge in weights. Models can reason over an explicit sentence they just saw, while still failing to retrieve the reverse relation from parametric memory.
The curse stems partly from the training objective, not merely from data scarcity. Bidirectional or reverse-aware objectives alleviate the problem without needing entirely new capabilities.
This is a retrieval-shape problem with alignment consequences. Systems that appear factual may still be brittle when users query the same fact through an unfamiliar access path.
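
The reverse-aware idea mentioned above can be sketched as a data transformation in the spirit of the reverse training cited in Related Work: reverse the string, but keep each entity's tokens in their original internal order. The details here are assumptions: entity spans are supplied by hand as (start, end) index pairs, tokens are whitespace-split words, and the function name is hypothetical.

```python
def reverse_preserving_entities(tokens, entity_spans):
    """Reverse token order while keeping each entity span internally intact.

    `entity_spans` holds (start, end) index pairs, end exclusive, and is
    assumed to be non-overlapping.
    """
    span_end_by_start = dict(entity_spans)
    units = []
    i = 0
    while i < len(tokens):
        # An entity span becomes one unit; any other token is its own unit.
        end = span_end_by_start.get(i, i + 1)
        if end <= i or end > len(tokens):
            raise ValueError(f"invalid entity span starting at {i}")
        units.append(tokens[i:end])
        i = end
    return [tok for unit in reversed(units) for tok in unit]


tokens = "Uriah Hawthorne is the composer of Abyssal Melodies".split()
reversed_tokens = reverse_preserving_entities(tokens, [(0, 2), (6, 8)])

# A training set under this scheme would contain both strings.
print(" ".join(tokens))
print(" ".join(reversed_tokens))
```

In the reversed string the title precedes the name, so ordinary next-token prediction on it supplies the description-to-name direction that the original sentence lacks.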

A useful extra observation is that the reversal curse is not really about simple logic in isolation. It is about whether pretraining creates representations that are stable under paraphrase and inversion. That puts it closer to robustness and knowledge access than to textbook deductive reasoning.

Limitations

The benchmark setups are intentionally clean, which is a strength for causal diagnosis but a weakness for external validity. Many real prompts include multiple cues, not just a bare inverse relation, and those cues can rescue performance. So the curse should not be read as "LLMs cannot infer inverses at all." It is better read as: absent supporting context, parametric retrieval is poorly conditioned on the reverse query.

There is also an ambiguity between representation and search. A model might contain all the ingredients needed for reverse lookup, but fail because decoding does not find the right path reliably. The objective-based mitigation results make me think the storage story is primary, but retrieval dynamics still matter. Finally, most published experiments concern binary relation pairs or short descriptive facts. Harder structures, like compositional relations or multi-hop inverse retrieval, remain less explored.

Future Directions

One promising direction is to separate three candidate causes more carefully: asymmetric data frequency, asymmetric objective pressure, and asymmetric entity representation. Objective-level mitigations already suggest that bidirectional prediction helps, but it remains unclear how much of the gain comes from better relational abstraction versus simply seeing more reverse-format strings.

Another direction is mechanistic: where exactly is the asymmetry encoded? Is the relevant limitation in early token-level binding, in mid-layer factual recall, or near the output head where one query form maps more easily to a name than the other? That is the kind of question circuit-level analysis could make precise.

For alignment, the important application is evaluation. Truthfulness audits should not only ask whether a model knows a fact under the most common wording; they should ask whether knowledge survives changes in direction, paraphrase, and entity-description order. A model that fails those tests is epistemically more brittle than raw benchmark accuracy suggests.
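
A direction-sensitivity audit of this kind could start from something as small as the sketch below. Everything named here is hypothetical: FactPair, direction_gap, and the answer_fn callable (a stand-in for whatever model interface is under test). Scoring by case-insensitive substring match is a crude assumption that a real audit would replace with a stricter grader.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class FactPair:
    forward_prompt: str
    forward_answer: str
    reverse_prompt: str
    reverse_answer: str


def direction_gap(
    pairs: Iterable[FactPair], answer_fn: Callable[[str], str]
) -> Dict[str, float]:
    """Compare accuracy on forward prompts against their reversed versions."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one fact pair")

    def correct(prompt: str, gold: str) -> bool:
        # Crude scoring: the gold string appears anywhere in the model output.
        return gold.lower() in answer_fn(prompt).lower()

    fwd = sum(correct(p.forward_prompt, p.forward_answer) for p in pairs)
    rev = sum(correct(p.reverse_prompt, p.reverse_answer) for p in pairs)
    n = len(pairs)
    return {"forward_acc": fwd / n, "reverse_acc": rev / n, "gap": (fwd - rev) / n}


# Stub standing in for a model that only "knows" the forward direction.
def stub_model(prompt: str) -> str:
    known = {"Who is Tom Cruise's mother?": "Mary Lee Pfeiffer"}
    return known.get(prompt, "I don't know")


pairs = [
    FactPair(
        forward_prompt="Who is Tom Cruise's mother?",
        forward_answer="Mary Lee Pfeiffer",
        reverse_prompt="Who is Mary Lee Pfeiffer's son?",
        reverse_answer="Tom Cruise",
    )
]
report = direction_gap(pairs, stub_model)
print(report)
```

Reporting the gap alongside forward accuracy makes the brittleness visible instead of letting the more common wording stand in for the fact itself.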

Open question: can we design a training objective that preserves the scaling and generation advantages of causal next-token modeling while inducing genuinely relation-centric, order-robust factual representations?

Summary

The reversal curse is one of the clearest signs that LLM knowledge is often organized around directional prediction rather than abstract relational access. Berglund et al. exposed the failure behavior, influence-function work reinforced that it reflects real generalization asymmetry, and later mitigation papers showed that changing the training objective can reduce the effect. The deeper lesson is not that models are bad at a party trick. It is that next-token prediction alone does not force knowledge to be query-invariant. For alignment and reliability, that matters: users do not always ask facts in the direction the internet happened to write them.

References

Primary: Berglund et al. "The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'." 2023. https://arxiv.org/abs/2309.12288
Auxiliary: Grosse et al. "Studying Large Language Model Generalization with Influence Functions." 2023. https://arxiv.org/abs/2308.03296
Auxiliary: Zhang et al. "An Analysis and Mitigation of the Reversal Curse." EMNLP 2024. https://aclanthology.org/2024.emnlp-main.754/
Auxiliary: Golovneva et al. "Reverse Training to Nurse the Reversal Curse." COLM 2024. https://openreview.net/forum?id=HDkNbfLQgu