Grokking as a phase transition in transformer training

February 6, 2026 · 12 min read

Grokking is the strange training dynamic where a model first memorizes a small algorithmic dataset and then, after a long plateau, suddenly “gets it” and begins to generalize. It is tempting to treat this as a curiosity, but it is precisely the kind of sharp regime shift that matters for alignment: behavior that looks like rote pattern matching can abruptly become rule-following under small changes in optimization.

Research question: why does a transformer trained on algorithmic data jump from memorization to generalization, and how does weight decay create a phase transition in that shift?

Abstract

This memo focuses on a narrow mechanistic question: the grokking transition in transformer classifiers trained on algorithmic tasks, and the role of weight decay in triggering that transition. In the original grokking study, small transformers trained on modular arithmetic or parity tasks exhibit a long period of near-perfect training accuracy with poor test accuracy, followed by a sudden improvement in test performance without any architectural change. The core mechanism is an optimization trade-off: memorization solutions are easy to reach but have higher weight norms, while algorithmic solutions are harder to reach but have lower norm. Weight decay slowly shifts the optimizer’s preference from the high-norm memorization basin to the low-norm generalization basin, producing the delayed phase change. I summarize the setup, the mechanism, and what this does (and does not) imply for modern LLM behavior.

Related Work

Grokking sits at the intersection of generalization theory and mechanistic interpretability. The original study positions grokking as an extreme version of overfitting dynamics on algorithmic tasks. Classic generalization results show that overparameterized networks can perfectly fit random labels, yet the same networks generalize well on real data, which implies that capacity alone cannot explain generalization and that optimization biases are crucial. Lottery ticket-style results and pruning work also highlight that lower-norm or sparser solutions can behave differently even when the training loss is identical. These threads motivate treating the grokking transition as a structured bias in the optimizer rather than a mysterious emergent property.

Method/Mechanism

The canonical grokking setup trains a small transformer on a simple algorithmic task, such as modular addition. Inputs are tokenized pairs (e.g., “a b”), and the model must predict “a + b mod p.” The training set is small (a fraction of all possible pairs), so the model can memorize the observed examples. Training uses standard cross-entropy loss, and weight decay (L2 regularization) is applied to all parameters.
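As a concrete reference point, the dataset construction can be sketched in a few lines. This is a minimal illustration, not the original study’s code; the function name, seed, and split fraction are my own choices (the 40% fraction matches the modular-addition example later in the post).

```python
import itertools
import random

def make_modular_addition_data(p=97, train_frac=0.4, seed=0):
    """Build the full (a, b) -> (a + b) mod p dataset and split it.

    With train_frac < 1 the model never sees some pairs during training,
    so it can only get them right by learning the underlying group rule.
    """
    pairs = list(itertools.product(range(p), range(p)))
    labeled = [((a, b), (a + b) % p) for a, b in pairs]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    cut = int(train_frac * len(labeled))
    return labeled[:cut], labeled[cut:]

train, test = make_modular_addition_data()
```

The held-out pairs in `test` are exactly the inputs a lookup-table solution can never answer, which is what makes the test curve a clean probe of whether the rule has been learned.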

Two distinct solution families exist. The first is a memorization solution: the model learns a lookup-table-like mapping from seen pairs to outputs. This yields near-perfect training accuracy but weak generalization because the model has not learned the underlying group structure of modular addition. The second is an algorithmic solution: the model learns a latent representation that effectively performs addition in the group, which generalizes to unseen pairs.

The key mechanism is the optimizer’s implicit bias under weight decay. Memorization solutions typically require higher weight norms or more complex parameter configurations. An algorithmic solution can be represented with lower norm, but it is harder to discover in the short term. Weight decay does not immediately “force” the algorithmic solution; instead it exerts a slow, persistent pressure toward low-norm configurations. Over many epochs, this pressure erodes the memorization basin until the model transitions to the algorithmic basin. The result is a phase-change-like transition: training accuracy remains high throughout, while test accuracy abruptly rises once the optimizer crosses the boundary between solution families.
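A back-of-the-envelope calculation shows why the algorithmic basin eventually wins: once both solution families drive the data loss to zero, the regularized objective is decided entirely by the norm term. The norms and decay coefficient below are hypothetical numbers chosen only to illustrate the comparison.

```python
def total_objective(data_loss, weight_norm_sq, weight_decay):
    # The regularized objective the optimizer actually minimizes:
    # data loss plus an L2 penalty on the weights.
    return data_loss + weight_decay * weight_norm_sq

# Hypothetical values: both solutions fit the training data perfectly,
# but the memorizing configuration uses much larger weights.
memorization = total_objective(data_loss=0.0, weight_norm_sq=250.0, weight_decay=1e-3)
algorithmic = total_objective(data_loss=0.0, weight_norm_sq=40.0, weight_decay=1e-3)
```

With any positive weight decay the low-norm basin has strictly lower objective, and with zero decay the two are tied; this is the sense in which weight decay is the tie-breaker rather than the discoverer of the algorithmic solution.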

Key Findings

The central finding is the delayed transition itself, which is easiest to see in two concrete examples.

Example 1: Modular addition. Train a 2-layer transformer on pairs (a, b) with labels a + b mod 97, using only 40% of all possible pairs. The model quickly achieves near-100% training accuracy while test accuracy stays near chance. After many additional epochs with weight decay, test accuracy suddenly rises to near-perfect. The observed jump is consistent with a shift from memorizing seen pairs to internalizing the group structure of addition.
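One way to quantify the delay in curves like these is to compare the epochs at which training and test accuracy cross a threshold. The sketch below uses stylized accuracy curves, not real training logs; the helper name and the shape of the curves are my own illustration.

```python
def grokking_epoch(acc_curve, threshold=0.9):
    """Return the first epoch at which accuracy crosses the threshold.

    A large gap between the crossing epochs of the train and test curves
    is the signature of delayed generalization.
    """
    for epoch, acc in enumerate(acc_curve):
        if acc >= threshold:
            return epoch
    return None

# Stylized curves: train accuracy saturates early, test accuracy jumps late.
train_acc = [0.5, 0.99, 1.0] + [1.0] * 97
test_acc = [0.01] * 80 + [0.3, 0.7, 0.95] + [0.99] * 17
delay = grokking_epoch(test_acc) - grokking_epoch(train_acc)
```

In these made-up curves the model crosses 90% training accuracy at epoch 1 but 90% test accuracy only at epoch 82, giving a delay of 81 epochs; in real grokking runs the gap can span orders of magnitude more optimization steps.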

Example 2: Parity over bitstrings. Use tokenized inputs representing 10-bit binary strings and predict parity. A memorization solution can store seen strings, while an algorithmic solution learns the XOR-like rule. With weight decay and sufficient training time, the model eventually generalizes; without weight decay, the model can remain stuck in the memorization regime indefinitely.
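The parity task is even simpler to set up than modular addition, since the label is just the XOR of all bits. A minimal sketch (function name, seed, and split fraction are again my own choices):

```python
import itertools
import random

def parity_dataset(n_bits=10, train_frac=0.5, seed=0):
    """Label every n-bit string with its parity and split into train/test."""
    strings = list(itertools.product([0, 1], repeat=n_bits))
    labeled = [(bits, sum(bits) % 2) for bits in strings]  # XOR of all bits
    rng = random.Random(seed)
    rng.shuffle(labeled)
    cut = int(train_frac * len(labeled))
    return labeled[:cut], labeled[cut:]

par_train, par_test = parity_dataset()
```

With 10 bits there are only 1,024 strings, so a network can easily memorize the training half; the test half is answerable only by the XOR-like rule.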

Limitations

Grokking results are primarily demonstrated on small algorithmic datasets with exhaustive input spaces and simple rules. It is unclear how directly the same two-basin dynamics apply to LLMs trained on natural language, where the data distribution is open-ended and multi-modal. The role of explicit weight decay in modern LLM training is also mixed; many setups rely on AdamW or other optimizers with decoupled weight decay, and the effective regularization depends on schedule and scale. Finally, grokking is sensitive to dataset size, optimizer settings, and architecture depth, suggesting the phenomenon may be less universal than the original experiments imply.

Future Directions

The most useful next step is to identify measurable grokking analogs in larger models: settings where a model can fit a subset of patterns but has to learn a latent rule to generalize. Probing for delayed generalization in synthetic “language-like” tasks could reveal whether the two-basin story persists at scale. Another direction is to link grokking dynamics to mechanistic features: do circuits corresponding to algorithmic rules appear suddenly at the transition, or do they build up slowly under the hood?

Open question: In modern LLM training, does the implicit regularization from optimizer choice and data scale create grokking-like phase transitions for specific skills, or does the diversity of natural language data smooth away the sharp boundary observed in toy algorithmic tasks?

Summary

Grokking is best understood as a phase transition between two families of solutions that both fit the training data. Weight decay biases optimization toward a lower-norm, more algorithmic representation, but it may take a long time for that bias to dominate. The resulting delayed generalization is a warning sign for alignment work: a model that looks like it is merely memorizing today may abruptly exhibit rule-like generalization later, without any architectural changes or new data.