Grokking as a phase transition in transformer training

February 6, 2026 · 12 min read

Grokking is the strange training dynamic where a model first memorizes a small algorithmic dataset and then, after a long plateau, suddenly “gets it” and begins to generalize. It is tempting to treat this as a curiosity, but it is precisely the kind of sharp regime shift that matters for alignment: behavior that looks like rote pattern matching can abruptly become rule-following under small changes in optimization.

Research question: why does a transformer trained on algorithmic data jump from memorization to generalization, and how does weight decay create a phase transition in that shift?

Abstract

This memo focuses on a narrow mechanistic question: the grokking transition in transformer classifiers trained on algorithmic tasks, and the role of weight decay in triggering that transition. In the original grokking study, small transformers trained on modular arithmetic or parity tasks exhibit a long period of near-perfect training accuracy with poor test accuracy, followed by a sudden improvement in test performance without any architectural change. The core mechanism is an optimization trade-off: memorization solutions are easy to reach but have higher weight norms, while algorithmic solutions are harder to reach but have lower norm. Weight decay slowly shifts the optimizer’s preference from the high-norm memorization basin to the low-norm generalization basin, producing the delayed phase change. I summarize the setup, the mechanism, and what this does (and does not) imply for modern LLM behavior.

Related Work

Grokking sits at the intersection of generalization theory and mechanistic interpretability. The original study positions grokking as an extreme version of overfitting dynamics on algorithmic tasks. Classic generalization results show that overparameterized networks can perfectly fit random labels, yet the same networks generalize well on real data, which implies that capacity alone cannot explain generalization and that optimization biases are crucial. Lottery ticket-style results and pruning work also highlight that lower-norm or sparser solutions can behave differently even when the training loss is identical. These threads motivate treating the grokking transition as a structured bias in the optimizer rather than a mysterious emergent property.

Method/Mechanism

The canonical grokking setup trains a small transformer on a simple algorithmic task, such as modular addition. Inputs are tokenized pairs (e.g., “a b”), and the model must predict “a + b mod p.” The training set is small (a fraction of all possible pairs), so the model can memorize the observed examples. Training uses standard cross-entropy loss, and weight decay (L2 regularization) is applied to all parameters.
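As a concrete reference point, the dataset construction can be sketched in a few lines. This is a minimal illustration, not the original study’s code; the function name, seed, and split fraction are my own choices (the 40% fraction matches the modular-addition example later in the post).

```python
import itertools
import random

def make_modular_addition_data(p=97, train_frac=0.4, seed=0):
    """Build the full (a, b) -> (a + b) mod p dataset and split it.

    With train_frac < 1 the model never sees some pairs during training,
    so it can only get them right by learning the underlying group rule.
    """
    pairs = list(itertools.product(range(p), range(p)))
    labeled = [((a, b), (a + b) % p) for a, b in pairs]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    cut = int(train_frac * len(labeled))
    return labeled[:cut], labeled[cut:]

train, test = make_modular_addition_data()
```

The held-out pairs in `test` are exactly the inputs a lookup-table solution can never answer, which is what makes the test curve a clean probe of whether the rule has been learned.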

Two distinct solution families exist. The first is a memorization solution: the model learns a lookup-table-like mapping from seen pairs to outputs. This yields near-perfect training accuracy but weak generalization because the model has not learned the underlying group structure of modular addition. The second is an algorithmic solution: the model learns a latent representation that effectively performs addition in the group, which generalizes to unseen pairs.

The key mechanism is the optimizer’s implicit bias under weight decay. Memorization solutions typically require higher weight norms or more complex parameter configurations. An algorithmic solution can be represented with lower norm, but it is harder to discover in the short term. Weight decay does not immediately “force” the algorithmic solution; instead it exerts a slow, persistent pressure toward low-norm configurations. Over many epochs, this pressure erodes the memorization basin until the model transitions to the algorithmic basin. The result is a phase-change-like transition: training accuracy remains high throughout, while test accuracy abruptly rises once the optimizer crosses the boundary between solution families.
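A back-of-the-envelope calculation shows why the algorithmic basin eventually wins: once both solution families drive the data loss to zero, the regularized objective is decided entirely by the norm term. The norms and decay coefficient below are hypothetical numbers chosen only to illustrate the comparison.

```python
def total_objective(data_loss, weight_norm_sq, weight_decay):
    # The regularized objective the optimizer actually minimizes:
    # data loss plus an L2 penalty on the weights.
    return data_loss + weight_decay * weight_norm_sq

# Hypothetical values: both solutions fit the training data perfectly,
# but the memorizing configuration uses much larger weights.
memorization = total_objective(data_loss=0.0, weight_norm_sq=250.0, weight_decay=1e-3)
algorithmic = total_objective(data_loss=0.0, weight_norm_sq=40.0, weight_decay=1e-3)
```

With any positive weight decay the low-norm basin has strictly lower objective, and with zero decay the two are tied; this is the sense in which weight decay is the tie-breaker rather than the discoverer of the algorithmic solution.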

Key Findings

The central finding is the delayed transition itself, which is easiest to see in two concrete examples.

Example 1: Modular addition. Train a 2-layer transformer on pairs (a, b) with labels a + b mod 97, using only 40% of all possible pairs. The model quickly achieves near-100% training accuracy while test accuracy stays near chance. After many additional epochs with weight decay, test accuracy suddenly rises to near-perfect. The observed jump is consistent with a shift from memorizing seen pairs to internalizing the group structure of addition.
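One way to quantify the delay in curves like these is to compare the epochs at which training and test accuracy cross a threshold. The sketch below uses stylized accuracy curves, not real training logs; the helper name and the shape of the curves are my own illustration.

```python
def grokking_epoch(acc_curve, threshold=0.9):
    """Return the first epoch at which accuracy crosses the threshold.

    A large gap between the crossing epochs of the train and test curves
    is the signature of delayed generalization.
    """
    for epoch, acc in enumerate(acc_curve):
        if acc >= threshold:
            return epoch
    return None

# Stylized curves: train accuracy saturates early, test accuracy jumps late.
train_acc = [0.5, 0.99, 1.0] + [1.0] * 97
test_acc = [0.01] * 80 + [0.3, 0.7, 0.95] + [0.99] * 17
delay = grokking_epoch(test_acc) - grokking_epoch(train_acc)
```

In these made-up curves the model crosses 90% training accuracy at epoch 1 but 90% test accuracy only at epoch 82, giving a delay of 81 epochs; in real grokking runs the gap can span orders of magnitude more optimization steps.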

Example 2: Parity over bitstrings. Use tokenized inputs representing 10-bit binary strings and predict parity. A memorization solution can store seen strings, while an algorithmic solution learns the XOR-like rule. With weight decay and sufficient training time, the model eventually generalizes; without weight decay, the model can remain stuck in the memorization regime indefinitely.
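The parity task is even simpler to set up than modular addition, since the label is just the XOR of all bits. A minimal sketch (function name, seed, and split fraction are again my own choices):

```python
import itertools
import random

def parity_dataset(n_bits=10, train_frac=0.5, seed=0):
    """Label every n-bit string with its parity and split into train/test."""
    strings = list(itertools.product([0, 1], repeat=n_bits))
    labeled = [(bits, sum(bits) % 2) for bits in strings]  # XOR of all bits
    rng = random.Random(seed)
    rng.shuffle(labeled)
    cut = int(train_frac * len(labeled))
    return labeled[:cut], labeled[cut:]

par_train, par_test = parity_dataset()
```

With 10 bits there are only 1,024 strings, so a network can easily memorize the training half; the test half is answerable only by the XOR-like rule.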

Limitations

Grokking results are primarily demonstrated on small algorithmic datasets with exhaustive input spaces and simple rules. It is unclear how directly the same two-basin dynamics apply to LLMs trained on natural language, where the data distribution is open-ended and multi-modal. The role of explicit weight decay in modern LLM training is also mixed; many setups rely on AdamW or other optimizers with decoupled weight decay, and the effective regularization depends on schedule and scale. Finally, grokking is sensitive to dataset size, optimizer settings, and architecture depth, suggesting the phenomenon may be less universal than the original experiments imply.

Future Directions

The most useful next step is to identify measurable grokking analogs in larger models: settings where a model can fit a subset of patterns but has to learn a latent rule to generalize. Probing for delayed generalization in synthetic “language-like” tasks could reveal whether the two-basin story persists at scale. Another direction is to link grokking dynamics to mechanistic features: do circuits corresponding to algorithmic rules appear suddenly at the transition, or do they build up slowly under the hood?

Open question: In modern LLM training, does the implicit regularization from optimizer choice and data scale create grokking-like phase transitions for specific skills, or does the diversity of natural language data smooth away the sharp boundary observed in toy algorithmic tasks?

Summary

Grokking is best understood as a phase transition between two families of solutions that both fit the training data. Weight decay biases optimization toward a lower-norm, more algorithmic representation, but it may take a long time for that bias to dominate. The resulting delayed generalization is a warning sign for alignment work: a model that looks like it is merely memorizing today may abruptly exhibit rule-like generalization later, without any architectural changes or new data.