Why muP makes hyperparameters transfer across scale

Abstract

One of the least glamorous bottlenecks in large language model research is hyperparameter search. Learning rates, initialization scales, and scheduler choices that look harmless in a 50M-parameter transformer can fail badly at 5B. The narrow question in this memo is why maximal update parameterization, or muP, often makes those choices transfer across width. The core claim from Yang et al. is that hyperparameter instability is not merely a nuisance of scale; it is a consequence of using a parameterization in which different parts of the network learn at incompatible rates as width grows. muP changes the scaling rules so that feature updates, logits, and residual pathways remain balanced in the wide-model limit. That is why tuning on a small proxy can predict what will work on a larger target. The deeper lesson is that optimizer settings are only meaningful relative to a scaling regime. muP matters because it reframes hyperparameter transfer as a property of training dynamics, not as folklore about picking a conservative learning rate.

Related Work

The primary source is Yang et al. (2022), "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." That paper introduces the practical muTransfer recipe: parameterize the target network in muP, tune on a smaller model, and reuse those hyperparameters on the larger one without another expensive sweep. The empirical case is strong because the recipe was tested on the settings practitioners care about most: transformers and large-scale language-model pretraining.

The conceptual precursor is Yang and Hu (2021), "Feature Learning in Infinite-Width Neural Networks." That work argues that common parameterizations either collapse toward lazy kernel behavior or produce badly imbalanced learning as width grows. The point is not merely asymptotic elegance. If the infinite-width limit suppresses useful feature updates, then width scaling will scramble the meaning of a fixed learning rate. muP emerges from that broader attempt to preserve feature learning rather than freezing the network into a near-linear regime.

A useful later extension is Blake et al. (2024) on u-muP, which combines muP with unit scaling to make implementation more robust, especially in low precision. The broader message is that the original idea was right but operationally fiddly: real training stacks still need careful engineering around initialization, width definitions, and numerical ranges.

Method/Mechanism

Why do hyperparameters stop transferring when width changes? In standard parameterization, different tensors effectively see different update magnitudes as the network gets wider. Some layers become too lazy, barely changing their learned features; others become too reactive, so the same nominal learning rate causes outsized functional movement. The optimizer step is the same in parameter space, but not in function space.
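The gap between parameter space and function space can be seen in a toy readout layer. The sketch below is an illustration under simple assumptions (a single linear readout y = w @ h, hidden features with order-one entries, and a loss gradient fixed at 1); the function name and the 1/width learning-rate correction are illustrative choices for this toy, not the full muP rule set.

```python
import numpy as np

def readout_shift(width, lr, seed=0):
    """Change in y = w @ h after one SGD step on w, with dLoss/dy fixed at 1."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)            # hidden features, entries of order one
    w = rng.standard_normal(width) / width    # readout weights (init scale is illustrative)
    grad_w = 1.0 * h                          # dLoss/dw = dLoss/dy * h
    w_new = w - lr * grad_w                   # plain SGD step in parameter space
    return float((w_new - w) @ h)             # equals -lr * ||h||^2, roughly -lr * width

base_width, base_lr = 1024, 1e-3
for width in (1024, 4096, 16384):
    fixed = readout_shift(width, base_lr)
    rescaled = readout_shift(width, base_lr * base_width / width)
    print(f"width={width:6d}  fixed lr: {fixed:9.3f}  lr scaled by 1/width: {rescaled:7.3f}")
```

With a fixed learning rate the output moves by an amount proportional to width, because the step is multiplied by the squared norm of h. Dividing the learning rate by the width ratio keeps the output movement roughly constant in this toy setting.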

muP addresses this by choosing width-dependent scaling rules for initialization and learning rates so that updates remain order-one in the parts of the network that matter. In loose terms, it tries to keep the network in a regime where increasing width adds representational capacity without changing the basic "meaning" of an optimizer step. That is why a sweep on a small proxy model can stay predictive: the proxy and the target are no longer different optimization problems in disguise.
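To make "width-dependent scaling rules" concrete, here is a minimal sketch of one commonly quoted form of the rules for Adam, expressed relative to a tuned base width. The helper name, the three layer categories, and the exact exponents are assumptions for illustration and should be checked against the tables in Yang et al. before use; SGD uses different learning-rate exponents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerScale:
    init_std: float
    lr: float

def scale_for_width(kind, base_std, base_lr, width, base_width):
    """Rescale (init std, Adam lr) tuned at base_width for a model of a new width.

    Assumed rules (hypothetical helper, Adam only):
      input  weights: fan_in does not grow with width, so nothing changes
      hidden weights: std shrinks like 1/sqrt(m), lr shrinks like 1/m
      output weights: std shrinks like 1/m,       lr shrinks like 1/m
    where m = width / base_width.
    """
    m = width / base_width
    if kind == "input":
        return LayerScale(base_std, base_lr)
    if kind == "hidden":
        return LayerScale(base_std / m ** 0.5, base_lr / m)
    if kind == "output":
        return LayerScale(base_std / m, base_lr / m)
    raise ValueError(f"unknown layer kind: {kind!r}")

for kind in ("input", "hidden", "output"):
    print(kind, scale_for_width(kind, base_std=0.02, base_lr=1e-3, width=1024, base_width=256))
```

The design point is that each tensor is classified by how its fan-in grows with width, and its initialization and learning rate are rescaled accordingly, while the base values tuned on the proxy stay untouched.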

There is also a useful theoretical contrast with NTK-style scaling. Under the NTK parameterization, the infinite-width limit is stable because the model behaves almost linearly around initialization: the hidden features barely move, and training behaves like learning with a fixed kernel. That makes the dynamics analytically clean, but it suppresses feature learning. muP instead aims for maximal feature learning while preserving stable large-width behavior. In other words, it tries to keep the network plastic rather than merely well behaved.
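The contrast can be checked numerically on a two-layer linear network. The sketch below is a toy under stated assumptions: f = out_mult * v @ h with h = W x / sqrt(d), one SGD step on W only, and a loss gradient fixed at 1. The "ntk" and "mup" settings are illustrative choices of output multiplier and step size (1/sqrt(width) with a constant step, versus 1/width with a step proportional to width), not a reproduction of any paper's code.

```python
import numpy as np

def feature_shift(width, scheme, lr=0.1, d=16, seed=0):
    """RMS change of hidden pre-activations h = W x / sqrt(d) after one SGD step on W."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)             # one input example
    v = rng.standard_normal(width)         # readout weights, held fixed here
    W = rng.standard_normal((width, d))    # first-layer weights
    if scheme == "ntk":
        out_mult, step = 1.0 / np.sqrt(width), lr
    elif scheme == "mup":
        out_mult, step = 1.0 / width, lr * width
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    # f = out_mult * v @ h, so with dLoss/df = 1: dLoss/dW = out_mult * outer(v, x) / sqrt(d)
    grad_W = out_mult * np.outer(v, x) / np.sqrt(d)
    W_new = W - step * grad_W
    dh = (W_new - W) @ x / np.sqrt(d)      # per-coordinate change in hidden features
    return float(np.sqrt(np.mean(dh ** 2)))

for width in (256, 1024, 4096):
    print(width, feature_shift(width, "ntk"), feature_shift(width, "mup"))
```

In this toy, the NTK-style feature change shrinks roughly like 1/sqrt(width), while the muP-style change stays at a width-independent scale, which is the lazy versus feature-learning distinction in miniature.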

For transformers, that matters because widening attention and MLP channels should ideally give the model more room to learn richer features, not force a retuning of every optimizer knob. muP says that if the scaling is chosen correctly, width should mostly change capacity, while good hyperparameters should remain close to invariant. That is the mechanism behind zero-shot transfer across width.
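The transfer workflow this enables can be sketched in a few lines. Everything here is hypothetical scaffolding: tune_on_proxy and toy_mup_loss are illustrative names, and the toy loss simply builds in the assumption that the best learning rate does not move with width, which is the behavior muP is meant to approximate. A real run would replace toy_mup_loss with an actual training job.

```python
import math
from typing import Callable, Sequence

def tune_on_proxy(train_eval: Callable[[int, float], float],
                  proxy_width: int,
                  lrs: Sequence[float]) -> float:
    """Sweep learning rates on the small proxy and return the best one."""
    losses = {lr: train_eval(proxy_width, lr) for lr in lrs}
    return min(losses, key=losses.get)

def toy_mup_loss(width: int, lr: float) -> float:
    # Hypothetical stand-in for "train at this width and lr, return validation loss".
    # The optimum is pinned at lr = 1e-2 for every width (assumed, not measured);
    # wider models only get a lower loss floor.
    return (math.log10(lr) + 2.0) ** 2 + 1.0 / math.sqrt(width)

grid = [10.0 ** e for e in (-4, -3, -2, -1)]
best_lr = tune_on_proxy(toy_mup_loss, proxy_width=256, lrs=grid)
target_loss = toy_mup_loss(8192, best_lr)   # single run at target width, no new sweep
print(best_lr, target_loss)
```

The expensive sweep happens only at the proxy width; the target model is trained once with the selected value.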

Key Findings

Two case studies make the argument concrete:

Case study 1: BERT-large from a much smaller proxy. Yang et al. report that hyperparameters transferred from a 13M-parameter proxy model can beat published BERT-large results. The conceptual point matters more than the benchmark number: if a small model can predict a 350M-parameter model's preferred learning-rate region, then the parameterization has preserved the geometry of optimization surprisingly well.
Case study 2: GPT-style scaling without a fresh sweep. The same paper reports transfer from a 40M-parameter proxy to a 6.7B-parameter GPT-3-style model, with a reported tuning cost of about 7% of the total pretraining cost. This is the kind of result that makes muP interesting for LLM training specifically, because the gap between exploratory runs and full training runs is exactly where research budgets usually get burned.

Four crisp insights follow:

Hyperparameter fragility is often a parameterization problem. Width changes break transfer when they silently change which features can move.
The useful invariant is functional update scale, not raw parameter step size. Equal parameter deltas can correspond to very different behavioral changes as width grows.
muP preserves feature learning where NTK-style scaling tends to linearize it away. That makes it more relevant for pretrained transformers than purely lazy limits.
Proxy tuning works only when the proxy and target inhabit the same optimization regime. muP is valuable because it makes that statement closer to true rather than treating transfer as luck.

One alignment-adjacent implication is easy to miss. Safety interventions often depend on comparing many training runs under small recipe changes. If width scaling changes the interpretation of the optimizer, then observed safety effects may partly reflect accidental optimization-regime shifts rather than the intervention itself. muP offers a cleaner way to compare scaling experiments because it reduces one source of confounding.

Limitations

muP is not magic. First, its cleanest promise is widthwise transfer, not universal transfer across depth, data mixtures, architectures, optimizers, or regularization choices. If the larger model differs in more ways than width alone, hyperparameter portability can still fail. Second, the theory is more elegant than the implementation. Real training stacks need explicit base shapes, consistent width definitions, and careful handling of tied embeddings and residual branches. That friction partly explains why muP is respected but not yet universal.

There is also a scope issue. The central evidence shows that many optimal hyperparameters become more stable, not that all of them do. Schedule length, batch size, regularization, and data curriculum can still interact with scale in ways muP does not erase. Finally, some of the broader tensor-programs intuition remains hard for practitioners to audit. When a method depends on subtle scaling identities, debugging a mis-specified setup can become difficult enough that teams retreat to brute-force sweeps anyway.

Future Directions

The most important next step is broadening the invariance story beyond width. Large language model practice now changes depth, sequence length, optimizer families, precision, and MoE structure all at once. A parameterization that handles width but not those other axes still leaves a lot of tuning on the table. Recent work on Depth-muP (Yang et al., 2023) and optimizer-specific extensions suggests progress, but it is not yet a single unified recipe for modern transformer stacks.

Another direction is interpretability of optimization itself. muP says that good scaling preserves comparable feature updates across model sizes. That invites a more direct empirical question: can we measure that invariance in activation space or residual-stream geometry, instead of inferring it from loss curves and validation sweeps? If yes, hyperparameter transfer could become easier to diagnose and perhaps partially automatable.
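One simple way to pose that measurement is to record how much a layer's activations change over a training step at several widths, then fit the exponent of that change against width. The sketch below assumes the activations have already been captured as arrays; rms_change and width_exponent are hypothetical helper names, and the numbers in the demo are synthetic, not measurements.

```python
import numpy as np

def rms_change(before, after):
    """Root-mean-square per-coordinate change between two activation arrays."""
    before, after = np.asarray(before), np.asarray(after)
    return float(np.sqrt(np.mean((after - before) ** 2)))

def width_exponent(widths, stats):
    """Slope of log(stat) against log(width); near 0 means width-invariant updates."""
    slope, _intercept = np.polyfit(np.log(widths), np.log(stats), 1)
    return float(slope)

widths = [256, 1024, 4096]
# Synthetic examples only: a statistic that decays like 1/sqrt(width) versus a flat one.
lazy_like = [0.1 / np.sqrt(w) for w in widths]
invariant_like = [0.1 for _ in widths]
print(width_exponent(widths, lazy_like), width_exponent(widths, invariant_like))
```

An exponent near zero would be consistent with comparable feature updates across sizes, while a clearly negative exponent would point toward lazy behavior and a positive one toward instability as width grows.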

Open question: can we derive a scaling recipe that preserves hyperparameter transfer simultaneously across width, depth, optimizer choice, and sparse architectures without reintroducing a lazy-learning regime?

Summary

muP works because it treats width scaling as an optimization-dynamics problem rather than a purely architectural one. Yang et al. showed that, with the right scaling rules, a small proxy transformer can predict learning-rate choices for a much larger target. The mechanism is that feature updates remain balanced instead of drifting toward either instability or laziness as width grows. The broader lesson is that hyperparameters are not portable by default; they become portable when the parameterization makes model size preserve the meaning of a training step.

References

Primary: Yang et al. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." NeurIPS 2021 / arXiv 2022. https://arxiv.org/abs/2203.03466
Auxiliary: Yang and Hu. "Feature Learning in Infinite-Width Neural Networks." ICML 2021. https://arxiv.org/abs/2011.14522
Auxiliary: Yang et al. "Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks." arXiv 2023. https://arxiv.org/abs/2310.02244
Auxiliary: Blake et al. "u-muP: The Unit-Scaled Maximal Update Parametrization." OPT 2024. https://arxiv.org/abs/2407.17465