Why SwiGLU replaced standard Transformer MLPs

Abstract

A surprisingly durable design choice in modern language models is that many of them no longer use a plain ReLU or GELU feed-forward block. Instead, they use a gated variant such as SwiGLU. The narrow question in this memo is why that change keeps surviving contact with scale. My reading is that SwiGLU helps because it turns the Transformer MLP from a simple pointwise expansion-and-compression block into a lightweight conditional-computation module. One pathway proposes a feature to write into the residual stream, and a second pathway decides how strongly that feature should pass through. The smooth Swish gate then makes that routing decision without the brittle thresholding of ReLU. The result is not merely a parameter-count artifact. Even under compute-matched comparisons, gated MLPs seem to use width more selectively and make each FFN update more context dependent.

Related Work

The central source is Shazeer (2020), which tested several gated feed-forward variants inside a T5-style Transformer and found that GEGLU and SwiGLU beat both ReLU and GELU on pretraining perplexity. That result matters because the comparison was matched on both parameter count and computation: the gated blocks use three weight matrices instead of two, so Shazeer shrank the hidden width to two-thirds of the baseline to keep the total budget fixed. The improvement therefore cannot be dismissed as simply "more parameters."
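The matching arithmetic can be checked directly. The sketch below counts weight parameters for a bias-free FFN with two matrices and for a gated FFN with three. The sizes (model width 768, baseline hidden width 3072) are illustrative assumptions, and `ffn_params` is a hypothetical helper written for this memo, not code from the paper.

```python
def ffn_params(d_model: int, d_ff: int, gated: bool) -> int:
    """Count weight parameters in a bias-free feed-forward block."""
    # Ungated: up-projection and down-projection.
    # Gated: gate projection, candidate projection, and down-projection.
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff


d_model = 768              # assumed model width
d_ff_plain = 4 * d_model   # assumed baseline hidden width (3072)
d_ff_gated = 2 * d_ff_plain // 3  # two-thirds of the baseline (2048)

plain = ffn_params(d_model, d_ff_plain, gated=False)
gated = ffn_params(d_model, d_ff_gated, gated=True)
print(f"ungated: {plain:,} weights, gated: {gated:,} weights")
```

With a two-thirds hidden width the two counts come out equal. Because the cost of each matrix multiply scales with the size of its weight matrix, the per-token multiply-adds match as well.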

A useful precursor is Dauphin et al. (2017), which introduced GLU in gated convolutional language models and argued that multiplicative gates create a more direct gradient path than conventional nonlinearities. Modern LLM papers then turned the empirical win into a default architectural choice. PaLM explicitly uses SwiGLU in its dense Transformer blocks, and LLaMA likewise adopts SwiGLU in place of ReLU. That sequence is instructive: the original evidence came from a relatively controlled architecture study, and later large-scale model builders kept the change after scaling everything else.

Method/Mechanism

In a standard Transformer FFN, the residual stream is projected up, passed through an activation such as ReLU or GELU, and projected back down. SwiGLU changes that structure by splitting the expanded representation into two learned pathways. One projection creates a candidate feature vector. The other creates a gate. After a Swish nonlinearity is applied to the gate side, the two vectors are multiplied elementwise before the final projection returns the result to model width.
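The structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the weight names `w_gate`, `w_up`, and `w_down`, the tiny dimensions, and the random initialization are all assumptions made here, and biases are omitted.

```python
import numpy as np


def swish(x):
    # Swish with beta = 1 (also called SiLU): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block applied independently to each token row of x."""
    gate = swish(x @ w_gate)   # pathway that decides how much passes through
    candidate = x @ w_up       # pathway that proposes the feature
    return (gate * candidate) @ w_down  # elementwise product, then project down


rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 4, 8, 16  # toy sizes, chosen only for illustration
x = rng.normal(size=(n_tokens, d_model))
w_gate = rng.normal(size=(d_model, d_ff)) * 0.1
w_up = rng.normal(size=(d_model, d_ff)) * 0.1
w_down = rng.normal(size=(d_ff, d_model)) * 0.1

y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 8): same shape as the input, ready to add to the residual stream
```

Note that the only nonlinearity sits on the gate pathway. The candidate pathway stays linear until the elementwise product, so a gate near zero suppresses a candidate feature no matter how large it is.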

That simple product changes the role of the MLP. A plain GELU block says, roughly, "expand features, squash them independently, then compress." A SwiGLU block says, "expand a candidate feature, but only let it flow when another learned signal says this context should activate it." In other words, content and selection are partially separated. That is valuable in language modeling because many useful computations are conditional: a feature should fire only for certain syntactic, semantic, or positional contexts, not for every token that weakly matches it.

The Swish gate also matters. ReLU makes hard zero-or-positive decisions. GELU softens that, but still treats each neuron as a single transformed channel. SwiGLU adds a second channel whose magnitude can attenuate or amplify the candidate feature continuously. This creates a richer local interaction while preserving a relatively direct gradient route through the multiplicative structure.
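To make the contrast between hard and smooth thresholds concrete, the sketch below evaluates ReLU, GELU, and Swish at a few points and estimates their slopes by finite differences. It covers only the scalar nonlinearities, not the full gated block, and the helper `slope` is a hypothetical name introduced here.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float) -> float:
    # Swish with beta = 1 (also called SiLU): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def slope(f, x: float, h: float = 1e-5) -> float:
    # Central finite difference, used as a stand-in for the derivative.
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, -0.5, 0.5, 2.0):
    print(
        f"x={x:+.1f}  relu={relu(x):+.3f}  gelu={gelu(x):+.3f}  "
        f"swish={swish(x):+.3f}  d_relu={slope(relu, x):+.3f}  "
        f"d_swish={slope(swish, x):+.3f}"
    )
```

For negative inputs ReLU returns zero with zero slope, while Swish returns a small negative value with a nonzero slope, so a gate unit that is currently "mostly closed" still receives a learning signal.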

Key Findings

Two case studies make the mechanism concrete:

Case study 1: T5-style controlled comparisons. In Shazeer's experiments, GEGLU and SwiGLU produced the best held-out log-perplexities among the tested FFN variants, despite matching parameter count and computation against the ReLU baseline. This is the cleanest evidence that the gain comes from representational structure rather than scale alone.
Case study 2: persistence in large open and closed LLM families. PaLM and LLaMA both retain gated MLP variants in model families built for very different scales and training regimes. That is suggestive, though uncontrolled, external evidence that the advantage survives inside frontier-scale training pipelines.

Four crisp insights follow:

SwiGLU makes FFNs more conditional. The extra gate lets the model decide when a candidate feature should matter, not just what the feature is.
Multiplicative routing can outperform extra width. When compute is held fixed, a narrower gated block can beat a wider ungated one.
Smooth gates appear friendlier to optimization than hard thresholds. Swish keeps a nonzero gradient over most of the negative range, whereas ReLU zeroes both the output and the gradient there.
The MLP is doing more than tokenwise feature expansion. In practice it behaves more like a context-sensitive write operation into the residual stream.

One way to summarize the empirical picture is that SwiGLU increases the expressivity of each FFN layer without requiring the model to widen everything indiscriminately. For LLMs, where feed-forward blocks dominate parameter count and account for a large share of FLOPs, that is a highly leveraged trade.

Limitations

The evidence is still more empirical than mechanistic. Shazeer's paper showed consistent improvements, but not a decomposition proving which part of the gain comes from gating, which part from the specific Swish nonlinearity, and which part from changed optimization dynamics over long training runs. The adoption of SwiGLU in PaLM- and LLaMA-style models is suggestive, but those reports bundle many other architectural decisions together.

There is also a measurement problem. Better perplexity does not automatically tell us whether gated MLPs produce sparser features, more stable gradients, cleaner specialization across neurons, or just a more favorable loss landscape. The gain may also depend on scale, normalization scheme, and data mix.

Future Directions

The next useful step is not another leaderboard comparison but a causal account. We should be able to ask whether SwiGLU works by increasing activation sparsity, by improving gradient conditioning, by enabling more selective feature composition, or by some combination of all three. Activation statistics, gate entropy, and intervention studies on MLP outputs seem more promising here than broader benchmark sweeps.
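As a starting point for that kind of measurement, the sketch below computes two candidate statistics from a gated block's hidden activations. Both definitions are assumptions made for illustration: "sparsity" is taken as the fraction of gated hidden units below a hypothetical threshold `eps`, and "gate entropy" as the Shannon entropy of each token's normalized absolute gate values. Random weights stand in for a trained model, so the printed numbers carry no empirical meaning.

```python
import numpy as np


def swish(x):
    # Swish with beta = 1 (also called SiLU): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))


def gate_statistics(x, w_gate, w_up, eps=1e-2):
    """Return (sparsity, mean gate entropy, maximum possible entropy)."""
    gate = swish(x @ w_gate)
    hidden = gate * (x @ w_up)

    # Assumed sparsity measure: share of gated hidden units with magnitude below eps.
    sparsity = float(np.mean(np.abs(hidden) < eps))

    # Assumed entropy measure: treat each token's |gate| values as a distribution
    # over hidden units. Low entropy means a few units carry most of the gate mass.
    weights = np.abs(gate)
    p = weights / (weights.sum(axis=-1, keepdims=True) + 1e-12)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)

    max_entropy = float(np.log(gate.shape[-1]))  # reached by a uniform gate
    return sparsity, float(entropy.mean()), max_entropy


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))            # 32 stand-in token vectors of width 16
w_gate = rng.normal(size=(16, 64)) * 0.5
w_up = rng.normal(size=(16, 64)) * 0.5

sparsity, mean_entropy, max_entropy = gate_statistics(x, w_gate, w_up)
print(f"sparsity={sparsity:.3f}  gate entropy={mean_entropy:.3f} of max {max_entropy:.3f}")
```

In a real study the same statistics would be tracked per layer and per token category, then compared against an ungated baseline trained under the same budget.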

Open question: if SwiGLU helps by implementing context-dependent feature selection, can we directly measure that selectivity in trained LLMs and connect it to interpretable circuits or token categories rather than only to perplexity gains?

Summary

SwiGLU replaced standard Transformer MLP activations because it gives the feed-forward block a better way to decide when a feature should be written, not just how large an intermediate vector should be. Shazeer's compute-matched results show that gated variants can beat ReLU and GELU without simply buying more width. Dauphin et al. provide the older gating intuition, and large-model families such as PaLM and LLaMA show that the design remains attractive at scale. The deepest lesson is that Transformer MLPs are not merely generic nonlinear expansions. They are selective routing modules, and gated variants seem to fit that role better.

References

Primary: Shazeer. "GLU Variants Improve Transformer." arXiv, 2020. https://arxiv.org/abs/2002.05202
Auxiliary: Dauphin et al. "Language Modeling with Gated Convolutional Networks." ICML, 2017. https://arxiv.org/abs/1612.08083
Auxiliary: Chowdhery et al. "PaLM: Scaling Language Modeling with Pathways." arXiv, 2022. https://arxiv.org/abs/2204.02311
Auxiliary: Touvron et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv, 2023. https://arxiv.org/abs/2302.13971