Abstract
This memo asks a narrow research question: do transformer feed-forward (FFN/MLP) layers behave like key-value memories that store factual associations, and if so, what does that imply about how models retrieve and edit facts? The core evidence comes from Geva et al. (EMNLP 2021), who analyze FFN layers as key-value memories where keys align with input text patterns and values induce output-token distributions. A second line of work (ROME; NeurIPS 2022) shows that factual recall in GPT-style models can be localized to mid-layer MLPs and edited with a small rank-one weight update. A related neuron-level view (Knowledge Neurons; ACL 2022) finds that a sparse set of MLP neurons controls the expression of specific facts in cloze prompts. Together these results support a cohesive picture: FFNs are not just generic nonlinearities but structured memory accessors, and factual knowledge is encoded in a way that is localized, editable, and mechanistically interpretable.
Related Work
Geva et al. provide the central mechanistic claim: each FFN layer can be viewed as a key-value memory where keys correspond to textual patterns that activate the layer and values encode an induced distribution over next-token predictions. Their analysis shows that the patterns are human-interpretable, with lower layers reflecting shallow lexical patterns and upper layers reflecting more semantic cues. They also show the FFN output is a composition of multiple “memories,” later refined by residual connections across layers.
Meng et al. (ROME) analyze where factual associations live in GPT-style autoregressive transformers. They introduce causal tracing to identify the hidden activations that determine factual predictions and find that mid-layer MLP modules are decisive when the model processes the last subject token. Building on this localization, they propose Rank-One Model Editing (ROME), which directly updates a single MLP weight matrix to insert a new factual association that generalizes across paraphrases while leaving unrelated predictions largely intact.
Dai et al. (Knowledge Neurons) operate at a finer granularity: they propose a knowledge attribution method to identify neurons within FFN modules whose activations correlate with specific facts in cloze tasks. Suppressing or amplifying these neurons affects whether the model states the corresponding fact, and their case studies show targeted edits of factual knowledge without full fine-tuning.
Method/Mechanism
The key-value memory view starts from the FFN formulation: a hidden state is projected by a first linear layer (keys), passed through a nonlinearity, and then mapped by a second linear layer (values) back to the model dimension. Geva et al. interpret each hidden neuron as a “key” that fires on certain input patterns, while the corresponding output weight vector serves as a “value” that pushes the logits toward tokens likely to follow those patterns. Aggregating over all active keys yields a weighted sum of value vectors, which becomes the FFN output.
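This aggregation can be made concrete with a minimal NumPy sketch. The weights, sizes, and names below are illustrative stand-ins, not the parameters of any real model; the point is only that the FFN output is a key-activation-weighted sum of value rows.

```python
import numpy as np

# Minimal sketch of the key-value view of a transformer FFN.
# Shapes and weights are hypothetical; real models use learned
# parameters and much larger dimensions.
rng = np.random.default_rng(0)
d_model, d_ffn = 8, 32

W_in = rng.standard_normal((d_ffn, d_model))   # rows act as "keys"
W_out = rng.standard_normal((d_ffn, d_model))  # rows act as "values"

def ffn(x):
    # Key matching: how strongly each pattern detector fires on x.
    coeffs = np.maximum(W_in @ x, 0.0)         # ReLU activations
    # Value retrieval: weighted sum of value vectors.
    return coeffs @ W_out

x = rng.standard_normal(d_model)
out = ffn(x)

# The output is exactly a linear combination of value rows,
# weighted by the key activations.
coeffs = np.maximum(W_in @ x, 0.0)
assert np.allclose(out, sum(c * v for c, v in zip(coeffs, W_out)))
```

Under this view, each active key selects its value vector in proportion to how well the input matches its pattern, and the residual stream then mixes these retrieved "memories" across layers.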
ROME uses a causal tracing procedure to localize the components responsible for recalling a fact. They corrupt subject representations, restore individual activations, and measure which interventions recover the correct factual prediction. This identifies a mid-layer MLP site and a specific token position (the last subject token) where the fact is effectively retrieved. ROME then computes a rank-one update that changes the MLP’s value mapping for that subject representation, inserting a new key-value pair while minimally affecting other behavior.
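The core of the rank-one update can be sketched in isolation. This is a simplified illustration, not the full ROME algorithm (which additionally weights the update by a second-moment statistic of key activations estimated from corpus data); here we only show how a rank-one term can force a chosen key to map to a new value while leaving orthogonal directions untouched.

```python
import numpy as np

# Hedged sketch of a rank-one edit in the spirit of ROME.
# All sizes and vectors are hypothetical placeholders.
rng = np.random.default_rng(1)
d_key, d_val = 16, 8
W = rng.standard_normal((d_val, d_key))        # MLP value-projection matrix

k_star = rng.standard_normal(d_key)            # key for the edited subject
v_star = rng.standard_normal(d_val)            # desired output value

# Rank-one update: add the residual along the direction of k*.
residual = v_star - W @ k_star
W_edited = W + np.outer(residual, k_star) / (k_star @ k_star)

# The edited key now retrieves the new value exactly ...
assert np.allclose(W_edited @ k_star, v_star)

# ... while keys orthogonal to k* are unaffected.
k_orth = rng.standard_normal(d_key)
k_orth -= (k_orth @ k_star) / (k_star @ k_star) * k_star
assert np.allclose(W_edited @ k_orth, W @ k_orth)
```

The design choice mirrors the memo's framing: editing a fact amounts to overwriting one key-value association while minimizing interference with the rest of the stored memories.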
Knowledge Neurons mirrors the key-value story at the neuron level. A knowledge attribution score is computed for neurons based on their contribution to a cloze completion. The highest-scoring neurons form a sparse set that can be manipulated: amplifying them boosts the probability of the correct answer, while suppressing them reduces it. This offers a concrete operationalization of “memory slots” inside the FFN.
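The suppress/amplify logic can be illustrated with a toy model. Note the real method computes attributions via integrated gradients through a full transformer; in this sketch a linear logit readout stands in for the model, so the attribution reduces to activation times logit weight. All names and sizes are illustrative assumptions.

```python
import numpy as np

# Toy sketch of neuron-level attribution and intervention in the
# spirit of Knowledge Neurons (hypothetical toy model, not the
# paper's integrated-gradients pipeline).
rng = np.random.default_rng(2)
n_neurons, n_vocab = 20, 5

acts = np.abs(rng.standard_normal(n_neurons))      # FFN neuron activations
W_out = rng.standard_normal((n_neurons, n_vocab))  # neuron -> logit map
answer = 3                                         # index of correct token

# Attribution of each neuron to the answer logit (linear case).
attr = acts * W_out[:, answer]
# Keep the top positively attributed neurons as "knowledge neurons".
top = [i for i in np.argsort(attr)[::-1][:3] if attr[i] > 0]

base_logit = (acts @ W_out)[answer]
suppressed = acts.copy()
suppressed[top] = 0.0                              # knock them out
amplified = acts.copy()
amplified[top] *= 2.0                              # double them

# Removing positive contributions lowers the answer logit;
# doubling them raises it.
assert (suppressed @ W_out)[answer] < base_logit
assert (amplified @ W_out)[answer] > base_logit
```

This is the operational sense in which a sparse neuron set acts like a memory slot: a targeted intervention on a handful of activations moves the model's confidence in one specific completion.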
Key Findings
Two case studies illustrate the key-value memory interpretation in practice:
- Case study 1: Trigger patterns and value predictions. Geva et al. surface trigger examples that highly activate individual FFN keys. These triggers align with recognizable textual patterns (e.g., lexical or syntactic cues), and the associated values bias the model toward tokens that commonly follow those patterns. The combined FFN output acts like a superposition of multiple retrieved memory entries.
- Case study 2: Editing a single fact in GPT. ROME demonstrates that a single, localized rank-one update to a mid-layer MLP can rewrite a specific factual association, improving accuracy on a counterfactual dataset while preserving generalization across paraphrases.
From these cases, several crisp insights follow:
- FFN layers behave like memory lookups, not just generic nonlinear transforms.
- Lower-layer keys skew toward surface patterns; higher-layer keys skew toward semantic ones.
- Factual recall is localized in mid-layer MLP modules when the subject is processed.
- Rank-one edits can insert a new factual association while preserving nearby behavior.
- Neuron-level attribution provides a practical handle for auditing or editing facts.
Limitations
The key-value memory framing is supported by strong qualitative analyses, but it is not a complete mechanistic account of all FFN behavior. Many FFN activations are distributed and context-dependent, so "memory slot" language is at best an approximation. ROME and Knowledge Neurons provide evidence that factual associations are local, yet their manipulations target individual facts and do not scale cleanly to large batches of knowledge updates. Additionally, causal tracing and attribution methods are sensitive to experimental setup (model choice, prompt style, and dataset), so localization results may vary across architectures and training corpora.
Future Directions
A compelling next step is to unify the layer-level key-value interpretation (Geva) with neuron-level attribution (Knowledge Neurons) into a single diagnostic that predicts which FFN subspaces encode a given fact, along with how robust that encoding is across prompts and paraphrases. Another direction is to stress-test model editing at scale: do repeated rank-one edits remain stable, or do they gradually erode other memories encoded in the same layer?
Open question: Can we derive a principled, model-agnostic criterion for when a factual association is “localized enough” to be safely edited in a single MLP layer without creating hidden interference elsewhere in the network?
Summary
The emerging evidence points to a concrete mechanistic story: transformer FFN layers act like key-value memories, and factual recall in GPT-style models is concentrated in mid-layer MLPs. Geva et al. provide the structural view (keys, values, and compositional memory outputs), while ROME and Knowledge Neurons show that specific facts can be localized and edited with targeted interventions. Together these results motivate a focused research program on memory localization, editability, and the stability of knowledge stored in FFNs.
References
- Primary: M. Geva, R. Schuster, J. Berant, O. Levy, "Transformer Feed-Forward Layers Are Key-Value Memories." EMNLP 2021. https://aclanthology.org/2021.emnlp-main.446/
- Aux: K. Meng, D. Bau, A. Andonian, Y. Belinkov, "Locating and Editing Factual Associations in GPT." NeurIPS 2022. https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html
- Aux: D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, F. Wei, "Knowledge Neurons in Pretrained Transformers." ACL 2022. https://aclanthology.org/2022.acl-long.581/