
FFN layers as key-value memories for factual recall

February 12, 2026 · 14 min read

Abstract

This memo asks a narrow research question: do transformer feed-forward (FFN/MLP) layers behave like key-value memories that store factual associations, and if so, what does that imply about how models retrieve and edit facts? The core evidence comes from Geva et al. (EMNLP 2021), who analyze FFN layers as key-value memories where keys align with input text patterns and values induce output-token distributions. A second line of work (ROME; NeurIPS 2022) shows that factual recall in GPT-style models can be localized to mid-layer MLPs and edited with a small rank-one weight update. A related neuron-level view (Knowledge Neurons; ACL 2022) finds that a sparse set of MLP neurons controls the expression of specific facts in cloze prompts. Together these results support a cohesive picture: FFNs are not just generic nonlinearities but structured memory accessors, and factual knowledge is encoded in a way that is localized, editable, and mechanistically interpretable.

Related Work

Geva et al. provide the central mechanistic claim: each FFN layer can be viewed as a key-value memory where keys correspond to textual patterns that activate the layer and values encode an induced distribution over next-token predictions. Their analysis shows that the patterns are human-interpretable, with lower layers reflecting shallow lexical patterns and upper layers reflecting more semantic cues. They also show the FFN output is a composition of multiple “memories,” later refined by residual connections across layers.

Meng et al. (ROME) analyze where factual associations live in GPT-style autoregressive transformers. They introduce causal tracing to identify activations that determine factual predictions and find that mid-layer MLP modules are decisive when processing the last subject token. Building on that, they propose Rank-One Model Editing (ROME), which directly updates a single MLP weight matrix to insert a new factual association while preserving generalization.

Dai et al. (Knowledge Neurons) operate at a finer granularity: they propose a knowledge attribution method to identify neurons within FFN modules whose activations correlate with specific facts in cloze tasks. Suppressing or amplifying these neurons affects whether the model states the corresponding fact, and their case studies show targeted edits of factual knowledge without full fine-tuning.

Method/Mechanism

The key-value memory view starts from the FFN formulation: a hidden state is projected by a first linear layer (keys), passed through a nonlinearity, and then mapped by a second linear layer (values) back to the model dimension. Geva et al. interpret each hidden neuron as a “key” that fires on certain input patterns, while the corresponding output weight vector serves as a “value” that pushes the logits toward tokens likely to follow those patterns. Aggregating over all active keys yields a weighted sum of value vectors, which becomes the FFN output.
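This forward pass can be sketched in a few lines of numpy. It is a toy illustration only: the sizes and weights are random placeholders, real models add biases and typically use GELU rather than ReLU, and the variable names (`W_K`, `W_V`) are just mnemonics for the first and second FFN projections.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn = 8, 32                    # toy sizes; real models are far larger

W_K = rng.normal(size=(d_ffn, d_model))   # each row acts as a "key"
W_V = rng.normal(size=(d_ffn, d_model))   # each row is the matching "value"

def ffn(x):
    coeffs = np.maximum(W_K @ x, 0.0)     # how strongly each key fires on x
    return coeffs @ W_V                   # weighted sum of value vectors

x = rng.normal(size=d_model)
out = ffn(x)

# The same output, written explicitly as a sum over "memory slots":
manual = sum(max(W_K[i] @ x, 0.0) * W_V[i] for i in range(d_ffn))
assert np.allclose(out, manual)
```

The second formulation makes the memory reading explicit: each active key contributes its value vector, scaled by how strongly the key matched the input.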

ROME uses a causal tracing procedure to localize the components responsible for recalling a fact. They corrupt subject representations, restore individual activations, and measure which interventions recover the correct factual prediction. This identifies a mid-layer MLP site and a specific token position (the last subject token) where the fact is effectively retrieved. ROME then computes a rank-one update that changes the MLP’s value mapping for that subject representation, inserting a new key-value pair while minimally affecting other behavior.
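The edit step can be written down directly, under a simplification: actual ROME weights the update by a covariance statistic of key activations estimated over many inputs, whereas the sketch below takes the minimal-norm solution in plain Euclidean geometry. All matrices here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))     # stand-in for an MLP value projection
k_star = rng.normal(size=d)     # key: the subject's representation
v_star = rng.normal(size=d)     # value: output encoding the new fact

# Minimal-norm rank-one update delta such that (W + delta) @ k_star == v_star.
delta = np.outer(v_star - W @ k_star, k_star) / (k_star @ k_star)
W_new = W + delta

# The edited value is now retrieved for k_star...
assert np.allclose(W_new @ k_star, v_star)

# ...while directions orthogonal to k_star are left untouched.
q = rng.normal(size=d)
q -= (q @ k_star) / (k_star @ k_star) * k_star
assert np.allclose(W_new @ q, W @ q)
```

The two assertions capture the appeal of a rank-one edit: the new key-value pair is inserted exactly, and any input orthogonal to the edited key passes through unchanged.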

Knowledge Neurons mirrors the key-value story at the neuron level. A knowledge attribution score is computed for neurons based on their contribution to a cloze completion. The highest-scoring neurons form a sparse set that can be manipulated: amplifying them boosts the probability of the correct answer, while suppressing them reduces it. This offers a concrete operationalization of “memory slots” inside the FFN.
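The attribution-then-intervene loop can be sketched on a toy FFN with a crude stand-in for the paper's method: Dai et al. use integrated gradients, while the sketch below scores each neuron by the probability drop from zero-ablating it. All weights, the input, and the "target token" are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ffn, vocab = 8, 64, 10
W_K = rng.normal(size=(d_ffn, d_model))    # toy FFN key matrix
W_V = rng.normal(size=(d_ffn, d_model))    # toy FFN value matrix
W_out = rng.normal(size=(vocab, d_model))  # toy unembedding

def answer_prob(x, scale=None, target=0):
    """Probability of the target token, with optional per-neuron scaling."""
    a = np.maximum(W_K @ x, 0.0)           # FFN neuron activations
    if scale is not None:
        a = a * scale                      # suppress (0) or amplify (>1) neurons
    logits = W_out @ (a @ W_V)
    p = np.exp(logits - logits.max())
    return (p / p.sum())[target]

x = rng.normal(size=d_model)               # stand-in for a cloze-prompt state
base = answer_prob(x)

# Attribution by ablation: drop in target probability when zeroing neuron i.
attr = np.array([
    base - answer_prob(x, scale=np.where(np.arange(d_ffn) == i, 0.0, 1.0))
    for i in range(d_ffn)
])

top = np.argsort(attr)[-5:]                # candidate "knowledge neurons"
suppress = np.ones(d_ffn)
suppress[top] = 0.0                        # knock out the top neurons together
p_suppressed = answer_prob(x, scale=suppress)
```

In the paper's setting, suppressing the top-attributed neurons typically lowers the probability of the correct cloze answer and amplifying them raises it; the sketch only shows the mechanics of scoring and intervening.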

Key Findings

Synthesizing across the three papers, several crisp insights follow:

- FFN layers decompose into interpretable memories: keys fire on recognizable textual patterns (shallow and lexical in lower layers, more semantic in upper layers), and values shift the output distribution toward tokens likely to follow those patterns (Geva et al.).
- Factual recall is spatially concentrated: causal tracing localizes it to mid-layer MLPs at the last subject token, and a single rank-one edit at that site can insert a new association while largely preserving unrelated behavior (ROME).
- The encoding is sparse at the neuron level: a small set of FFN neurons controls whether a given fact is expressed, and amplifying or suppressing them raises or lowers the probability of the corresponding answer (Knowledge Neurons).

Limitations

The key-value memory framing is supported by strong qualitative analyses but is not a complete mechanistic account of all FFN behavior. Many FFN activations are distributed and context-dependent, so “memory slot” language can be an approximation. ROME and Knowledge Neurons provide evidence of locality for factual associations, yet their manipulations target individual facts and do not scale cleanly to large-scale knowledge updates. Additionally, the causal tracing and attribution methods are sensitive to experimental setup (model choice, prompt style, and dataset), so the localization may vary across architectures and training corpora.

Future Directions

A compelling next step is to unify the layer-level key-value interpretation (Geva) with neuron-level attribution (Knowledge Neurons) into a single diagnostic that predicts which FFN subspaces encode a given fact, along with how robust that encoding is across prompts and paraphrases. Another direction is to stress-test model editing at scale: do repeated rank-one edits remain stable, or do they gradually erode other memories encoded in the same layer?

Open question: Can we derive a principled, model-agnostic criterion for when a factual association is “localized enough” to be safely edited in a single MLP layer without creating hidden interference elsewhere in the network?

Summary

The emerging evidence points to a concrete mechanistic story: transformer FFN layers act like key-value memories, and factual recall in GPT-style models is concentrated in mid-layer MLPs. Geva et al. provide the structural view (keys, values, and compositional memory outputs), while ROME and Knowledge Neurons show that specific facts can be localized and edited with targeted interventions. Together these results motivate a focused research program on memory localization, editability, and the stability of knowledge stored in FFNs.

References

Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021.

Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022.

Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., & Wei, F. (2022). Knowledge Neurons in Pretrained Transformers. ACL 2022.