Why RMSNorm works without mean-centering
A longer review of why RMSNorm preserves most of LayerNorm's optimization benefits, what stable activation scale buys, and where mean subtraction still matters.
Research notes
These memos stay intentionally compact. Each one isolates a mechanism, paper, or question so the work remains visible before it becomes polished.
Format
Recent focus
Latest
Everything else stays browseable below, but the freshest research question should be immediate.
A longer review of why RMSNorm preserves most of LayerNorm's optimization benefits, what stable activation scale buys, and where mean subtraction still matters.
A longer review of why sharing input and output embeddings improves language models, what shared lexical geometry buys, and where the assumption breaks.
A longer review of why draft-and-verify generation can accelerate LLM inference without changing the target model's sampling distribution.
A longer review of why compute-optimal LLM training often favors smaller models trained on more tokens, and what that reveals about undertraining.
A longer review of why likelihood-maximizing decoding degenerates in open-ended generation, why nucleus sampling works, and what later theory says about the tail.
A longer review of why written principles plus AI-generated critiques can improve harmlessness, and where that substitution for human labels breaks.
A longer review of why maximal update parameterization keeps learning-rate and scale-sensitive hyperparameters stable as transformers grow wider.
Archive
Ordered newest first so the cadence and evolving interests are easy to scan.
Why stable activation scale often matters more than zero-mean activations in deep transformer optimization.
Why sharing input and output embeddings regularizes lexical geometry and often improves perplexity.
Why draft-and-verify generation can speed up autoregressive sampling without changing the large model's distribution.
Why compute-optimal language-model training often rewards smaller models trained longer on more data.
Why likelihood maximization collapses into repetition in open-ended generation, and why truncating the unreliable tail works better.
Why written principles plus AI critiques can substitute for much direct harmlessness labeling when the base model already understands the norms.
Why width-aware parameter scaling keeps learning rates and related tuning choices stable across transformer scale.
Why gated feed-forward blocks beat plain ReLU and GELU MLPs, even under compute-matched comparisons.
Why downstream specialization often occupies a small weight-space subspace and can be captured with low-rank updates.
Why a few early tokens absorb surplus attention mass and become necessary for stable streaming inference.
Why sampling several chains of thought and aggregating answers approximates marginalization over noisy reasoning paths.
Why LLMs often learn directional factual retrieval without learning stable reverse access to the same relation.
When latent truth directions survive misleading prompts, and where that evidence remains fragile.
Why a specific copy circuit appears abruptly during training and tracks the onset of copy-based in-context learning.
Why larger language models can become less truthful when scaling mostly improves imitation rather than evidence-sensitive abstention.
When DPO matches KL-regularized reward optimization, and where the approximation starts to leak.
How layer norm placement shapes gradient flow and training stability in deep Transformers.
Architectural ingredients, formal expressivity, and what they say about real-world attention.
Why a low-rank output layer limits expressivity and how mixture-of-softmaxes expands it.
Why ALiBi extrapolates to longer contexts more reliably than RoPE.
Evidence that mid-layer MLPs store and retrieve factual associations.
A compact review of the Hopfield-network view of attention, storage capacity, and retrieval.
Evidence that Transformers can implement gradient-descent-like procedures in context.
How weight decay can trigger a delayed shift from memorization to algorithmic generalization.
How sparsity penalties can flip a model between dense superposition and cleaner feature geometry.
Faithfulness, completeness, and minimality as standards for mechanistic explanations.
A lightweight public writing habit for turning weekly notes into usable research output.
Topic log
Useful as a sanity check against repetition and as a map of what the memo series is actually circling.