A deep review of how layer norm placement controls gradient flow and residual amplification in deep Transformers.
A deep review of when Transformers are Turing complete, why hard attention matters, and what formal-language limits imply.
A deep review of how the softmax bottleneck limits expressivity in language models and how mixture-of-softmaxes raises the effective rank.
A focused review of why ALiBi extrapolates to longer contexts more reliably than RoPE, and what that implies about positional inductive bias in attention.
A deep review of evidence that transformer feed-forward layers behave like key-value memories and localize factual recall in mid-layer MLPs.
A focused review of the modern Hopfield-network view of attention, with emphasis on storage capacity and retrieval behavior.
A deep dive into gradient-descent-like mechanisms in in-context linear regression.
A deep dive into how weight decay triggers delayed generalization in transformer grokking.
A deep dive into how sparsity penalties trigger phase transitions between superposed and monosemantic feature representations.
A focused memo on how to evaluate mechanistic explanations in GPT-2 small.
A quick pilot for turning weekly reading notes into a public-facing update.