ALiBi vs RoPE: positional bias and length extrapolation

February 13, 2026 · 11 min read

Abstract

This memo asks a narrow, practical question: why does ALiBi (Attention with Linear Biases) extrapolate to longer sequence lengths more reliably than Rotary Position Embedding (RoPE), and what does that imply about the inductive bias baked into attention scores? Both methods were introduced in 2021 and aim to encode position without the fixed-length limitations of learned absolute embeddings. RoPE rotates query and key vectors to encode absolute position while inducing relative-position structure inside the attention dot product. ALiBi adds a distance-proportional penalty directly to attention scores, encouraging recency while avoiding explicit positional vectors. The ALiBi paper demonstrates “train short, test long” extrapolation (1024 → 2048) with matched perplexity and lower compute costs, while RoPE demonstrates strong performance on long-text classification benchmarks and theoretical properties around relative distance. This review compares the mechanisms, highlights where the extrapolation behavior plausibly originates, and isolates the open question of how to combine the best of both biases.

Related Work

Positional information is essential in the Transformer architecture introduced by Vaswani et al. (2017), which adds position encodings to token embeddings so attention can distinguish order. The original sinusoidal encodings are fixed and length-agnostic but still require explicit positional vectors to be added at the input. RoPE (Su et al., 2021) and ALiBi (Press et al., 2021) both attempt to retain the flexibility of sinusoidal encodings while improving length generalization and simplifying the incorporation of position into the attention computation itself.

RoPE proposes a rotation-based encoding that inserts relative-position structure directly into the attention dot product. The paper highlights several desirable properties: it can be extended to longer sequence lengths, it yields decaying inter-token dependency with increasing relative distance, and it can be adapted to linear attention variants. ALiBi takes a different stance: instead of injecting positional vectors, it biases the attention scores with a linear distance penalty, producing a built-in recency preference and enabling length extrapolation without changing the embedding layer.

Method/Mechanism

RoPE encodes position by rotating query and key vectors with a position-dependent rotation matrix. The key feature is that the dot product of rotated queries and keys depends on their relative positions, effectively integrating relative distance into attention while maintaining a clean, parameter-free functional form. The RoFormer paper emphasizes that this rotation yields both absolute positional encoding and explicit relative-position dependency in the self-attention formulation, making it compatible with long inputs and theoretical analysis.
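The relative-position property can be seen in a minimal NumPy sketch (an illustration of the mechanism, not the RoFormer implementation): pairs of dimensions are rotated by angles proportional to absolute position, and because rotations are orthogonal, the dot product of a rotated query and key depends only on the position difference.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate vector x (even dimension d) by position-dependent angles.

    Each pair (x[2i], x[2i+1]) is rotated by angle pos * base**(-2i/d),
    following the rotation scheme of Su et al. (2021).
    """
    d = x.shape[-1]
    assert d % 2 == 0
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The rotated dot product depends only on relative position:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)      # positions 5 and 3
s2 = rope_rotate(q, 105) @ rope_rotate(k, 103)  # both shifted by 100
assert np.isclose(s1, s2)  # same relative distance, same score
```

Shifting both positions by the same offset leaves the score unchanged, which is exactly the relative-position structure the paper emphasizes.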

ALiBi skips explicit position vectors entirely. Instead, it adds a penalty to each attention score proportional to the distance between tokens. The effect is a monotonic recency bias that is present at every layer and head without increasing embedding dimensionality. The paper shows that this simple bias supports extrapolation to longer sequences: models trained on length 1024 can be evaluated on length 2048 with perplexity matching a sinusoidal position-embedding baseline trained at 2048, while using less memory and training faster.
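The bias itself is just a precomputed matrix added to the score grid before softmax. The sketch below follows the head-slope scheme from Press et al. (2021), where slopes form a geometric sequence (for 8 heads: 1/2, 1/4, …, 1/256); it is a minimal illustration, not the paper's code.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Per-head linear distance penalties added to attention scores.

    Slopes follow the geometric sequence of Press et al. (2021):
    head h (1-indexed) gets slope 2**(-8*h/num_heads).
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    distance = i - j                 # >= 0 for keys at or before the query
    # Penalty grows linearly with distance to past keys; positions after
    # the query are removed by the causal mask in autoregressive models.
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=4, num_heads=8)
# per head h: scores = q @ k.T / sqrt(d) + bias[h], then mask + softmax
```

Because the penalty is defined for any distance, it applies unchanged to distances never seen in training, which is a plausible source of the extrapolation behavior.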

Key Findings

Two concrete case studies from the primary papers help anchor the comparison:

- ALiBi (Press et al., 2021): a language model trained on sequences of length 1024 and evaluated at length 2048 matches the perplexity of a sinusoidal position-embedding baseline trained at 2048, while using less memory and training faster.
- RoPE (Su et al., 2021): RoFormer performs strongly on long-text classification benchmarks, and the rotation formulation provably yields inter-token dependency that decays with relative distance.

From these cases, several crisp insights follow:

- Extrapolation plausibly originates from a monotonic, score-level bias rather than from richer positional vectors: ALiBi's linear penalty is defined for any distance, so it generalizes to unseen lengths by construction.
- Relative-position structure alone does not guarantee extrapolation; RoPE encodes relative distance cleanly, but its primary results do not include a "train short, test long" benchmark.
- Both methods avoid learned absolute embeddings, suggesting that the embedding layer is not the right place to encode length-sensitive information.

Limitations

These methods are not directly comparable across all tasks. ALiBi emphasizes recency through a linear distance penalty, which is well-suited for language modeling but may underweight very long-range dependencies in tasks that require precise distant retrieval. RoPE, in contrast, encodes relative positions more explicitly but does not include an explicit extrapolation benchmark in its primary results, making it harder to compare on “train short, test long” settings. As a result, the relative advantages may depend heavily on task type (classification vs. generation) and the training objective.

Future Directions

A useful next step is a controlled study that evaluates RoPE and ALiBi under identical training conditions, isolating the effects of the positional bias rather than architectural or data choices. Another direction is to build hybrid schemes that preserve RoPE’s relative-geometry benefits while incorporating the monotonic distance bias that makes ALiBi extrapolate. Such hybrids could be tested on length extrapolation and on tasks requiring long-distance retrieval to see if they can achieve both extrapolation and faithful long-range reasoning.

Open question: Can we design a positional encoding that preserves RoPE’s relative structure while adding ALiBi-style monotonic distance bias, and does this hybrid improve both length extrapolation and long-range retrieval accuracy in the same model?
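One concrete form the hybrid could take is to rotate queries and keys with RoPE and then subtract an ALiBi-style linear penalty from the resulting scores. The sketch below is purely illustrative, my own assumption about how such a combination might look, and comes from neither paper; the `slope` value and the choice to penalize only past distances are arbitrary.

```python
import numpy as np

def hybrid_scores(q, k, slope=0.25, base=10000.0):
    """RoPE-rotated attention scores with an ALiBi-style penalty.

    Hypothetical combination: rotate q/k as in Su et al. (2021),
    then subtract slope * distance as in Press et al. (2021).
    q, k: arrays of shape (seq_len, d) with d even.
    """
    n, d = q.shape
    freqs = base ** (-np.arange(0, d, 2) / d)
    theta = np.arange(n)[:, None] * freqs[None, :]
    cos, sin = np.cos(theta), np.sin(theta)

    def rot(x):
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    scores = rot(q) @ rot(k).T / np.sqrt(d)
    dist = np.arange(n)[:, None] - np.arange(n)[None, :]
    return scores - slope * np.maximum(dist, 0)  # penalize past distance only

rng = np.random.default_rng(1)
q, k = rng.standard_normal((6, 8)), rng.standard_normal((6, 8))
s = hybrid_scores(q, k)
```

Whether this particular composition improves both extrapolation and long-range retrieval is exactly the open empirical question; the penalty here is additive and monotonic, so it preserves RoPE's relative geometry while imposing ALiBi's recency preference.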

Summary

ALiBi and RoPE represent two distinct inductive biases for positional information. RoPE injects relative-position structure into the attention dot product through rotation, while ALiBi adds a distance-proportional score penalty that yields strong length extrapolation. The ALiBi results show a surprisingly large benefit from a simple linear bias, while RoPE demonstrates that relative structure can be embedded directly into attention geometry. The main takeaway is that length extrapolation is not just about having “more position information,” but about how that information shapes attention scores.

References

Press, O., Smith, N. A., & Lewis, M. (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need.