Abstract
This memo asks a narrow theoretical question: under what architectural assumptions is a Transformer Turing complete, and how do those assumptions relate to the self-attention used in modern LLMs? The JMLR result by Pérez, Barceló, and Marinković shows that a Transformer with hard attention can simulate a Turing machine by treating attention as a content-addressable read/write interface over internal dense representations. The flip side is that standard soft attention with fixed depth does not inherit those guarantees. I review the constructive proof ingredients, the minimal building blocks identified by the Turing-completeness result, and the countervailing limitations shown in formal-language studies of self-attention. The goal is to clarify which pieces of the Transformer stack are doing the heavy theoretical lifting, and which gaps remain between existence proofs and the behavior of practical LLMs.
Related Work
Theoretical work on the computational power of Transformers splits into two camps. The first camp establishes expressivity results, typically via explicit constructions. The JMLR paper “Attention is Turing-Complete” (2021) is the canonical example, showing that a Transformer with hard attention and positional encodings can simulate a Turing machine. The second camp studies limitations of self-attention on formal languages. Hahn (TACL 2020) shows that fixed-depth self-attention cannot model certain regular and hierarchical languages (PARITY and Dyck-2 are the canonical examples) unless the depth or number of heads grows with sequence length. Empirical work by Bhattamishra et al. (EMNLP 2020) complements this with controlled experiments on Dyck-1 and other counter languages, revealing when Transformers can generalize and which components seem responsible. Taken together, these papers show that Transformers can be both surprisingly powerful in principle and surprisingly brittle under realistic constraints.
Method/Mechanism
The Turing-completeness construction is not a claim about typical training; it is a claim about representational capacity. Pérez et al. show that a Transformer with hard attention can implement the state transition of a Turing machine by encoding the machine’s tape, head position, and finite control state into a sequence of vectors. Hard attention matters because it lets the model select a single memory slot deterministically. Once a slot is selected, residual updates can overwrite it, mimicking a write operation on the tape. Positional encodings or other markers provide a stable addressing scheme so the model can move its “read/write head” left or right.
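The qualitative difference between the two attention variants can be seen in a few lines of numpy. This is my own illustrative sketch, not code from the paper: `soft_attention` and `hard_attention` are hypothetical names, and the one-hot "memory slots" stand in for value vectors.

```python
import numpy as np

def soft_attention(scores, values):
    # Softmax mixes ALL value vectors in proportion to their scores.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def hard_attention(scores, values):
    # Hard attention deterministically selects exactly one value vector.
    return values[np.argmax(scores)]

values = np.eye(3)                    # three distinct "memory slots"
scores = np.array([2.0, 5.0, 1.0])    # slot 1 scores highest

print(soft_attention(scores, values))   # a blurred mixture of all slots
print(hard_attention(scores, values))   # exactly slot 1: [0. 1. 0.]
```

The hard read returns the stored vector exactly, which is what makes the "read/write head" interpretation possible; the soft read is always a mixture, however peaked.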
The proof relies on three architectural ingredients: (1) an embedding that stores a discrete tape symbol in a continuous vector, (2) a way to point to a specific position in the sequence, and (3) a mechanism for overwriting that position’s representation. The authors show that a fixed number of Transformer layers suffices to carry out a single Turing machine step; long computations are then obtained by running the decoder autoregressively, one simulated step per generated position, so runtime shows up as sequence length rather than depth. This makes the result more like a programmable machine than a fixed sequence transducer: a fixed-size network can simulate arbitrarily long computations, but only by unrolling them over arbitrarily many decoding steps.
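The three ingredients can be caricatured in plain Python. This is a heavy simplification of my own, not the paper's construction: the real proof encodes the read/overwrite/move logic into attention and MLP weights, whereas here `tm_step` is ordinary code and the "machine" just writes 1s moving rightward.

```python
import numpy as np

# Ingredient 1: tape symbols stored as one-hot vectors (0 -> [1,0], 1 -> [0,1]).
tape = [np.array([1.0, 0.0]) for _ in range(5)]

# Ingredient 2: a one-hot pointer over positions plays the role of the head address.
head = np.zeros(5)
head[0] = 1.0

def tm_step(tape, head):
    """One simulated step: read the addressed cell, overwrite it with '1', move right."""
    pos = int(np.argmax(head))          # hard attention: select exactly one slot
    _symbol = tape[pos]                 # read (unused by this toy machine)
    tape[pos] = np.array([0.0, 1.0])    # Ingredient 3: overwrite the representation
    new_head = np.roll(head, 1)         # shift the address one cell to the right
    return tape, new_head

# Runtime unrolls over repeated applications of the same fixed per-step machinery.
for _ in range(3):
    tape, head = tm_step(tape, head)

print([int(np.argmax(cell)) for cell in tape])  # [1, 1, 1, 0, 0]
```

The point of the caricature is the shape of the simulation: fixed machinery per step, with total runtime determined by how many times the step is applied.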
Key Findings
Two concrete case studies help ground the theory in something more tangible:
- Case study 1: Simulating a Turing tape with hard attention. Pérez et al. construct a Transformer whose attention selects exactly one tape cell at a time, updates its symbol, and shifts the focus left or right based on the Turing machine’s control state. The sequence itself functions as the tape; the residual stream acts as the tape cell contents. The construction is explicit and mechanical, which makes it useful for reasoning about what Transformer components can encode.
- Case study 2: Dyck-1 and counter languages. Bhattamishra et al. train Transformers to recognize Dyck-1 (balanced parentheses) and other counter languages. Their experiments show that Transformers can learn the intended counting mechanism when given suitable positional encodings and training distributions. This gives a practical example of how the model might implement a limited pushdown-like behavior with attention and MLPs.
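The counting behavior Dyck-1 demands can be stated in a few lines. This reference recognizer is not a Transformer, just the target function a trained model must emulate; it assumes the input contains only `(` and `)`.

```python
def is_dyck1(s: str) -> bool:
    """Recognize balanced parentheses with a single counter (assumes only '(' and ')')."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:          # a ')' with no matching '(' rejects immediately
            return False
    return depth == 0          # every '(' must eventually be closed

print(is_dyck1("(()())"))  # True
print(is_dyck1("())("))    # False
```

A single bounded-above counter is all the "stack" Dyck-1 needs, which is why it is a natural probe for whether attention plus MLPs can maintain a running count.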
From these analyses, a few crisp insights emerge:
- Hard attention is a qualitative shift. The deterministic selection of a single value vector is what allows attention to behave like a read/write head rather than a smooth average.
- Positional encodings are part of the memory interface. Without a stable addressing scheme, it is unclear how the model can repeatedly return to and update a specific tape cell.
- Sequential steps represent time. The Turing-completeness proof unrolls computation across autoregressive decoding steps rather than across layers: a fixed-depth network simulates longer runtimes by generating longer sequences.
- Fixed-depth Transformers are formally limited. Hahn’s results show that if depth does not grow with input length, there are formal languages the model provably cannot recognize.
- Empirical success is compatible with theoretical limits. The formal-language studies suggest that natural language may sit in a “learnable subset” even if it exceeds the strict capacity of fixed-depth self-attention on worst-case strings.
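The "stable addressing" point above can be made concrete with a toy scheme of my own (not from the proofs): if positional encodings are idealized as orthogonal, a query built from a target position's code retrieves exactly that cell under hard attention, no matter how often it is reused.

```python
import numpy as np

n_cells, dim = 6, 6
pos = np.eye(n_cells, dim)                   # idealized orthogonal positional codes (keys)
rng = np.random.default_rng(0)
contents = rng.normal(size=(n_cells, dim))   # arbitrary cell contents (values)

def read_cell(target: int) -> np.ndarray:
    """Address a cell by its positional code; hard attention returns it exactly."""
    query = pos[target]                  # the address is the position encoding itself
    scores = pos @ query                 # dot products peak only at the target position
    return contents[np.argmax(scores)]

print(np.allclose(read_cell(3), contents[3]))  # True: the same cell every time
```

Real positional encodings are only approximately orthogonal, which is one place where the idealized memory interface and practical Transformers diverge.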
The overall picture is that expressivity results are existence proofs about what a Transformer could compute, while limitations results define the boundary conditions when depth and precision are fixed. The tension between these two viewpoints is exactly what makes the topic interesting.
Limitations
The Turing-completeness theorem assumes hard attention, which is not how attention is implemented in standard LLMs. Soft attention averages values, which makes it harder to emulate discrete read/write operations without loss. The proof also assumes unbounded precision in the vectors, whereas real-world training operates with finite precision and noisy optimization. Another limitation is that the construction is not necessarily efficient: simulating long computations requires running the decoder for correspondingly many steps, one per simulated machine step. Finally, formal-language results like Hahn’s show that even with hard attention, bounded-depth models cannot capture certain periodic or hierarchical languages unless capacity scales with sequence length.
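The averaging problem can be quantified with a small numerical sketch of my own: even a strongly peaked softmax leaks probability mass to the other slots, so a soft read is never exactly the stored symbol, and the error compounds over repeated read/write cycles.

```python
import numpy as np

values = np.eye(4)                        # four distinct one-hot memory slots
scores = np.array([10.0, 0.0, 0.0, 0.0])  # strongly favor slot 0

weights = np.exp(scores - scores.max())
weights /= weights.sum()
read = weights @ values                   # soft read of slot 0

# The read is close to, but never exactly, the stored symbol.
error = np.abs(read - values[0]).max()
print(f"per-read error: {error:.6f}")     # small but strictly positive
```

With finite precision, one cannot simply make the scores arbitrarily large to drive this error to zero, which is why the hard-attention assumption does real work in the theorem.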
Future Directions
One path forward is to connect expressivity proofs with learnability: can gradient descent discover the constructions that the Turing-completeness theorem guarantees? Another direction is to study whether soft attention plus auxiliary mechanisms (e.g., recurrence or external memory) can approximate hard attention in a way that preserves theoretical power. Formal-language experiments could also be refined to measure not just accuracy but the specific mechanisms learned by attention heads and MLPs.
Open question: What is the minimal set of architectural tweaks that let a practical, trainable Transformer approximate hard-attention read/write behavior without sacrificing the optimization stability of soft attention?
Summary
Transformers can be Turing complete, but the proof relies on hard attention, stable addressing, and the ability to spend autoregressive decoding steps as computational time. Fixed-depth, soft-attention Transformers are provably limited on certain formal languages, yet they can still learn meaningful counting and structured behavior in practice. The gap between theoretical capacity and what SGD actually finds remains the core mystery. Understanding when attention behaves like a crisp memory interface, and when it devolves into a fuzzy average, is likely the key to closing that gap.
References
- Pérez, Barceló, & Marinković. “Attention is Turing-Complete.” Journal of Machine Learning Research, 2021. https://www.jmlr.org/papers/v22/20-302.html
- Hahn. “Theoretical Limitations of Self-Attention in Neural Sequence Models.” TACL 2020. https://aclanthology.org/2020.tacl-1.11/
- Bhattamishra, Ahuja, & Goyal. “On the Ability and Limitations of Transformers to Recognize Formal Languages.” EMNLP 2020. https://aclanthology.org/2020.emnlp-main.576/