Can hidden states reveal what an LLM thinks is true?

Abstract

This memo looks at a narrow alignment question: when a language model says something false, do its internal activations still contain information about the truth? The central paper is Burns et al. (2023), which introduced Contrast-Consistent Search (CCS), an unsupervised method for extracting a truth-related direction from model activations without labels. The result matters because many safety failures look like a gap between knowing and saying. A model may imitate a common misconception, follow a misleading prompt, or optimize for sounding plausible, even if some internal representation still tracks reality better than the text it emits. The deeper question is not just whether probes can classify true versus false statements, but whether truth is represented in a stable enough geometry to support monitoring or intervention. My view is that the evidence supports a real but fragile phenomenon: sufficiently capable models often encode truth signals linearly, yet those signals are entangled with prompt framing and dataset structure, and it is hard to separate what a model represents as fact from what its next-token policy favors.

Related Work

The problem sits inside the broader Eliciting Latent Knowledge agenda: if model outputs are not fully trustworthy, perhaps internal states are a better place to ask what the model "knows." TruthfulQA (Lin et al., 2022) gave this agenda a concrete benchmark by showing that larger models were often less truthful on its questions, plausibly because they are better at imitating widespread human falsehoods. Burns et al. then shifted focus from outputs to representations, asking whether unlabeled activations alone can recover correct answers on paired statements and negations.

Follow-up work sharpened the picture. Marks and Tegmark (2023) argued that truth and falsehood form an emergent linear structure in sufficiently large models and showed transfer across curated true/false datasets. Li et al. (2023) moved from reading to acting: if a truth direction exists, one can shift activations along it at inference time and measurably increase truthfulness. Together these papers turn a philosophical claim into an empirical one: hidden states may separate "belief-like" content from the final generation policy, at least on constrained factual tasks.

Method/Mechanism

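CCS starts with paired inputs that should have opposite truth values, such as a statement and its negation or a yes/no question under two answer templates. For each item, one extracts hidden activations from a chosen layer, normalizes them separately for each template so the probe cannot simply read off which template was used, and learns a simple linear probe. The key constraint is not supervision from labels but consistency: exactly one member of the pair should be true, so the probe should assign complementary probabilities to the two members. A second, confidence term discourages the degenerate solution of assigning 0.5 to both. Across many examples, the optimization searches for a direction in activation space that makes these relations hold. Because no labels are used, the method cannot by itself say which end of the direction means "true"; that polarity has to be fixed separately.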
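
The objective can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the names `normalize`, `ccs_loss`, `x_pos`, and `x_neg` are hypothetical, the activations are synthetic stand-ins with one planted truth axis, and only the loss is shown, not the optimizer. The loss adds a confidence penalty to the consistency requirement so that a probe answering 0.5 everywhere is not rewarded.

```python
import numpy as np

def normalize(x):
    """Standardize one template's activations so the probe cannot simply
    read off which answer template produced them."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(w, b, x_pos, x_neg):
    """Consistency + confidence loss for a linear probe p(x) = sigmoid(w.x + b)."""
    p_pos = sigmoid(x_pos @ w + b)
    p_neg = sigmoid(x_neg @ w + b)
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # the two probabilities should sum to 1
    confidence = np.minimum(p_pos, p_neg) ** 2   # one of them should be close to 0
    return float(np.mean(consistency + confidence))

# Synthetic stand-in for hidden states (assumption: a single planted truth axis).
rng = np.random.default_rng(0)
n, d = 200, 16
truth = rng.choice([-1.0, 1.0], size=n)          # hidden labels, never passed to the loss
axis = np.zeros(d)
axis[0] = 1.0
x_pos = normalize(3.0 * truth[:, None] * axis + 0.5 * rng.normal(size=(n, d)) + 2.0)
x_neg = normalize(-3.0 * truth[:, None] * axis + 0.5 * rng.normal(size=(n, d)) - 2.0)

loss_trivial = ccs_loss(np.zeros(d), 0.0, x_pos, x_neg)   # probe outputs 0.5 everywhere
loss_aligned = ccs_loss(10.0 * axis, 0.0, x_pos, x_neg)   # probe along the planted axis
print(loss_trivial, loss_aligned)
```

On this toy data the all-zero probe scores exactly 0.25, entirely from the confidence term, while the probe aligned with the planted axis scores close to zero. In a real setting `x_pos` and `x_neg` would be hidden states from the two members of each contrast pair, and `w` and `b` would be found by gradient-based optimization from several random initializations.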

Mechanistically, this is interesting because it does not require the model to answer correctly in text. If a prompt says "Answer like a conspiracy theorist," the output channel may be pushed toward a false completion. CCS asks whether earlier residual-stream states still preserve a separable truth signal before decoding and instruction following fully dominate. In that sense, CCS is not a behavioral method but a representation-geometry method.

The follow-up "geometry of truth" line makes a stronger claim. If truth is genuinely linearly encoded, then a crude probe such as a difference in class means should generalize across datasets, and causal intervention along that direction should change outputs. That is exactly why intervention papers matter here: they convert a correlational probe result into limited causal evidence. A direction that merely reads out spurious formatting cues will classify well in distribution but should not reliably alter the truthfulness of generations when injected back into the forward pass.
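
A difference-in-class-means probe and a transfer check can be sketched as follows. This is a toy example on synthetic data, assuming that both "datasets" share one planted truth direction and differ only in noise level; `mean_diff_probe`, `predict`, and `make_dataset` are hypothetical helper names, and nothing here reproduces results from the cited papers.

```python
import numpy as np

def mean_diff_probe(acts, labels):
    """Return (direction, midpoint) from the difference of class means."""
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    return mu_true - mu_false, 0.5 * (mu_true + mu_false)

def predict(acts, direction, midpoint):
    """Label a statement true when it projects past the midpoint."""
    return ((acts - midpoint) @ direction > 0).astype(int)

def make_dataset(rng, n, d, truth_axis, noise):
    """Synthetic activations (assumption): truth shifts points along truth_axis."""
    labels = rng.integers(0, 2, size=n)
    acts = noise * rng.normal(size=(n, d)) + np.outer(2 * labels - 1, truth_axis)
    return acts, labels

rng = np.random.default_rng(0)
d = 32
truth_axis = np.zeros(d)
truth_axis[:4] = 1.0                      # planted direction shared by both datasets

train_acts, train_labels = make_dataset(rng, 500, d, truth_axis, noise=0.5)
test_acts, test_labels = make_dataset(rng, 500, d, truth_axis, noise=1.0)  # stand-in for a second dataset

direction, midpoint = mean_diff_probe(train_acts, train_labels)
transfer_acc = (predict(test_acts, direction, midpoint) == test_labels).mean()
cosine = direction @ truth_axis / (np.linalg.norm(direction) * np.linalg.norm(truth_axis))
print(f"transfer accuracy: {transfer_acc:.3f}, cosine to planted axis: {cosine:.3f}")
```

The probe has no trainable parameters beyond the two class means, which is what makes transfer informative: if it still separates a second dataset, the separation is less likely to come from a flexible classifier fitting dataset-specific cues. With real activations the two datasets would also differ in topic and phrasing, which this sketch does not model.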

Key Findings

Two case studies make the main claim more concrete:

Case study 1: misleading prompts versus hidden-state truth. Burns et al. report that CCS can retain high accuracy even when prompts push models toward incorrect answers. This is the clearest example of the "knowing vs saying" split: the output degrades, but an internal linear signal still points toward the correct truth value.
Case study 2: intervention on TruthfulQA. Li et al. show that shifting the activations of a small set of attention heads along truth-related directions can substantially improve TruthfulQA performance; their reported Alpaca result moves truthfulness from 32.5% to 65.1%. That does not prove the whole concept of truth is localized, but it does show that some extracted directions are operationally meaningful.
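
The shift step behind this kind of intervention can be illustrated in isolation. The sketch below is a toy, with no model involved: the function name `steer`, the strength `alpha`, and the random arrays are all assumptions made for illustration. It moves each hidden vector by `alpha` standard deviations along a unit direction, where the standard deviation is measured on a set of reference activations.

```python
import numpy as np

def steer(hidden, direction, alpha, reference_acts):
    """Shift hidden states by alpha standard deviations along a unit direction."""
    unit = direction / np.linalg.norm(direction)
    sigma = (reference_acts @ unit).std()   # spread of reference activations along the direction
    return hidden + alpha * sigma * unit

rng = np.random.default_rng(0)
d = 8
direction = rng.normal(size=d)               # stand-in for a probe-derived direction
reference_acts = rng.normal(size=(1000, d))  # stand-in for activations collected on held-out prompts
hidden = rng.normal(size=(3, d))             # three token positions, toy values

steered = steer(hidden, direction, alpha=5.0, reference_acts=reference_acts)

unit = direction / np.linalg.norm(direction)
sigma = (reference_acts @ unit).std()
shift = (steered - hidden) @ unit            # displacement of each position along the direction
off_axis = (steered - hidden) - np.outer(shift, unit)
print(shift, np.abs(off_axis).max())
```

Every position moves by the same amount, `alpha * sigma`, along the direction and not at all orthogonally to it. Whether such a shift changes what a model actually generates is the empirical question the intervention experiments address; this snippet only shows the arithmetic of the shift.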

Four crisp insights stand out from this literature:

Truth seems easier to encode than to express. Output generation reflects imitation, instruction following, and stylistic priors, while intermediate states can preserve a cleaner factual signal.
Linearity is the surprising part. The important result is not merely that truth is present somewhere, but that simple linear probes often recover it in capable models.
Probe success is stronger when symmetry is built into the task. Paired statements, negations, and balanced true/false datasets make the geometry visible; open-ended generations are much messier.
Causal intervention is a stricter test than decoding accuracy. If moving activations along a discovered direction changes answers, the direction is more than a passive diagnostic.

A fifth observation is more cautionary. Transfer improves with model scale and dataset cleanliness, but not all models appear to host equally stable truth directions. This suggests the phenomenon is emergent rather than guaranteed by the transformer architecture alone.

Limitations

First, these results live mostly on controlled true/false tasks. That is a much narrower target than real-world truthfulness in long-form, ambiguous, or multi-hop settings. A model may linearly encode the truth value of "The Pacific is larger than the Atlantic" while still failing on questions that require synthesis, source uncertainty, or causal explanation.

Second, there is a conceptual gap between a probe decoding truth and a model actually "believing" something in any robust sense. The probe may exploit latent features correlated with dataset construction, lexical form, or negation patterns. This is why cross-dataset transfer and intervention matter so much, but even they do not fully resolve the semantics of belief in autoregressive models.

Third, truth directions can trade off against helpfulness or fluency. If an intervention makes the model more cautious, less speculative, or more likely to abstain, benchmark truthfulness can improve without a correspondingly rich improvement in epistemic competence. Alignment-wise, that distinction is important: reducing falsehoods is useful, but it is not the same as obtaining calibrated, well-scoped honesty.

Future Directions

The next step is to test whether truth directions survive harder distribution shifts: adversarial prompting, deceptive instructions, multilingual settings, and longer reasoning chains. Another useful direction is localization. Are truth-related signals carried broadly across the residual stream, or do they concentrate in a sparse subset of heads, MLP features, or layers? That matters for both safety monitoring and robustness.

A further frontier is calibration rather than binary classification. The field currently asks whether a statement is true or false, but deployed systems need graded uncertainty, not only a sign. Finally, there is a deep alignment connection: if internal truth signals can be read more faithfully than outputs, they may become ingredients for scalable oversight, lie detection, or corrigibility checks. But that only works if the signals remain stable when the model is pushed into strategically deceptive regimes.

Open question: Is there a truth representation that remains linearly readable even when a capable model is explicitly optimizing to appear convincing while being false?

Summary

The evidence so far suggests a real separation between internal factual representation and outward generation behavior. CCS made the key move by showing that unlabeled activations can often recover truth better than raw model answers, and later work strengthened the claim with transfer and intervention results. Still, the phenomenon is narrower than "models have a universal truth neuron." It is strongest on structured tasks, benefits from paired symmetry, and weakens as we move toward open-ended reasoning and strategic behavior. For alignment, that is still important news: if models sometimes know better than they say, hidden-state methods may become one of the few ways to measure the gap directly.

References

Burns et al. "Discovering Latent Knowledge in Language Models Without Supervision." ICLR 2023. https://arxiv.org/abs/2212.03827
Lin et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022. https://aclanthology.org/2022.acl-long.229/
Marks and Tegmark. "The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets." 2023. https://arxiv.org/abs/2310.06824
Li et al. "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model." NeurIPS 2023. https://arxiv.org/abs/2306.03341
Mallen et al. "Eliciting Latent Knowledge from Quirky Language Models." ICLR 2024. https://openreview.net/forum?id=nGCMLATBit