Do language models plan ahead for future tokens?
Do transformers “think ahead” during inference at a given position? It is known that transformers prepare information in the hidden states of the forward pass at time step t that is then used in future forward passes t+τ. We posit two explanations for this phenomenon: pre-caching, in which off-diagonal gradient terms present during training result in the model computing features at t irrelevant to the present inference task but useful for the future, and breadcrumbs, in which the features most relevant to time step t are already the same as those that would most benefit inference at time t+τ. We test these hypotheses by training language models without propagating gradients to past timesteps, a scheme we formalize as myopic training. In a constructed synthetic data setting, we find clear evidence for pre-caching. In the autoregressive language modeling setting, our experiments are more suggestive of the breadcrumbs hypothesis, though pre-caching increases with model scale.
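The key mechanism here is myopic training: the loss at position t+τ is not allowed to shape what the model computes at position t, which removes the off-diagonal gradient terms. As a rough illustration of that idea (a minimal sketch, not the paper's actual implementation; the function name, tensor shapes, and single-head setup are assumptions for this example), the code below modifies a causal self-attention layer so that each query reads detached, stop-gradient keys and values from strictly-past positions while keeping its own position's contribution differentiable.

```python
# Sketch of the idea behind myopic training (illustrative, not the authors'
# exact scheme): block gradients from a position's loss from reaching anything
# computed at earlier positions by detaching the past keys and values.
import math
import torch
import torch.nn.functional as F


def myopic_causal_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention where each query position t reads
    live (differentiable) keys/values only at s == t; strictly-past positions
    s < t contribute detached keys/values, so no gradient flows back to them."""
    T, d = x.shape
    q = x @ w_q
    k, v = x @ w_k, x @ w_v
    k_past, v_past = k.detach(), v.detach()   # stand-in for a frozen KV cache

    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
    diag = torch.eye(T, dtype=torch.bool, device=x.device)

    # Attention scores: live on the diagonal, detached for strictly-past slots.
    scores = torch.where(diag, q @ k.T, q @ k_past.T) / math.sqrt(d)
    scores = scores.masked_fill(~causal, float("-inf"))
    attn = F.softmax(scores, dim=-1)

    # Output mixes the live value at s == t with detached values for s < t.
    attn_diag = torch.where(diag, attn, torch.zeros_like(attn))
    return attn_diag @ v + (attn - attn_diag) @ v_past


if __name__ == "__main__":
    # Toy check: the loss at the last position sends no gradient to x[0].
    d, T = 8, 5
    x = torch.randn(T, d, requires_grad=True)
    w_q, w_k, w_v = (torch.randn(d, d, requires_grad=True) for _ in range(3))
    loss = myopic_causal_attention(x, w_q, w_k, w_v)[-1].sum()
    (grad_x,) = torch.autograd.grad(loss, x)
    print(grad_x[0])  # all zeros: no gradient propagates to the earliest position
```

In a full myopic-training setup, every cross-position gradient path in the model (all attention layers, not just one) would have to be cut in this way; the single layer above is only meant to show how stop-gradients sever the path from a future loss back to a past position's computation.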
Further reading
- Access the paper on arXiv.org