This is just some rough thinking I posted as a long Tweet:

https://x.com/techczech/status/1797157824604041671

Here’s the full text:

LLMs don't have working memory; all they have is attention.

This is an essential difference between LLM reasoning and human reasoning that explains many of their failings.

Human attention is very limited, not because of distractions but because it just feeds into working memory.

Working memory is also limited, but it has some interesting properties (chunking, sketchpad, loops) that allow us to reason. Working memory is what allows System 2 (slow) thinking.

LLMs' attention is amazing, but it feeds directly into tokens. LLMs can attend simultaneously to millions of tokens, discern which ones are important and react with another token instantly. There is no intermediate working memory module that could, for instance, set a few tokens aside and wait for a few more to come in before deciding which one to choose. Thus, as @goodside says, they are really 'free-style rapping' their responses. Or, in fast/slow terms, they only have System 1 thinking.
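
To make that concrete, here is a stripped-down sketch of scaled dot-product attention (single head, no masking or learned projections, so a deliberate simplification of what real models do). The point is that the output is an immediate weighted blend of all the token representations, computed in one pass, with nothing held aside for a later decision:

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: every query looks at every key at once,
    and the result is an instant weighted blend of the values.
    There is no buffer that sets some tokens aside for later."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)                 # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over all positions at once
    return weights @ values                                  # one-shot mix, straight toward the next token

# Toy example: 4 tokens with 8-dimensional representations
tokens = np.random.randn(4, 8)
out = attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8)
```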

We don't have any useful intuitions for what that's like because we don't have any introspective experience analogous to LLM attention. The closest I can suggest is the experience of seeing a face and immediately being able to react to its expression. Now imagine being able to do that with a whole book. But even this does not let us make reliable inferences about what future LLMs will be able to do.

There are many attempts to add something like working memory to LLMs, and they fall essentially into two categories:

  1. Internal architecture: Things like state space models, etc. There's some promise here, but nothing that seems like it would lead to an internal 'System 2' capability in LLMs. (A minimal sketch of the state space idea follows after this list.)

  2. Orchestration: Things like agents or prompt flows that, for instance, generate multiple options at once and let the system vote on them (a sketch of the voting idea also follows below). There is more promise here, but all attempts run into the 'bitter lesson' problem: trying to be too clever about designing a system based on how we think we think shows initial promise but hits a ceiling very quickly.
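
For the first category, here is a minimal, purely illustrative sketch of the state space idea (the matrices and dimensions are my own toy choices, not any published architecture): a fixed-size state is carried forward and updated token by token, which is at least the right *shape* for something like working memory, even if nothing so far behaves like one.

```python
import numpy as np

def ssm_scan(inputs, A, B, C):
    """Minimal linear state space recurrence: a fixed-size state h is carried
    forward and updated with each token, instead of attending to everything at once."""
    h = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:             # one token at a time
        h = A @ h + B @ u        # update the carried state
        outputs.append(C @ h)    # emit a representation from the state
    return np.stack(outputs)

# Toy example: 4 tokens, 8-dim inputs, 16-dim state
d_in, d_state = 8, 16
A = np.eye(d_state) * 0.9
B = np.random.randn(d_state, d_in) * 0.1
C = np.random.randn(d_in, d_state) * 0.1
out = ssm_scan(np.random.randn(4, d_in), A, B, C)
print(out.shape)  # (4, 8)
```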
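
For the second category, a minimal sketch of the 'generate several options and vote' pattern. The `generate` function here is a hypothetical placeholder for whatever model call you actually use; the scaffolding around it is the point:

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for a call to an LLM; replace with your own API call."""
    raise NotImplementedError

def vote_on_answers(prompt: str, n_samples: int = 5) -> str:
    """Sample several independent answers and return the most common one.
    Any 'deliberation' here happens outside the model, in the orchestration."""
    answers = [generate(prompt) for _ in range(n_samples)]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
```

Everything that looks like slow thinking in this pattern lives in the scaffolding, not in the model, which is exactly why such designs tend to hit the 'bitter lesson' ceiling.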