<aside> <img src="/icons/paste_gray.svg" alt="/icons/paste_gray.svg" width="40px" /> About: This is an early draft from a bigger piece. Comments welcome.

</aside>


How ChatGPT does not read: An unhelpful mental model

When people imagine a Large Language Model like the one powering ChatGPT reading through a text to answer their question, they have an image in mind of a little bot going through the text word by word, taking notes and then writing a little report.

This is not what it is doing at all. The LLM does not do anything resembling what a human does when they read a text. What it is doing is weird and hard to wrap your head around. And what's worse, knowing how it reads text does not necessarily help you predict exactly what it will and will not be able to do.

Two settings

And what's even worse is that in different settings, it uses very different approaches to "reading" your text. And they're both weird in different ways. The two settings are:

  1. Context window: the text that the LLM can see at once as a prompt. This can be quite long, but rarely longer than a short book; different models have different limits, but most of the time, think the length of a newspaper article or, at best, an academic paper. This will usually happen inside a chat.
  2. RAG (Retrieval Augmented Generation): a text or collection of texts that is too long to fit into the context window. For example, the files you might upload into a custom GPT, Poe bot or a Claude project. The LLM does not do this on its own; it has to be marshalled into it by the programmers of the interface.
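The split between the two settings can be sketched as the routing decision the interface (not the LLM itself) makes. Everything here is illustrative: the function name, the limit, and the characters-per-token heuristic are made up for the sketch.

```python
# Illustrative sketch: an interface decides whether a text fits into the
# context window or must go through RAG. The limit and the ~4 characters
# per token heuristic are rough, made-up numbers.

def route_text(text: str, context_limit_tokens: int = 8000) -> str:
    approx_tokens = len(text) / 4          # crude estimate for English text
    if approx_tokens <= context_limit_tokens:
        return "context-window"            # paste the whole text into the prompt
    return "rag"                           # chunk, index, retrieve relevant parts
```

The real decision is made by whoever builds the interface; the LLM only ever sees whatever ends up in its prompt.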

How does an LLM answer questions about your prompt? Attention is all it needs

Metaphor

So, this is weird. The way an LLM reads the text you give it is not the way you would do it. The closest comparison to something a person may have experienced is writing down a phone number. Unless they really focus, most people cannot write down a whole phone number after a single look. So they glance at the number, try to recognise a pattern, write down as many digits as possible, glance again and write down some more. They combine attention and pattern recognition in a superfast process.

That's exactly what an LLM is doing. It glances at the whole text as you would glance at a phone number, tries to quickly recognise a bunch of patterns, writes down one tiny bit in response, glances again, writes down another bit, glances again (including the things it has already written down), and so on. Usually, these bits are parts of words or sequences of characters called tokens.
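Those token chunks can be illustrated with a toy tokenizer. Real tokenizers (such as BPE) learn their vocabulary from data; this greedy longest-match over a made-up vocabulary is only a sketch of the idea that text gets split into reusable sub-word pieces.

```python
# Toy illustration of tokenization: greedily match the longest known chunk.
# The vocabulary here is invented; real models learn theirs from huge corpora.

VOCAB = {"un", "help", "ful", "read", "ing", " ", "model"}

def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible chunk first, then shorter ones.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                match = piece
                break
        if match is None:
            match = text[i]        # unknown character: fall back to one char
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("unhelpful", VOCAB))   # ['un', 'help', 'ful']
```

Notice that "unhelpful" is not one unit to the model: it is three chunks, which is why LLMs can sometimes struggle with tasks like counting letters in a word.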

Reality

This metaphor is fairly general and leaves out many important details. The three concepts that are key to understanding here are:

  1. Tokens: the chunks of words that the LLM sees and generates.
  2. Embeddings: large vectors of numbers representing the 'meaning' of each token, which are what the LLM actually works with.
  3. Attention: the process through which the LLM decides which tokens matter most for deciding what to generate next.
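To make the attention step slightly more concrete, here is a minimal sketch of scaled dot-product attention with toy numbers standing in for embeddings. Real models also apply learned query, key and value projections to the vectors first; those are omitted here for brevity.

```python
import numpy as np

# Minimal sketch of scaled dot-product attention. Each row of `x` stands in
# for one token's embedding vector; the numbers are made up.

def attention(x: np.ndarray) -> np.ndarray:
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ x                               # each token becomes a weighted mix

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three toy "token" vectors
mixed = attention(x)
```

The output has the same shape as the input: every token's vector is replaced by a mix of all the vectors, weighted by how relevant the model judged each one to be.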

All of these are tied together through many, many layers of engineering and math glue. For example, the token vectors are multiplied by each other in one layer to determine which ones are more relevant. But that often produces weirdly large numbers, so every so often there's a normalisation layer that rescales the values back into a sensible range (kind of like normalisation in audio editing).
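A minimal sketch of such a normalisation layer, assuming the common layer-normalisation form: re-centre each vector around zero and rescale it to a standard spread. (Real transformers also add learned gain and bias parameters, omitted here.)

```python
import numpy as np

# Sketch of layer normalisation: shift a vector to mean 0 and scale it to
# standard deviation 1, so later layers see values in a stable range.

def layer_norm(v: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    return (v - v.mean()) / np.sqrt(v.var() + eps)

v = np.array([1.0, 2.0, 300.0])   # one weirdly large value
normed = layer_norm(v)
```

After normalisation, the outlier is no longer hundreds of times larger than its neighbours, which keeps the repeated multiplications from blowing up.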

Or, when the tokens from the prompt are sent to the next layer, the system has no idea what order they are in. So the order has to be injected back in through a layer called a positional embedding.
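One standard way to inject the order back in is the sinusoidal positional encoding from the original Transformer paper; many GPT-style models instead learn their position vectors, but the idea is the same: every position gets its own distinctive vector, which is simply added to the token's embedding.

```python
import numpy as np

# Sinusoidal positional encoding: each position gets a unique vector of
# sines and cosines at different frequencies, added to the token embedding.

def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    positions = np.arange(num_positions)[:, None]            # (positions, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    angles = positions * freqs                               # (positions, dim/2)
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(angles)    # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)    # odd dimensions: cosine
    return enc

pe = positional_encoding(10, 8)      # position vectors for 10 tokens
```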

And finally, when the system has gone through about 30 layers of vector multiplication and normalisation (the exact number varies by model), the result is converted into a probability distribution over all 50-100 thousand tokens in the vocabulary, the top k are kept, and the system essentially rolls weighted dice to pick one.
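That last step, assuming top-k sampling with made-up scores, can be sketched as:

```python
import numpy as np

# Sketch of the final step: turn the model's raw scores ("logits") into
# probabilities with softmax, keep only the top-k tokens, and roll weighted
# dice to pick one. The logits here are invented for illustration.

def sample_top_k(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    top = np.argsort(logits)[-k:]                  # indices of the k highest scores
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                           # softmax over the top-k only
    return int(rng.choice(top, p=probs))           # the weighted "dice roll"

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, -1.0, 1.5, 0.3])     # pretend vocabulary of 5 tokens
token_id = sample_top_k(logits, k=2, rng=rng)
```

The dice are loaded: higher-scoring tokens are picked more often, but not always, which is why asking the same question twice can give different answers.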

And all of this does not generate the whole text, just one more token. Once that token is added to what's already there, the system forgets everything that came before, takes a quick glance at the whole text (including the token it just added) and starts over.
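The whole loop can be shown in miniature with a toy stand-in for the model: given the text so far, it looks up the next token in a canned table. Real models replace the lookup with the full stack of layers described above, but the loop itself, append the new token, re-read everything, repeat, is the actual shape of generation.

```python
# Toy autoregressive loop. The canned table stands in for a real forward pass.

CANNED = {"": "The", "The": " cat", "The cat": " sat", "The cat sat": "."}

def toy_next_token(text):
    # Stand-in for the full pass over the whole text so far.
    return CANNED.get(text)

def generate(prompt=""):
    text = prompt
    while True:
        token = toy_next_token(text)
        if token is None:      # a real model would emit a special end-of-text token
            return text
        text += token          # the new token becomes part of the next "glance"
```

Each trip around the loop produces exactly one token, and each trip re-reads everything, including what the model itself just wrote.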