LLM Token Context Windows and the Attention Architecture
If you’ve ever had a long conversation with an LLM and noticed it “forgot” something you said earlier, you’ve hit the context window limit. It’s the model’s working memory — a fixed-size sliding window over the conversation. But the story of how that window works, why expanding it is so hard, and how different labs approach the problem is one of the most interesting technical challenges in modern AI.
Let’s go under the hood.
What Is a Context Window?
Every interaction with an LLM starts the same way: your text is broken into tokens — subword units that the model actually processes. The model can only “see” a fixed number of these tokens at once. That’s the context window.
If the window is 128K tokens, the model can attend to roughly 100,000 English words at a time. Everything beyond that simply doesn’t exist as far as the model is concerned. Earlier messages scroll out of view, and the model’s responses are generated from whatever remains.
But the window isn’t just a buffer. Every token in it interacts with every other token through the attention mechanism — and that’s where things get computationally expensive.
The Attention Mechanism: Why Every Token Talks to Every Other Token
Here’s the central idea: when an LLM reads a sentence, every token has to figure out which other tokens are relevant to its meaning. Take the sentence:
The trophy didn’t fit in the suitcase because it was too big.
What does “it” refer to — the trophy or the suitcase? To resolve that, the model lets the word “it” compare itself against every other word in the window and decide which ones matter most. That comparison process — running in parallel for every token, not just “it” — is attention.
A Tiny Search Engine, Once Per Token
The cleanest way to think about attention is as a small search engine running inside the model, once for every token. For each token, the model produces three things:
- A Query — what this token is looking for
- A Key — a “tag” describing what this token contains
- A Value — the actual information the token will contribute if someone attends to it
Then for any given token, the model takes its Query and matches it against the Keys of every token in the window. The closer the match, the more weight that token’s Value gets in the final result. It’s exactly like typing into a search engine and getting back a ranked list of documents — except it happens for every token, in parallel, at every layer of the model.
In code, the whole thing is three lines:
scores = Q @ K.T / sqrt(d_k)        # how well does each query match each key? (an n x n grid of scores)
weights = softmax(scores, axis=-1)  # turn each row of scores into probabilities that sum to 1
output = weights @ V                # weighted sum of values: one blended vector per token
The middle step — softmax — just turns the raw match scores into a set of probabilities that sum to 1. Each token has a fixed “attention budget” and spreads it across the other tokens based on how well they match.
The thing to notice is the shape of the intermediate result: n × n, where n is the number of tokens. That’s one entry for every pair of tokens in the window. We’ll come back to why that matters in a moment.
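If you want to see that shape for yourself, here is a minimal, self-contained version of the same three lines. The softmax helper, the toy dimensions, and the random inputs are purely illustrative:
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n): one score for every pair of tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1: that token's attention budget
    return weights @ V                        # one blended value vector per token

n, d_k, d_v = 6, 4, 4                         # a six-token "window" with tiny vectors
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
print((Q @ K.T).shape)                        # (6, 6): the n x n grid of pairwise scores
print(attention(Q, K, V).shape)               # (6, 4): the final output is back to one vector per token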
Multi-Head Attention: Many Searches in Parallel
A single round of “every token looks at every other token” turns out not to be enough. Language has too many things going on at once — grammar, meaning, position, references across paragraphs — and one attention computation has to compromise between all of them.
So instead of running a single search, models run many in parallel. Each parallel search is called an attention head.
You can think of a head as an independent copy of the entire Q/K/V machinery from the previous section: it has its own Query, Key, and Value projections, with its own learned weights. All the heads see the same input tokens, but because each one has different weights, each one ends up learning to look for something different. It’s like having a panel of specialists read the same paragraph at the same time — one watching for grammar, one for meaning, one for position — and then merging their notes at the end.
In practice, the kinds of patterns heads tend to specialize in include:
- Grammar — which verb goes with which subject
- Meaning — which words refer to similar concepts
- Position — what came right before or right after
- Long-range references — which paragraph is “this” pointing back to
A single layer of a modern frontier model typically has dozens of these heads, and a model has dozens of layers stacked on top of each other — so the total number of heads in the model can run into the thousands. After each layer’s heads have done their work, their outputs are concatenated and projected back into a single representation per token, which then feeds into the next layer. This parallelism is what gives attention its expressive power — and most of its cost.
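To make the mechanics concrete, here is a sketch of one multi-head layer in NumPy. The split-project-merge pattern is the standard one; the weight matrices and sizes are random stand-ins rather than anything a real model learned:
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    # X: (n, d_model) token representations. Wq/Wk/Wv/Wo: (d_model, d_model) stand-in weights.
    n, d_model = X.shape
    d_head = d_model // n_heads
    def split(M):                                              # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)      # every head gets its own Q, K, V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)        # (n_heads, n, n): one grid per head
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                  # softmax, per head, per row
    heads = weights @ V                                        # (n_heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d_model)      # concatenate the specialists' notes
    return merged @ Wo                                         # project back to one vector per token

d_model, n_heads, n = 64, 8, 10
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
X = rng.normal(size=(n, d_model))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)  # (10, 64): same shape in, same shape out
In a real model, every one of those dozens of layers has its own set of these projection matrices.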
The Quadratic Wall: Why Expanding Context Is Hard
Remember that n × n matrix? That’s where the cost comes from.
Think of it like a meeting. With 10 people in a room, there are 45 possible pairs of people who could talk. With 20 people, 190 — not double, but four times as many. With 100, almost 5,000. The cost of “everyone talks to everyone” doesn’t grow with the number of people, it grows with the number of pairs — which is roughly the number of people, squared. Computer scientists call this quadratic scaling (written as O(n²): cost grows as the square of n).
Attention has exactly this problem. Every token compares itself to every other token, so the cost scales with the square of the window size:
- Doubling the window → 4× the work and memory
- 4× the window → 16× the work and memory
- Going from a 32K to a 1M token window → roughly 1,000× the work
To put numbers on it, the matrix of attention scores alone — for a single attention head, at a 128K window — takes about 64 GB of memory at full precision. A real model has dozens of layers and dozens of heads stacked on top of that.
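Those figures are easy to reproduce; the window size and precision below are the ones from the example above:
window = 128 * 1024                  # a 128K-token window
entries = window ** 2                # one attention score per pair of tokens
print(entries * 4 / 2**30)           # fp32 (4 bytes per score): 64.0 GiB, for one head at one layer
print((2 * window) ** 2 / entries)   # double the window and the matrix has 4.0x the entries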
This is the fundamental tension: users want larger windows, but the math fights back. Every lab solves the same problem differently.
Pushing Past the Wall: Compression and Interpolation
There’s no way to make attention non-quadratic without changing the underlying algorithm — but there are clever ways to shrink the constant factors and to extend a model’s effective context beyond what it was trained on. Two ideas have done most of the heavy lifting in production models: latent attention for memory, and RoPE interpolation for position.
Multi-Head Latent Attention (MLA)
Each attention head normally keeps its own full set of Keys and Values for every token in the window. With a hundred-plus heads, that adds up fast. Multi-Head Latent Attention (introduced by DeepSeek in their V2 model) is a compression trick: instead of every head storing its own full K and V, the model stores a single compact “summary” per token, and each head unpacks that summary into its own K and V on the fly.
Think of it like a ZIP archive. Rather than keeping a hundred separate copies of the same file, you keep one compressed copy and unpack what you need, when you need it.
The compressed summary is much smaller than the full per-head Keys and Values. For a model with 128 heads, each producing 128-dimensional vectors, the full per-token storage works out to 128 × 128 × 2 = 32,768 numbers. Compressing to a 512-number summary cuts that by ~64× — paid for in a tiny bit of extra compute when each head unpacks its share.
This is what makes million-token context windows feasible at inference time. Without compression of some kind, the storage alone would run into terabytes.
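A rough sketch of the mechanism, using the sizes from the example above. The projection matrices are random stand-ins, d_model is illustrative, and a single down-projection with per-head up-projections is a simplification of DeepSeek's actual parameterization:
import numpy as np

n_heads, d_head, d_latent, d_model = 128, 128, 512, 4096      # d_model here is illustrative
rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, d_latent)) * 0.02           # compress: one small summary per token
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # unpack into per-head Keys on demand
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # ...and per-head Values

x = rng.normal(size=(d_model,))                   # one token's hidden state
latent = x @ W_down                               # 512 numbers: all the cache keeps for this token
K = (latent @ W_up_k).reshape(n_heads, d_head)    # rebuilt only when attention actually needs them
V = (latent @ W_up_v).reshape(n_heads, d_head)
print(latent.size, 2 * n_heads * d_head, 2 * n_heads * d_head // latent.size)   # 512 32768 64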
YaRN: Teaching the Model About Position
Attention by itself doesn’t know anything about order. To attention, “the cat sat on the mat” and “mat the on sat cat the” are the same set of tokens. To fix that, the model adds a positional signal to each token — a kind of stamp that says “I’m token #42.”
The most popular way to do this today is RoPE (Rotary Position Embedding). It rotates pairs of components in each token's vector by angles that depend on the token's position in the sequence, and different pairs rotate at different speeds: some spin quickly, which is useful for encoding short-range order, while others spin slowly, which is useful for encoding long-range position. Tokens that are close together end up at similar angles; tokens far apart end up at very different angles, and the model learns to read the angle difference as a distance cue.
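A simplified sketch of the rotation itself, in plain NumPy. The pairing of dimensions and the base of 10,000 follow the original RoPE formulation, but nothing here is tied to a particular model:
import numpy as np

def rope_angles(position, dim, base=10000.0):
    # Pair i of dimensions rotates by position * base^(-2i/dim) radians:
    # low i spins fast (short-range signal), high i spins slowly (long-range signal).
    return position * base ** (-np.arange(0, dim, 2) / dim)

def apply_rope(x, position, base=10000.0):
    # Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... of one token's vector.
    angles = rope_angles(position, x.size, base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

print(apply_rope(np.ones(8), position=0))      # position 0: no rotation, vector unchanged
print(apply_rope(np.ones(8), position=1000))   # a distant position lands at very different angles
The payoff is that the dot product between a rotated Query and a rotated Key depends only on the gap between their positions, which is exactly the distance cue attention needs.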
The catch: a model trained with RoPE on, say, 128K tokens has only ever seen rotation angles up to position 128,000. Ask it to handle position 500,000 and the rotations land in territory the model has never encountered — and quality falls off a cliff.
YaRN (Yet another RoPE extensioN) is a recipe for stretching RoPE to longer contexts without retraining from scratch. It does two things:
- Stretches the rotations carefully. Rather than uniformly squashing every angle to fit a longer window, YaRN preserves the fast-changing rotations (which encode short-range relationships, like “what’s the next word?”) and only stretches the slow-changing ones (which encode long-range position, like “which paragraph are we in?”).
- Re-sharpens attention. When the window gets very long, attention probabilities tend to spread out and become indecisive. YaRN nudges the math back toward a sharper, more confident distribution.
The result: a model trained on a 128K context can be extended to far longer contexts with light fine-tuning and little to no loss of short-range performance.
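The first of those two ideas can be sketched as a per-dimension blend between the original RoPE frequencies and their stretched versions. The wavelength thresholds below are invented for illustration; the real YaRN recipe derives its ramp from how many full rotations each dimension completed during pre-training, and applies the re-sharpening separately:
import numpy as np

def stretched_frequencies(dim, scale, base=10000.0,
                          fast_wavelength=256, slow_wavelength=8192):
    # Hypothetical thresholds, purely for illustration.
    freqs = base ** (-np.arange(0, dim, 2) / dim)    # the standard RoPE rotation speeds
    wavelengths = 2 * np.pi / freqs                  # how many tokens one full rotation takes
    # ramp: 0 = keep as-is (fast, short-range); 1 = fully stretch (slow, long-range)
    ramp = np.clip((wavelengths - fast_wavelength) /
                   (slow_wavelength - fast_wavelength), 0.0, 1.0)
    return freqs * (1 - ramp) + (freqs / scale) * ramp   # blend original and stretched per dimension

print(stretched_frequencies(dim=128, scale=4.0)[:3])    # fastest dimensions: essentially untouched
print(stretched_frequencies(dim=128, scale=4.0)[-3:])   # slowest dimensions: divided by the scale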
Scale vs. Fidelity
Even with MLA and YaRN, raw context size is only half the story. A million-token window is useless if the model can’t reliably retrieve information from the middle of it. This is what the Needle-in-a-Haystack benchmark measures: place a random fact (“the best coffee is in San Cristóbal”) at a random position in a long document, then ask the model to retrieve it.
Empirically, the bigger the window, the harder this gets. Each token has a fixed “attention budget” to spread across all the others, and the more tokens there are to spread it across, the less attention any single one gets — so retrieval accuracy tends to drop in the deeper regions of a long window, especially when the model wasn’t extensively trained at that length. Shorter windows, with more compute budgeted per token and longer training on long-context data, tend to hold up better across their full range.
So labs face the same tradeoff every time: optimize for the size of the window, or for the fidelity of recall within it. Compression and interpolation tricks buy more size cheaply, but they don’t automatically buy fidelity — that has to be earned in training. There’s no architectural free lunch.
The KV Cache: The Silent Bottleneck
Underneath every approach to long context lies one practical bottleneck: the KV cache.
When an LLM is generating a response, it produces tokens one at a time, left to right. Each new token has to attend to every token that came before it. Without any caching, that would mean recomputing the Keys and Values for the entire conversation every single time a new token comes out, redundant work that adds up quadratically over the course of a response.
The fix is obvious: compute the Keys and Values once, store them, and reuse them. That stash is the KV cache.
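In sketch form (single head, no layers or batching, NumPy as before, all names illustrative), one decoding step with a cache looks roughly like this:
import numpy as np

cache_K, cache_V = [], []                        # grows by one entry per token processed

def decode_step(h_new, Wq, Wk, Wv):
    # Only the NEW token gets projected; everything already in the cache is reused.
    q = h_new @ Wq
    cache_K.append(h_new @ Wk)
    cache_V.append(h_new @ Wv)
    K, V = np.stack(cache_K), np.stack(cache_V)  # (tokens_so_far, d_head)
    scores = q @ K.T / np.sqrt(K.shape[-1])      # linear in tokens_so_far, not quadratic
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                           # attention output for the new token

d_model, d_head = 64, 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
for h in rng.normal(size=(5, d_model)):          # pretend we decode five tokens
    out = decode_step(h, Wq, Wk, Wv)
print(len(cache_K), out.shape)                   # 5 cached Keys so far, output shape (16,)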
The catch is that the cache is huge. Its size is the product of:
2 (one for K, one for V)
× number of layers
× number of heads
× dimensions per head
× tokens in the conversation
For a fairly typical big model (96 layers, 128 heads, 128 dimensions per head, 200K tokens of context) that multiplies out to about 630 billion numbers. At fp16 (16-bit floats, the precision most production models run at), that's roughly 1.3 TB of memory just to hold the conversation so far.
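Spelling that multiplication out (the model shape is the hypothetical one above, not any particular product's):
layers, heads, d_head, tokens = 96, 128, 128, 200_000
numbers = 2 * layers * heads * d_head * tokens    # K and V, at every layer, for every head and token
print(f"{numbers / 1e9:.0f} billion numbers, {numbers * 2 / 1e12:.2f} TB at fp16")   # 629 billion, 1.26 TB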
This is why compression schemes like MLA matter so much in practice. Without them, the KV cache dominates inference cost long before the model’s actual weights do. The hard ceiling on how long a conversation can be — for any given GPU — is set by the cache, not by the model itself.
The Next Frontier
A few lines of research are trying to escape the n² wall altogether:
Ring Attention
Instead of holding the entire sequence (and its giant attention computation) on a single GPU, Ring Attention splits the sequence into blocks across many GPUs arranged in a ring. Each GPU computes attention for its own block against the Keys and Values it currently holds, then passes those Keys and Values to its neighbor while receiving the next batch, like runners handing off a baton. The total amount of work is the same; the memory is distributed across the cluster. This is what makes context windows in the tens of millions of tokens possible, at the cost of a serious GPU bill.
Sliding Window + Global Attention
Most language doesn’t actually need every token to attend to every other token. Local context — the surrounding sentence or paragraph — usually carries most of the signal. Sliding window attention restricts each token to looking only at its nearest neighbors (say, the last few thousand tokens), which knocks the cost from quadratic down to roughly linear in the window size. A small number of “global” tokens (start markers, separators) are still allowed to attend to everything, so the model can pull in distant information when it needs to. Mistral and Gemma use variants of this.
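One way to picture it is as a mask over the n × n score grid: a token may only look where the mask allows. A toy sketch follows; the window length and the choice of global positions are made up:
import numpy as np

def local_global_mask(n, window=4, global_positions=(0,)):
    # True = this query position may attend to this key position.
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    local = (k <= q) & (q - k < window)                      # each token sees its nearest predecessors
    global_cols = np.isin(k, global_positions) & (k <= q)    # plus the designated "global" tokens
    return local | global_cols

print(local_global_mask(8, window=3).astype(int))            # rows: queries; columns: keys they may see
In the attention computation, scores at disallowed positions are set to negative infinity before the softmax; efficient implementations simply never compute or store them, which is where the savings come from.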
State Space Models
What if attention isn’t the right tool at all? Mamba and other state space models (SSMs) replace it entirely. Instead of comparing every token to every other token, they sweep through the sequence once, maintaining a small running summary of everything seen so far — a bit like reading a book and updating your mental model as you go, rather than re-reading from page one every time. Cost grows linearly with n, not quadratically, which in principle enables “infinite” context. The downside: the running summary is highly compressed, and SSMs currently lag attention-based models when the task requires retrieving a precise fact from deep in a long context.
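For contrast with the attention snippets above, here is a generic (non-selective) state space recurrence. Mamba's actual contribution, input-dependent dynamics and a hardware-efficient scan, is not captured here; this only shows the linear-cost shape of the computation:
import numpy as np

def ssm_scan(xs, A, B, C):
    # One left-to-right sweep. A small hidden state is updated once per token,
    # and each output is read off that state. Cost grows linearly with len(xs).
    h = np.zeros(A.shape[0])
    outputs = []
    for x in xs:                  # xs: the sequence of input vectors
        h = A @ h + B @ x         # fold the new token into the running summary
        outputs.append(C @ h)     # answer from the compressed summary, not from all past tokens
    return np.stack(outputs)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(16)                                          # a stable, boring state transition
B, C = rng.normal(size=(16, 8)) * 0.1, rng.normal(size=(4, 16)) * 0.1
print(ssm_scan(rng.normal(size=(100, 8)), A, B, C).shape)     # (100, 4), after a single linear pass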
Hybrid Memory
The most promising direction may be hybrid architectures that pair a transformer core (good at fine-grained reasoning over a short window) with an external memory module (a searchable store of past tokens). Think of it as giving the model a notebook — separating the small, fast “working memory” of the context window from a large, slower “long-term memory” stored elsewhere.
Summary
The context window isn’t just a number on a spec sheet — it’s the resolution of a fundamental architectural tradeoff between scale and fidelity. Compression schemes like MLA and interpolation tricks like YaRN have pushed practical context windows past a million tokens, but every gain in size has to be paid for somewhere — usually in recall accuracy at the deeper end of the window.
These approaches are all stopgaps. The quadratic cost of attention is an evolutionary bottleneck, not a destination. Ring attention, sliding windows, state space models, and hybrid memory architectures are all racing to break through it. The architecture that cracks linear-cost context scaling without sacrificing retrieval quality will redefine what “remembering” means for AI.
References
- DeepSeek-V2 Technical Report — Original Multi-Head Latent Attention paper
- YaRN: Efficient Context Window Extension — RoPE interpolation for long-context LLMs
- RoFormer: Enhanced Transformer with Rotary Position Embedding — The RoPE paper
- Ring Attention with Blockwise Transformers — Distributed attention for near-infinite context
- Mamba: Linear-Time Sequence Modeling — State space models as an alternative to attention