
The Transformer

April 15, 2026


You know how text becomes vectors (Part 1). The transformer takes a sequence of token vectors and produces a sequence of contextualized token vectors, meaning each output now encodes not just what that token means alone but what it means given every other token in the sequence. Everything else in this post is the “how.”

Self-attention: the core mechanism

Self-attention is the single most important idea in modern AI; if you understand one thing from this series, it should be this.

Every token in the sequence needs to “talk to” every other token to build context, and self-attention is the mechanism that makes it happen. It works through three learned projections called Query, Key, and Value.

Intuition (The dictionary analogy)

Self-attention is a differentiable dictionary lookup. Every token generates a query (“what am I looking for?”), a key (“what do I contain?”), and a value (“what do I provide if you attend to me?”). The query searches across all the keys, and the result is a weighted blend of all the values, with weights proportional to how well each key matched the query.

Think of a search engine: your search query gets matched against an index (keys), and you get back a blend of the relevant content (values).

Here’s the math, step by step:

Project into Q, K, V

Each token’s embedding vector gets multiplied by three different learned weight matrices to produce a query vector $Q$, a key vector $K$, and a value vector $V$. These projections are learned during training; the model figures out what kinds of “questions” and “answers” are useful.

For a sequence of $n$ tokens with embedding dimension $d$, this gives us three matrices: $Q$, $K$, and $V$, each of shape $n \times d_k$ (where $d_k$ is the dimension of each head).

Compute attention scores

Take the dot product of each query with every key: $QK^T$. This produces an $n \times n$ matrix where entry $(i, j)$ is how much token $i$‘s query matches token $j$‘s key. High dot product means strong match means “pay attention to this token.”

Scale

Divide by $\sqrt{d_k}$. This prevents the dot products from growing too large as the dimension increases. Without scaling, the softmax in the next step saturates, pushing all the weight onto one token and killing the gradients. It’s just normalization.

Softmax

Apply softmax across each row to turn the scores into probabilities. Now each row sums to 1, and each entry represents the fraction of attention token $i$ pays to token $j$. This is the attention pattern: a probability distribution over the sequence for each token.

Weighted sum of values

Multiply the attention weights by the value matrix $V$. Each token’s output is a weighted blend of every other token’s value, weighted by how much attention it paid. Tokens that got high attention contribute more to the output.

The full equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
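The five steps above collapse to a few lines of code. Here is a minimal NumPy sketch with toy dimensions; the random weight matrices are stand-ins for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a sequence X of shape (n, d)."""
    Q = X @ W_q                          # (n, d_k) queries
    K = X @ W_k                          # (n, d_k) keys
    V = X @ W_v                          # (n, d_k) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) match scores, scaled
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # contextualized outputs + attention pattern

# Toy example: 4 tokens, embedding dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = attention(X, W_q, W_k, W_v)
print(out.shape)             # (4, 4)
print(weights.sum(axis=-1))  # each row sums to 1.0
```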

Example (Pronoun resolution through attention)

Consider: “The cat sat on the mat because it was tired.”

When processing “it,” the model’s self-attention needs to figure out what “it” refers to. The query for “it” asks “what entity am I?”; the key for “cat” responds with “I’m an entity that does things.” The attention weight from “it” to “cat” should be high, and from “it” to “mat” should be low.

After attention, the vector for “it” now contains information drawn from “cat” (it has been contextualized). The model has resolved the pronoun reference not through a rule but through learned attention patterns, and this is one of the things that makes transformers so powerful: they learn to solve language problems by learning which tokens to attend to.

Remark (On the $\sqrt{d_k}$ scaling)

This is a detail that’s often glossed over or presented as mysterious, but it’s straightforward. Dot products grow with the dimension of the vectors. If $d_k = 64$ and the components of the query and key are independent with mean 0 and variance 1, their dot product has standard deviation $\sqrt{64} = 8$. Without dividing by $\sqrt{d_k}$, the softmax inputs would be large, softmax would produce near-one-hot distributions (almost all weight on one token), and gradients would be tiny. Scaling keeps the distribution well-behaved, for the same reason you normalize things in statistics.
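This is easy to verify numerically. The sketch below samples many random query/key pairs with unit-variance components: the dot-product standard deviation tracks $\sqrt{d_k}$, and the scaled version stays near 1 regardless of dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    # Components drawn i.i.d. with mean 0, variance 1
    q = rng.normal(size=(100_000, d_k))
    k = rng.normal(size=(100_000, d_k))
    dots = (q * k).sum(axis=-1)
    # Empirical std of q·k is ~sqrt(d_k); dividing by sqrt(d_k) brings it back to ~1
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
```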

Multi-head attention

One attention head can only focus on one “type” of relationship at a time, which is limiting. In a sentence like “The nervous cat quickly jumped over the lazy brown dog,” there are multiple things going on at once:

  • Syntactic structure: “cat” is the subject of “jumped”
  • Adverbial modification: “quickly” modifies “jumped”
  • Adjective-noun binding: “nervous” goes with “cat,” “lazy brown” goes with “dog”
  • Prepositional phrase: “over” connects “jumped” to “dog”

One attention pattern can’t capture all of these at once, so the solution is to run multiple attention heads in parallel, each with its own learned $Q$, $K$, $V$ projections.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
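The equation above can be sketched in NumPy by projecting once and splitting the result into heads (toy sizes, random weights standing in for learned ones):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model). W_q/W_k/W_v/W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // n_heads

    def project_and_split(W):
        # (n, d_model) -> (n_heads, n, d_k): each head gets its own slice
        return (X @ W).reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = project_and_split(W_q), project_and_split(W_k), project_and_split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (n_heads, n, n)
    heads = softmax(scores) @ V                            # (n_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
W = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *W, n_heads=4)
print(out.shape)  # (6, 16)
```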

One head might learn to track syntactic dependencies, attending from verbs to their subjects and objects to build sentence structure. The attention pattern for “jumped” would concentrate on “cat” (subject) and “dog” (object).

Another head might learn a simple proximity pattern: attend to the tokens immediately before and after. This captures local structure (adjective-noun pairs, compound words, punctuation patterns) and is not sophisticated but is reliably useful.

A third head might track semantic similarity regardless of position. If “cat” and “dog” appear in the same sentence, this head might attend between them because they’re semantically related (both animals), even though they’re far apart.

GPT-3 uses 96 attention heads per layer, each learning a different pattern. The concatenated output captures a richer view of the relationships in the sequence than any single head could.

Tip (Why this matters in practice)

Multi-head attention is part of why LLMs can handle so many different tasks. Different heads specialize for different kinds of reasoning, and the model routes information through whichever heads are relevant. This is also why pruning attention heads (removing some to make the model faster) works surprisingly well; many heads are redundant or only useful for niche tasks.

The full transformer block

Self-attention is one layer in a transformer block. A complete block also includes:

  1. Layer normalization: stabilizes training by normalizing the activations. Without it, the model’s internal values can drift wildly during training.
  2. Feed-forward network (FFN): two linear transformations with a nonlinearity (usually GeLU) in between. This is where each token processes what it learned from attention.
  3. Residual connections: skip connections that add the input directly to the output. These help gradients flow during training and let the model preserve information from earlier layers.

The block formula:

$$x' = \text{LayerNorm}(x + \text{MultiHeadAttention}(x))$$
$$\text{output} = \text{LayerNorm}(x' + \text{FFN}(x'))$$
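A minimal sketch of one block, following the post-norm formula above. The `attn` argument here is a stand-in for multi-head attention, and the weights are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to mean 0, variance 1
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GeLU nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    # Attention sub-layer: residual connection, then LayerNorm
    x = layer_norm(x + attn(x))
    # FFN sub-layer: expand (usually to 4x), nonlinearity, project back
    ffn_out = gelu(x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ffn_out)

rng = np.random.default_rng(0)
d, d_ff = 16, 64
x = rng.normal(size=(6, d))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)
attn = lambda x: x  # identity stand-in for multi-head attention
out = transformer_block(x, attn, W1, b1, W2, b2)
print(out.shape)  # (6, 16)
```

One caveat: this follows the post-norm arrangement in the formula above; many recent models apply LayerNorm before each sub-layer (“pre-norm”) instead, but the residual-plus-normalize structure is the same.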

Intuition (Discussion and reflection)

The attention layer lets tokens talk to each other (a round of group discussion), while the FFN lets each token think about what it heard (a round of individual reflection). Stacking these alternating layers gives you repeated cycles of “gather information from the group” followed by “process that information privately.”

A GPT-4-class model has roughly 120 of these blocks stacked on top of each other, meaning 120 rounds of discussion and reflection. The depth is what gives the model its reasoning capacity.

Tip (Where knowledge lives)

The FFN inner dimension is often 4x the hidden dimension. In GPT-3 ($d_\text{model} = 12288$), the FFN inner dimension is 49,152. This is where most of the model’s parameters live, and research suggests this is where most “knowledge” is stored (factual associations, learned heuristics, and domain expertise). Attention handles routing and composition; the FFN handles the actual knowledge retrieval.
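The arithmetic behind this claim is quick to check. Ignoring biases, the FFN’s two projection matrices hold twice as many parameters per layer as the four attention projections:

```python
# Per-layer parameter counts at GPT-3 scale (d_model = 12288, FFN inner = 4 * d_model)
d_model = 12288
d_ff = 4 * d_model                   # 49,152
ffn_params = d_model * d_ff * 2      # up-projection + down-projection (biases ignored)
attn_params = d_model * d_model * 4  # W_q, W_k, W_v, and the output projection W_o
print(f"FFN: {ffn_params / 1e6:.0f}M, attention: {attn_params / 1e6:.0f}M per layer")
# FFN: 1208M, attention: 604M per layer
```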

Decoder-only vs. encoder-decoder vs. encoder-only

There are three transformer architectures. By 2026, one of them has decisively won.

Encoder-only (BERT)

Bidirectional attention: every token can see every other token, in both directions. Good for understanding tasks (classification, named entity recognition, sentiment analysis) because the model can look at the full context. Trained with masked language modeling (“fill in the blank”).

BERT can’t write a paragraph because it wasn’t trained to generate text left-to-right, so it’s not used for text generation.

Encoder-decoder

The original architecture. An encoder processes the input with bidirectional attention; a decoder generates the output autoregressively (one token at a time, left to right), attending to both previous output tokens and the encoder’s representation of the input.

Used for tasks with clear input-output structure like translation and summarization, and still used in some specialized systems, but not the dominant paradigm.

Decoder-only

Causal (left-to-right) attention: each token can only see tokens before it in the sequence. A triangular mask blocks attention to future tokens. The model is trained to predict the next token, always.

This is what ChatGPT, Claude, Gemini, Llama, and every other major LLM uses. One architecture, one objective (predict the next token), scaled to absurd sizes. The simplicity is the point.

Important (Decoder-only won)

Almost every major LLM in 2025-2026 is decoder-only. When people say “transformer,” they almost always mean a decoder-only transformer with a causal attention mask.

This has a direct practical implication: LLMs are better at continuing text than editing it. The architecture is designed to predict the next token, not to go back and revise previous tokens. This is why “rewrite this paragraph” sometimes feels worse than “write a paragraph about X”; continuation is native, revision is simulated.

The causal mask is simple: a triangular matrix where position $i$ can only attend to positions $j \leq i$. When the model is generating token 50, it conditions on tokens 1-49; later tokens don’t exist yet. This is what makes generation autoregressive; each new token is conditioned on all previous tokens.
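The mask is one line of NumPy. Applying it before softmax assigns exactly zero attention weight to every future position:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

n = 5
# Lower-triangular mask: position i may attend only to positions j <= i
mask = np.tril(np.ones((n, n), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))
# Blocked (future) positions get -inf, so softmax gives them exactly zero weight
masked = np.where(mask, scores, -np.inf)
weights = softmax(masked)

print(mask.astype(int))
# Row 0 attends only to position 0; row 4 attends to all five positions
print(np.round(weights, 2))
```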

Next-token prediction

After all the transformer blocks process the input, the final layer projects each position’s vector to the size of the vocabulary (typically 50,000-100,000+ tokens) and applies softmax. The result is a probability distribution over all possible next tokens.
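A sketch of that final projection step, using a hypothetical six-word vocabulary and random weights (real models project to 50,000+ tokens, and the hidden state comes from the transformer stack rather than a random draw):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Hypothetical tiny vocabulary; stands in for a real 50k-100k+ token vocabulary
vocab = ["Python", "the", "Rust", "a", "Java", "C"]
rng = np.random.default_rng(0)
d_model = 8
W_unembed = rng.normal(size=(d_model, len(vocab)))  # projection to vocabulary size
h_last = rng.normal(size=d_model)  # final hidden state at the last position

logits = h_last @ W_unembed  # one raw score per vocabulary token
probs = softmax(logits)      # probability distribution over possible next tokens
for tok, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{tok!r}: {p:.1%}")
```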

Example (What next-token prediction actually looks like)

Given the prompt “The best programming language is”, the model might output:

Token      Probability
“Python”   18.2%
“the”      9.7%
“Rust”     6.4%
“a”        5.1%
“Java”     4.8%
“C”        3.9%

There’s no single “right answer” here. The model outputs a probability distribution; “Python” is most likely, but “Rust,” “Java,” and “C” are all plausible continuations. The model doesn’t “believe” Python is the best; it predicts that text saying Python is the best is the most likely continuation of this prompt, based on its training data. This distinction matters.

The entire sophistication of ChatGPT, Claude, and Gemini emerges from training a transformer to predict the next token. The architecture is the same across all of them; what differs is the training data, the scale, and the post-training (which we’ll cover in Part 3).

Warning (Misconception: 'AI is just autocomplete')

You’ll hear people dismiss LLMs as “just autocomplete” or “stochastic parrots.” This is technically correct and practically useless, like calling a human brain “just electrochemistry.” Yes, next-token prediction is the mechanism, but the capabilities that emerge from doing next-token prediction at sufficient scale (reasoning, coding, translation, creative writing, mathematical proof) are genuinely new. Reductive descriptions of the mechanism don’t explain the capabilities. Nobody predicted that next-token prediction would produce systems that can write working code, and the fact that it does tells us something deep about the relationship between language prediction and intelligence.

KV-cache and the quadratic cost

There’s a practical detail that matters a lot for how you interact with LLMs: self-attention is $O(n^2)$ in the sequence length. Every token attends to every other token, so the computation grows quadratically with context length; double the context, quadruple the compute.

Explanation (Why long context is expensive)

For a 128K-token context window, the attention matrix has $128{,}000 \times 128{,}000 \approx 16.4$ billion entries per head per layer. This is why:

  • API pricing often charges per input token and per output token
  • Very long prompts are noticeably slower
  • Context window limits exist (the model was trained up to some maximum, and going beyond it requires architectural tricks)
  • Providers charge a premium for large-context models

To avoid redundant computation during text generation, models use a KV-cache: they store the key and value vectors for all previously generated tokens, so each new token only needs to compute attention against the cache rather than reprocessing the entire sequence. This is why the first token of a response is slower (processing the full prompt) but subsequent tokens are faster (incrementally extending the cache).
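A minimal single-head sketch of the idea, with a growing Python list standing in for the real cache tensors: each step computes one new key/value pair, appends it, and attends over everything cached so far instead of reprocessing the whole sequence.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Minimal single-head KV-cache sketch: append keys/values, never recompute them."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache this token's key and value, then attend over everything cached so far
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)    # (t, d_k): grows by one row per generated token
        V = np.stack(self.values)  # (t, d_k)
        weights = softmax(q @ K.T / np.sqrt(len(q)))
        return weights @ V         # (d_k,) contextualized output for the new token

rng = np.random.default_rng(0)
cache = KVCache()
for t in range(4):  # generate 4 tokens, one at a time
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
print(len(cache.keys))  # 4 cached key vectors
```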

Tip (Practical implications)

Knowing about the quadratic cost helps you write better prompts:

  • Front-load important context. Don’t pad your prompt with unnecessary filler. Every extra token makes everything slower.
  • Use summaries for long documents. Feeding a 50-page document into the context is possible but expensive and slow. Summarize or extract the relevant sections first.
  • Understand the “lost in the middle” problem. Research shows that models attend most strongly to the beginning and end of the context, and are worse at using information buried in the middle. This is a consequence of how attention patterns form during training. Put your most important instructions at the start or end, not in the middle.

You now understand the complete forward pass of a modern LLM:

  1. Tokenize: text → token IDs
  2. Embed: token IDs → vectors (lookup table + positional encoding)
  3. Transform: vectors → contextualized vectors (self-attention + FFN, repeated ~100 times)
  4. Predict: contextualized vector at the last position → probability distribution over next token
  5. Sample: pick a token from the distribution, append it, and repeat from step 2
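The steps above can be sketched as a generation loop. Everything here is a toy stand-in (random embedding and unembedding matrices, an identity `transform` in place of ~100 transformer blocks) just to show the control flow:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Toy stand-ins for the learned components of a real model
VOCAB_SIZE, D = 10, 8
rng = np.random.default_rng(0)
EMBED = rng.normal(size=(VOCAB_SIZE, D))    # step 2: embedding lookup table
UNEMBED = rng.normal(size=(D, VOCAB_SIZE))  # step 4: projection to vocabulary

def transform(vectors):
    # Step 3 stand-in: a real model applies ~100 transformer blocks here
    return vectors

def generate(token_ids, n_new):
    for _ in range(n_new):
        x = EMBED[token_ids]                       # embed the sequence so far
        h = transform(x)                           # contextualize
        logits = h[-1] @ UNEMBED                   # predict from the last position
        probs = softmax(logits)                    # distribution over next tokens
        next_id = rng.choice(VOCAB_SIZE, p=probs)  # sample
        token_ids = token_ids + [next_id]          # append and repeat
    return token_ids

print(generate([1, 2, 3], n_new=5))  # 3 prompt tokens + 5 generated token IDs
```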

The architecture is elegant and not that complicated; what makes it powerful is the scale of training, which is what we’ll cover next.