From Words to Vectors

April 15, 2026

Before we can talk about transformers, we need to talk about the problem they solve. Computers don’t understand words; they understand numbers. The whole history of NLP is really the history of people finding cleverer ways to turn words into numbers while keeping what the words mean.

Previously, on NLP

The earliest approach was embarrassingly simple: count the words. “Bag-of-words” throws every word in a document into a bag, counts the frequencies, and calls it a representation. TF-IDF is the same thing but slightly smarter, weighting words by how rare they are across documents.

Example (Why counting words doesn't work)

“The dog bit the man” and “the man bit the dog” produce identical bag-of-words vectors. Same words, same counts, opposite meanings. Word order is thrown away completely.
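The failure is easy to demonstrate; a minimal bag-of-words sketch needs nothing more than `collections.Counter`:

```python
from collections import Counter

def bag_of_words(sentence):
    # Lowercase, split on whitespace, count occurrences.
    # Word order is discarded entirely.
    return Counter(sentence.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")
print(a == b)  # True: identical representations, opposite meanings
```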

This is basically what keyword search does. When Google works well, it’s bag-of-words plus PageRank plus a lot of engineering; when it fails badly, it’s because meaning needs more than word frequency.

Then in 2013, Word2Vec changed everything. The idea was to represent each word as a dense vector (a list of, say, 300 numbers) in a space where nearness means likeness. You train it by predicting neighboring words in a sentence, so words that show up in similar contexts end up with similar vectors.

Intuition (Why Word2Vec works)

If “cat” and “dog” constantly appear near the same words (“pet,” “food,” “vet”), they end up at nearby points in vector space. The model doesn’t know what cats or dogs are. It just knows they’re used similarly. That’s enough.
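That geometry can be sketched with toy co-occurrence counts (the numbers below are made up purely for illustration):

```python
import numpy as np

# Toy counts of how often each word appears near the contexts
# ["pet", "food", "vet", "engine"] -- hypothetical numbers.
cat = np.array([10.0, 8.0, 6.0, 0.0])
dog = np.array([9.0, 9.0, 7.0, 0.0])
car = np.array([0.0, 0.0, 0.0, 12.0])

def cosine(u, v):
    # Cosine similarity: 1 means same direction, 0 means unrelated.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(cat, dog))  # close to 1: they share contexts
print(cosine(cat, car))  # 0: no shared contexts
```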

The famous result: $\text{king} - \text{man} + \text{woman} \approx \text{queen}$. The vector arithmetic captures semantic relationships, and yes, every Word2Vec explainer uses this example because it's a cliché. The more interesting thing is where it breaks.

Example (Where embeddings fail)

Try $\text{doctor} - \text{man} + \text{woman}$. In many embedding spaces, you don’t get “doctor” (you get “nurse”). The model learned the biases in its training data, not ground truth. Static embeddings bake in every bias present in the corpus, and there’s no way to fix it per-query because every word gets exactly one vector regardless of context.
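The analogy arithmetic itself is just vector addition and a nearest-neighbor search. Here is a toy sketch with hand-picked 2-D embeddings (hypothetical values chosen so the offsets line up; real spaces are high-dimensional and learned):

```python
import numpy as np

# Hand-picked 2-D "embeddings": one axis loosely tracks gender,
# the other royalty. Purely illustrative.
emb = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.0]),
    "queen": np.array([0.0, 1.0]),
    "apple": np.array([3.0, -2.0]),
}

target = emb["king"] - emb["man"] + emb["woman"]

# Nearest neighbor by Euclidean distance, excluding the query words.
candidates = {w: v for w, v in emb.items() if w not in ("king", "man", "woman")}
best = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))
print(best)  # queen
```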

That’s the bigger problem: one vector per word, period. “I went to the bank to deposit money” and “I sat on the river bank” give “bank” the same vector, with no context sensitivity at all.

Then came RNNs and LSTMs. These process words one at a time, keeping a hidden state that should carry context forward. In practice, the information decays exponentially; by the time you’re 50 words into a sentence, the model has essentially forgotten the beginning. They were also painfully slow because you can’t parallelize sequential processing.

This is why pre-2017 chatbots had such terrible memory: it wasn’t a design choice but a hard constraint of the architecture.

The attention breakthrough

In 2014, Bahdanau et al. had a simple but powerful idea for machine translation: instead of compressing the whole input sentence into a single hidden state vector, let the decoder look back at all the encoder states and decide which ones matter for each word it’s generating.

Intuition (The translation analogy)

Imagine translating a long sentence from French to English. You don’t memorize the whole sentence and then translate from memory. You glance back at the relevant part for each word you write. Attention is a mechanism that lets the model dynamically focus on different parts of the input.
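The core mechanism fits in a few lines of NumPy, assuming simple dot-product scoring (Bahdanau's original used a small learned network to score, but the shape of the idea is the same; all vectors below are made up):

```python
import numpy as np

# Three encoder hidden states, one per input word (made-up values).
encoder_states = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
# What the decoder is "looking for" while generating its current word.
query = np.array([1.0, 0.1])

scores = encoder_states @ query                   # similarity to each state
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
context = weights @ encoder_states                # weighted mix of the states

print(weights.round(3))  # most weight goes to the states most like the query
```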

This worked very well, but it was still coupled with the slow sequential processing of RNNs. The real breakthrough came in 2017, when Vaswani et al. asked: what if we only use attention? No recurrence, no sequential processing, just attention all the way down. That paper, “Attention Is All You Need,” introduced the transformer.

Attention Is All You Need
Vaswani et al., 2017. The paper that started everything.
Remark (Hot take: is the pre-transformer era worth knowing?)

The intuitions from embeddings are worth keeping (the idea that words can be represented as points in a space, and that geometry in that space captures meaning). That’s foundational.

But the RNN/LSTM mechanics? Not really. You don’t need to understand vanishing gradients to use or reason about modern LLMs. Most “history of NLP” content is padding to make the explainer feel thorough. I included a speedrun here because context helps, but if you forgot everything above this line, you’d be fine.

Tokens

Something that surprises most people: LLMs don’t see words. They see tokens.

Definition (Token)

A token is the fundamental unit of text that an LLM processes. It’s not a word, not a character, not a sentence. It’s a chunk produced by a tokenizer (typically a subword unit generated by Byte Pair Encoding or a similar algorithm). Common words might be a single token; uncommon words get split into multiple tokens. Punctuation is usually its own token.
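The BPE training loop is short enough to sketch on a toy corpus (real tokenizers operate on bytes over far larger corpora, but the merge rule is the same idea: repeatedly fuse the most frequent adjacent pair):

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to its corpus frequency.
    pairs = Counter()
    for symbols, count in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace every occurrence of the pair with a single fused symbol.
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = count
    return merged

# Toy corpus: word -> frequency, each word starting as individual characters.
words = {tuple("lower"): 5, tuple("lowest"): 3, tuple("newer"): 4}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(pair, "->", pair[0] + pair[1])
```

Frequent fragments ("we", "er", "lo") become single symbols while rare words stay split into pieces, which is exactly the behavior the tiktoken demo below shows on real text.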

Here’s what tokenization looks like in practice:

tokenization_demo.py
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
# Simple sentence
tokens = enc.encode("The transformer architecture is elegant.")
print([enc.decode([t]) for t in tokens])
# ['The', ' transform', 'er', ' architecture', ' is', ' elegant', '.']
# The classic failure
tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])
# ['str', 'awberry']
# Numbers are weird
tokens = enc.encode("123456789")
print([enc.decode([t]) for t in tokens])
# ['123', '456', '789']

Look at “strawberry.” It becomes two tokens: “str” and “awberry.” The model literally cannot see the individual letters, so when you ask “how many r’s are in strawberry?” and it gets it wrong, that’s because the representation it works with simply doesn’t contain the information you’re asking about. It would be like asking you to count the pixels in a word while looking at it from across the room.

Warning (Misconception: LLMs understand characters)

They don’t. They understand tokens. This is why they fail at character-level tasks (letter counting, anagram solving, precise spelling) even when they can handle much harder reasoning problems. The architecture doesn’t have access to character-level information.

Numbers are strange too. “123456789” becomes three tokens: “123”, “456”, “789.” The model doesn’t see this as one number; it sees three chunks, which is partly why LLMs struggle with precise arithmetic on large numbers.

When you’re working with LLMs and hitting character-level failures, it’s not a fixable bug but a fundamental representation limitation. The model would need to be retrained with character-level tokenization to fix it, which would make every other task worse because you’d need far more tokens to represent the same text.

Embeddings

Once text is split into tokens, each token gets turned into a vector (a list of numbers). This is the embedding.

Every token in the model’s vocabulary has a learned embedding vector. In GPT-3, these were 12,288-dimensional vectors (12,288 numbers per token); GPT-4’s exact size is undisclosed but likely comparable or larger. The model learned these during training: they started random and were adjusted billions of times until tokens with similar meanings ended up at similar points in this high-dimensional space.

Intuition (What's an embedding, intuitively?)

It’s a lookup table. Token ID 4523 maps to a specific list of 12,288 numbers. No computation, no neural network magic at this stage; just a table lookup. The magic is that training made the table entries meaningful, so similar tokens have similar vectors.

The embedding dimensions don’t have human-readable meanings (it’s not like dimension 47 means “how formal is this word”). The meaning comes from the relationships between vectors, not from individual dimensions.
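The lookup itself can be sketched directly (toy sizes, random table, made-up token IDs; in a real model the table is learned and closer to 100k rows by ~12,288 columns):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1_000, 16   # toy sizes for illustration

# The embedding "table": one row per token ID. Random here --
# training is what makes these rows meaningful.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = [452, 17, 991]             # hypothetical token IDs
vectors = embedding_table[token_ids]   # the whole step is just row indexing
print(vectors.shape)  # (3, 16): one vector per token
```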

Positional encoding: where are you in the sequence?

There’s a subtle problem here. A transformer processes all tokens at once (that’s what makes it fast, unlike RNNs which go one at a time), but that means it has no built-in sense of word order. “The cat sat on the mat” and “mat the on sat cat the” would look the same to a position-unaware transformer.

We fix this by adding positional encoding to each token’s embedding. The original transformer used sinusoidal functions:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Intuition (The clock analogy)

Think of a clock with many hands, each spinning at a different speed. The second hand moves fast, the minute hand moves slowly, the hour hand barely moves. At any given time, the combination of all hand positions is unique. Positional encoding works the same way: each position in the sequence gets a unique fingerprint made of sines and cosines at different frequencies, and the model learns to read this fingerprint to figure out where each token sits.
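The formulas above translate to a few lines of NumPy:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)
# Every position gets a distinct fingerprint:
print(len({tuple(row.round(6)) for row in pe}))  # 128
```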

Remark (What modern models actually use)

Most modern models don’t actually use the sinusoidal scheme from the original paper. They use either learned positional embeddings (just learn a vector for each position, like token embeddings) or RoPE (Rotary Position Embeddings), which encode relative position through rotation matrices applied to the query and key vectors in attention. RoPE is more elegant because it naturally handles relative distances between tokens rather than absolute positions, and it extends more gracefully to longer sequences.

RoPE’s core idea is that instead of adding position information, you rotate the embedding based on position. Tokens close together in the sequence have their vectors rotated similarly, so their dot product (which drives attention) is higher. Tokens far apart are rotated differently, making their relationship reflect the distance between them.
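A one-plane sketch of that rotation idea (real RoPE applies this to every pair of dimensions at many different frequencies; the vectors and `theta` below are arbitrary choices for illustration):

```python
import numpy as np

def rotate(vec, pos, theta=0.5):
    # Rotate a 2-D vector by pos * theta radians: position becomes rotation.
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]])

q = np.array([1.0, 0.0])  # a query vector
k = np.array([1.0, 0.0])  # a key vector

# The dot product depends only on the relative offset between positions:
near      = rotate(q, pos=10) @ rotate(k, pos=11)
far       = rotate(q, pos=10) @ rotate(k, pos=30)
also_near = rotate(q, pos=50) @ rotate(k, pos=51)

print(round(near, 6) == round(also_near, 6))  # True: same offset, same score
print(near > far)
```

Shifting both tokens by the same amount leaves their attention score unchanged, which is exactly the relative-position property the paragraph above describes.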

The practical upshot is that context window limits come from positional encoding. The model was trained with positions up to some maximum $N$, and it has never seen position $N+1$; it doesn’t know what that position means. This is why context windows have hard limits and why extending them is an active research problem. The model has no representation for, say, the 200,000th token if it was trained with a 128K context window.


At this point, you understand how an LLM gets from raw text to a sequence of vectors. Text goes to tokens (via BPE), then to embeddings (via lookup table), then to embeddings plus position (via positional encoding). The result is a sequence of vectors, one per token, that encode both what the token means and where it sits in the sequence.
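The whole pipeline fits in one short sketch (toy vocabulary and sizes; a real model swaps in a BPE tokenizer and a learned embedding table):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2}   # toy tokenizer vocabulary
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))

def positional_encoding(seq_len, d_model):
    # Sinusoidal scheme from the original transformer paper.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

text = "the cat sat"
token_ids = [vocab[w] for w in text.split()]          # 1. text -> tokens
x = embedding_table[token_ids]                        # 2. tokens -> embeddings
x = x + positional_encoding(len(token_ids), d_model)  # 3. add position info

print(x.shape)  # (3, 8): one vector per token, ready for the transformer
```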

This is the input to the transformer; next, we’ll look at what the transformer actually does with it.