Practical LLM Literacy
Everything from the previous four parts (attention, training, scaling) cashes in here. Every practical tip in this section has a mechanistic explanation rooted in the architecture, and that’s the whole point of this series: understanding why things work, not just that they work.
Prompting is programming
Your prompt is the model’s input, and the quality of that input determines the quality of the output. This follows directly from the architecture.
Important (The core insight)
The model doesn’t “understand” your intent. It predicts what text is most likely to follow your prompt. Writing good prompts means writing text that, when continued, naturally produces the output you want. If your prompt is vague, there are many plausible continuations and the model picks one at random. If your prompt is specific, there are fewer plausible continuations and they’re all closer to what you want.
Why prompting techniques work
Every common prompting technique has a mechanistic explanation rooted in how the architecture processes tokens.
Vague prompts produce flat probability distributions where many tokens are plausible continuations; specific prompts produce peaked distributions with fewer, better continuations.
Bad: “Write code.” Good: “Write a Python function that takes a list of integers and returns the second largest unique value. Raise ValueError if the list has fewer than 2 unique elements. Include type hints.”
The second prompt constrains the output space so tightly that there’s basically only one reasonable continuation: a correct implementation. The first prompt could go anywhere.
When you give examples in your prompt, the attention mechanism picks up the pattern and continues it. This is the same pattern-matching that makes language understanding work in the first place; the model attends to your examples to figure out the format and style you want.
Convert these to snake_case:

getUserName → get_user_name
setMaxRetries → set_max_retries
handleHTTPError →

The model sees the pattern (camelCase → snake_case) through attention and continues it. The more examples you give, the less ambiguity remains about the pattern.
Each generated token is another full forward pass through the model. When you ask for step-by-step reasoning, you’re giving the model more compute to work with. The intermediate tokens serve as external working memory (scratch paper the model can “read” in later tokens through attention), so chain-of-thought is really just a way of allocating more computation to the problem.
The model has seen millions of examples of specific formats (JSON, markdown, bullet points, tables) in its training data, and specifying the format activates those patterns. “Return the result as JSON” immediately constrains the output to valid JSON syntax, because the model has extremely strong priors about what valid JSON looks like.
The system prompt is just text at the beginning of the context. Attention patterns naturally give high weight to the beginning and end of the context (more on this below), so by placing instructions at the start, you ensure they get strong attention from every generated token. System prompts are effective at setting behavior because of positional attention bias.
Chain-of-thought in practice
Try this with any LLM:
Without CoT:
A farmer has 17 sheep. All but 9 die. How many sheep does the farmer have left?

Many models answer “8” (17 − 9 = 8), which is wrong; the answer is 9 (“all but 9 die”).
With CoT:
A farmer has 17 sheep. All but 9 die. How many sheep does the farmer have left?
Think through this step by step.

Now the model writes out: “All but 9 die means 9 survived. So the farmer has 9 sheep.” Correct.
Without CoT, the model sees a math-like word problem and its strongest pattern is “subtract the numbers.” With CoT, it generates tokens that spell out the meaning of “all but,” and those tokens give it the information it needs to avoid the trap. The reasoning tokens are doing real computational work.
The context window
Warning (Misconception: the model remembers your conversation)
It doesn’t. Every time you send a message, the entire conversation history is fed into the model from scratch. There’s no persistent memory between turns. The “memory” is just the conversation being re-read in full every time. This is why long conversations slow down (more tokens to process) and why models can “forget” earlier context in very long conversations (the context window fills up and old messages get truncated).
The context window is better understood as a whiteboard: everything the model can see during a single forward pass. Nothing persists across API calls. The “memory” features in ChatGPT and Claude are separate systems that summarize and inject past context, not the model actually remembering.
Context window sizes in 2026: 1M tokens (GPT-5.4, Claude Opus 4.6), 1M+ (Gemini 3), 10M claimed (Llama 4 Scout). These sound enormous, but attention is quadratic in sequence length: a 1M-token context involves roughly a million times as many token-pair comparisons in the attention layers as a 1K-token context.
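A back-of-envelope calculation makes the scaling concrete. This is a toy model of attention cost only (every token attends to every other token), ignoring the linear-cost parts of the network:

```python
# Attention compares every token with every other token, so the work
# grows with the square of the sequence length.

def attention_cost(num_tokens: int) -> int:
    """Number of pairwise token comparisons (proportional to compute)."""
    return num_tokens * num_tokens

short = attention_cost(1_000)      # 1K-token context
long = attention_cost(1_000_000)   # 1M-token context

print(long // short)  # → 1000000: a million times more pairwise work
```

A 1000x longer context means 1000x more tokens, each attending over 1000x more positions, hence the millionfold growth in the attention step.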
Tip (The 'lost in the middle' problem)
Research shows models attend most strongly to the beginning and end of the context. Information buried in the middle gets less attention (the attention weights are literally lower for middle positions), a consequence of how attention patterns form during training.
Practical rule: put your most important information at the top of the prompt. If you have a long document followed by a question, the question at the end gets strong attention, and the beginning of the document gets strong attention, but the middle of the document is at a disadvantage. If a key fact sits in paragraph 15 of 30, the model might miss it.
Example (Try this: buried instruction)
Paste a long article (3000+ words) into a prompt, then add a specific question about a fact from the middle of the article. Note the answer quality. Then move that fact to the first paragraph and ask again. The second answer is often better. You’re observing the attention gradient in action.
Temperature, top-p, and top-k
These control how the model samples from the next-token probability distribution.
Temperature scales the logits (pre-softmax scores) before applying softmax:
- T = 0 (or very close): greedy decoding. The model always picks the most likely token. Output is deterministic and repetitive.
- T = 0.5-0.7: moderate randomness. Good balance of coherence and variety.
- T = 1.0: the model’s native distribution. More creative, occasionally surprising.
- T > 1.5: high randomness. Output becomes increasingly incoherent.
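Temperature scaling is simple enough to sketch in a few lines. The token strings and logit values below are made up for illustration; a real model produces logits over its whole vocabulary:

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    """Scale logits by 1/T, apply softmax, then sample one token."""
    if temperature <= 0:
        # Greedy decoding: always pick the highest-logit token.
        return max(logits, key=logits.get)
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(l - m) for tok, l in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Toy logits for the next token after "The capital of France is".
logits = {" Paris": 5.0, " Lyon": 2.0, " pizza": 0.1}
print(sample_with_temperature(logits, 0.0))  # always " Paris"
print(sample_with_temperature(logits, 1.5))  # usually " Paris", sometimes not
```

Dividing by a small temperature stretches the gaps between logits, so softmax concentrates almost all probability on the top token; a large temperature squashes the gaps and flattens the distribution.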
Top-p (nucleus sampling): instead of sampling from the full vocabulary, only consider the smallest set of tokens whose cumulative probability exceeds p.
Top-p = 0.9 means: sort tokens by probability, keep adding tokens until their probabilities sum to 0.9, then sample from only those. This adapts to the model’s confidence; when the model is very sure (one token has 95% probability), top-p effectively picks just that token, and when the model is uncertain (many tokens with similar probability), top-p allows more variety.
Top-k is simpler: only consider the k most likely tokens and sample from those. Top-k = 40 means the model only considers the top 40 tokens, regardless of their probabilities.
This is simpler than top-p but less adaptive. If the model has one 99% token, you’re still sampling from 40 candidates (though the 99% one will almost always win). If the model has 100 equally-likely tokens, you’re cutting off at 40 for no principled reason.
Warning (Misconception: temperature = creativity)
Temperature is randomness, not creativity. High temperature doesn’t make the model more creative; it makes it more random. Sometimes randomness looks creative (unexpected word choices, novel combinations), but often it just produces nonsense. Genuine “creativity” from an LLM comes from patterns learned during training, not from cranking up the randomness dial.
Tip (Practical settings)
- Code generation / factual answers: temperature 0.0-0.3, top-p 0.95
- General conversation: temperature 0.5-0.7, top-p 0.95
- Creative writing / brainstorming: temperature 0.8-1.0, top-p 0.95
If outputs are repetitive or boring, raise the temperature; if they’re incoherent, lower it. It’s a knob on the randomness of sampling.
Hallucinations
The model is trained to produce plausible text, not truthful text, and these are not the same thing. Hallucinations aren’t a bug to be patched; they’re a fundamental property of next-token prediction.
Example (A real hallucination pattern)
Ask a model: “What is the parse_nested_json() function in Python’s json module?”
There is no such function. But the model might confidently describe it: “The parse_nested_json() function recursively parses nested JSON structures, handling circular references and custom decoders…”
This happens because:
- Python’s json module exists and has many functions
- “parse_nested_json” sounds like a real function name
- The training data is full of confident descriptions of Python functions
- The most likely continuation of “What is the parse_nested_json function” is a confident description, because that’s the format of every real function documentation entry the model has seen
The model generates the most plausible continuation. A plausible description of a real-sounding function is more likely than “this function doesn’t exist,” because the vast majority of “what is function X” queries in the training data are about functions that do exist.
Warning (The golden rule)
Never trust an LLM’s factual claims without verification. The architecture optimizes for plausibility, not truth; the loss function has no way to distinguish between the two. Use the model as a starting point, not an oracle.
Mitigation strategies:
- Retrieval-Augmented Generation (RAG): give the model the actual facts in its context, so it can reference real information instead of generating from memory
- Ask for sources: then check them. If the model cites a paper or URL, verify it exists
- Cross-check: for important facts, verify with a second source
- Use the model’s uncertainty: if you ask “are you sure?” and it hedges, that’s informative
RAG: Retrieval-Augmented Generation
RAG is the most common pattern for building production LLM applications, and it directly addresses the hallucination problem.
Definition (RAG)
Retrieval-Augmented Generation: instead of relying on the model’s training data for facts, you retrieve relevant documents from a database and include them in the prompt. The model then generates answers grounded in the retrieved context, not its own (possibly outdated or incorrect) memories.
The pattern:
- User asks a question
- You search a document database for relevant content (using embeddings similarity, keyword search, or both)
- You include the top results in the prompt: “Given the following documents, answer the question…”
- The model answers based on the retrieved documents
This works because the model is very good at pulling out and combining information from its context window. By giving it real, verified documents to work with, you get the model’s language and reasoning abilities without relying on its potentially-wrong training-time knowledge.
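The pattern above fits in a short sketch. A production system would use embedding similarity and a vector database; here a toy keyword-overlap score stands in for retrieval, and the documents and question are invented:

```python
documents = [
    "Refunds are available within 30 days of purchase.",
    "Premium accounts include priority support and a 99.9% uptime SLA.",
    "Passwords must be at least 12 characters long.",
]

def score(question: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question: str, docs: list[str], top_n: int = 2) -> list[str]:
    """Step 2: search the document store for relevant content."""
    return sorted(docs, key=lambda d: score(question, d), reverse=True)[:top_n]

def build_prompt(question: str) -> str:
    """Step 3: include the top results in the prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (
        "Given the following documents, answer the question.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}"
    )

# Step 4 would send this prompt to the model.
print(build_prompt("How many days do I have to get a refund of my purchase?"))
```

The refund document shares the most words with the question, so it ranks first in the context; the model then answers from retrieved text rather than from memory.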
Tip (When to use RAG)
Use RAG when:
- Your data changes frequently (documentation, support tickets, product catalogs)
- Accuracy matters and you need verifiable answers
- You have proprietary data the model wasn’t trained on
- You want to cite specific sources
Don’t use RAG when:
- The model’s training data is sufficient (general knowledge questions)
- Latency is critical and you can’t afford the retrieval step
- The task is creative or generative rather than information-seeking
Fine-tune vs. prompt vs. RAG
These are the three main approaches for customizing LLM behavior, and most people should default to prompting and only escalate when it falls short.
Prompting
When to use: Always start here. System prompts, few-shot examples, and careful instruction writing solve most problems.
Pros: No training cost. Instant iteration. Works with any model. No infrastructure.
Cons: Limited by context window. Can’t teach genuinely new knowledge. Inconsistent for complex behaviors.
Example: “You are a medical triage assistant. Given a patient’s symptoms, classify urgency as LOW, MEDIUM, or HIGH. Here are three examples…”
RAG
When to use: When the model needs access to specific, current, or proprietary information it wasn’t trained on.
Pros: Grounded in real data. Data can be updated without retraining. Verifiable answers.
Cons: Requires infrastructure (vector database, retrieval pipeline). Retrieval quality is a bottleneck. Adds latency.
Example: A customer support bot that searches your docs/knowledge base to answer questions about your specific product.
Fine-tuning
When to use: When you need a consistent behavioral pattern that can’t be achieved through prompting, or when you need to run at scale and want to avoid long prompts.
Pros: Consistent behavior without long system prompts. Can learn new formats and styles. Cheaper per query at high volume (shorter prompts).
Cons: Expensive to create (need training data, compute, expertise). Hard to iterate. Doesn’t add new factual knowledge reliably.
Example: Fine-tuning a model to output in a specific JSON schema consistently, or to match a particular brand’s writing voice across thousands of queries.
Tip (The decision flow)
- Try prompting first. System prompt + few-shot examples solve 90% of problems.
- If the model needs specific knowledge, add RAG. Don’t try to fine-tune facts into the model. Give it documents.
- If you need a consistent complex behavior at scale, consider fine-tuning. But only after you’ve maxed out what prompting can do.
Most people jump to fine-tuning too early. It’s expensive, hard to get right, and locks you to a specific model version.
Tool use and agents
The biggest shift in how LLMs are used is that they don’t just generate text anymore; they call functions, execute code, search the web, and operate other tools.
The pattern is simple:
- The model decides it needs to use a tool (a calculator, a web search, a code interpreter)
- It outputs a structured tool call (function name + arguments)
- The system executes the tool and returns the result
- The model reads the result and continues generating
This closes the loop on the model’s weaknesses. Can’t do math? Call a calculator. Don’t know current events? Search the web. Need to test code? Run it.
Tip (Understanding agents)
An “agent” is just an LLM in a loop: think, act (tool call), observe (read result), think again, act again, and so on until done.
Claude Code, Cursor, GitHub Copilot in agent mode, and Codex all follow this same pattern: the LLM handles reasoning and planning while the tools handle execution.
When you understand this, the capabilities of coding agents make sense. They’re a capable reasoner (the LLM) with access to the right tools (file reading, code execution, shell commands, web search) running in a loop until the task is done.
Prompt restructuring in practice
Take this prompt and compare two versions:
Version A:
I have a CSV file with columns: name, age, city, salary. There might be missing values. Some salaries might be negative, which is an error. I need to clean the data, remove errors, fill missing ages with the median, and output summary statistics grouped by city. Write a Python script to do this.

Version B:
Write a Python script to clean and analyze a CSV file.
Input: CSV with columns name, age, city, salary

Cleaning rules:
- Remove rows where salary < 0 (data error)
- Fill missing age values with the median age

Output: Summary statistics (mean, median, count) grouped by city

Use pandas.

Both prompts ask for the same thing, but Version B consistently produces better code. The structured format creates clear attention anchors; each line is a distinct instruction that the model can attend to independently. In Version A, the instructions are tangled in a prose paragraph, and the model has to parse them out (which it sometimes does imperfectly). Structured prompts don’t just look cleaner; they align with how attention processes information.
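For reference, a script along the lines of what Version B tends to elicit might look like the following. The choice to summarize the salary column is an assumption (the prompt doesn’t say which column to summarize), and the CSV path in the usage line is a placeholder:

```python
import pandas as pd

def clean_and_summarize(csv_source) -> pd.DataFrame:
    """Clean the CSV and return salary stats (mean, median, count) per city."""
    df = pd.read_csv(csv_source)

    # Cleaning rule 1: remove rows where salary < 0 (data error).
    df = df[df["salary"] >= 0].copy()

    # Cleaning rule 2: fill missing age values with the median age.
    df["age"] = df["age"].fillna(df["age"].median())

    # Output: summary statistics grouped by city.
    return df.groupby("city")["salary"].agg(["mean", "median", "count"])

# Usage: print(clean_and_summarize("data.csv"))
```

Notice how each cleaning rule in the prompt maps one-to-one onto a line of code, which is exactly why the structured version leaves the model less room to drop or mangle an instruction.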
You now have the practical toolkit. Every tip here connects back to the architecture: attention patterns explain why prompt structure matters, next-token prediction explains hallucinations, the quadratic cost explains context window behavior, and the training pipeline explains why RLHF’d models sometimes refuse reasonable requests.
One section left: who’s building what, what’s overrated, and where this is all going.