The model is not magic.
Every working professional building with Claude eventually hits the same wall: why is it doing that? Why is the answer slow when the question is short? Why does the same prompt cost different amounts on different days? Why did it make up a function that doesn't exist? This practice gives you the mental model that turns "magic" into "engineering."
Not a math course. Not a how-to-build-an-LLM course. The 90% of intuition that explains 100% of behavior you'll see in production. Three hours. A short, verifiable lab on every concept — you run it, you check the result, you move on.
Want the source material first? Practice 00 · First Principles of Modern AI — the ten papers behind everything on this page, each with an animated deep-dive diagram.
- Explain to a non-technical colleague what an LLM actually does in one minute, without lying
- Predict (within 2x) how many tokens your prompt + response will cost, before you send it
- Know when decoding knobs apply, and when newer models require prompting or effort settings instead
- Diagnose hallucinations — spot the three patterns before they ship
- Read an API-side trace or model log and know what every field means
The whole pipeline in one diagram.
Motto: the model takes tokens in, predicts one token out, and does it again until it stops.
Here's everything that happens when you hit send on a Claude prompt:
- Tokenize. Your text is split into tokens. English averages roughly 1.3 tokens per word (about 4 characters per token) — sometimes a token IS a word, sometimes part of one.
- Embed. Each token becomes a vector — a list of ~4,096 numbers that captures what it "means" to the model.
- Attend. Each token's vector is updated by looking at every other token's vector. The output: a refined vector that knows the context.
- Predict. The model produces a probability over every possible next token. Picks one (with randomness controlled by temperature).
- Loop. The picked token is appended to the input, and the model runs again from the embed step (the new token is already a token, so nothing is re-tokenized). Stop when the model emits a special end-token or hits your max-tokens cap.
That's it. Everything else — "agents", "memory", "tools", "thinking" — is built on top of this loop. The next nine units unpack each step, so by the end you'll see Claude's behavior as the natural consequence of this pipeline, not a mystery.
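The loop above can be sketched in a few lines. This is a toy under loud assumptions: `NEXT_TOKEN_PROBS` is a made-up lookup table standing in for the embed, attend, and predict stages, and `generate` and the `"end_token"` / `"max_tokens"` labels are illustrative names, not any real API.

```python
# Toy next-token loop. The "model" is a hypothetical lookup table, not a real LLM.
END = "<end>"

# Made-up table: for each last token, the probabilities of the next token.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, END: 0.1},
    "dog": {"sat": 0.7, END: 0.3},
    "sat": {END: 1.0},
}


def predict(tokens):
    """Stand-in for embed + attend + predict: returns next-token probabilities."""
    return NEXT_TOKEN_PROBS[tokens[-1]]


def generate(prompt_tokens, max_tokens):
    """Pick one token at a time until the end token or the max_tokens cap."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = predict(tokens)
        next_token = max(probs, key=probs.get)  # greedy pick, like temperature 0
        if next_token == END:
            return tokens, "end_token"
        tokens.append(next_token)  # committed: earlier tokens are never revised
    return tokens, "max_tokens"


tokens, stop_reason = generate(["the"], max_tokens=10)
print(tokens, stop_reason)

short_tokens, short_reason = generate(["the"], max_tokens=1)
print(short_tokens, short_reason)
```

The second call shows the other way a generation ends: the cap is hit before the toy model ever emits its end token.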
Recite the pipeline from memory.
- Cover this section. On paper or in a scratch note, write the five stages in order, one phrase each.
- Paste this into any Claude chat:
Here are the 5 stages of the LLM forward pass as I remember them: [paste your 5]. Grade each against the canonical pipeline (tokenize, embed, attend, predict, loop). For each, say CORRECT or what I got wrong, in one line.
- Fix any stage Claude marks wrong; re-recite until all five are CORRECT.
Stretch. Add a sixth line naming where "thinking" fits (answer: it’s extra tokens generated in the loop, step 5 — not a new stage).
Tokens (not words).
Motto: the model thinks in tokens, you write in words; the difference is where most surprises live.
A token is the unit the model reads and produces. It is NOT a word. Roughly:
- Common short English words ("the", "and", "of") are usually one token.
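- Longer or less common words split into pieces, for example "unconditionally" → ["un", "conditional", "ly"] (illustrative; the real split depends on the tokenizer).
- Numbers, punctuation, unusual whitespace, and emoji typically cost tokens of their own; a long number or a single emoji can take several.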
- Programming code is roughly 1.5x denser in tokens than prose — lots of punctuation.
- Non-English languages can be 2-5x denser than English for the same idea.
A trap to avoid: the model cannot reliably tokenize text for you. Ask Claude "split this into tokens" and it will guess — it has no introspective access to its own tokenizer, so its split and its count are often wrong. The only ground truth is the count_tokens API (or the usage field of a real call). That gap is exactly what the next lab measures.
Why this matters
Token count = cost + speed + context fit. You're not paying per character; you're paying per token. Code costs more than prose for the same intent. Long German compound words cost more than the English equivalent. This is the foundation for every cost decision you'll make — so it's worth seeing the real numbers, not the model's guesses.
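A back-of-envelope estimator built from the two rules of thumb in this unit (about 1.3 tokens per word, about 4 characters per token) might look like the sketch below. The function name `estimate_tokens` is hypothetical, the ratios only hold for ordinary English prose, and the result is an estimate to compare against a real count, not a substitute for one.

```python
def estimate_tokens(text):
    """Rough English-prose estimates from two rules of thumb."""
    by_words = len(text.split()) * 1.3  # about 1.3 tokens per word
    by_chars = len(text) / 4            # about 4 characters per token
    return round(by_words), round(by_chars)


sentence = "The quick brown fox jumps over the lazy dog near the riverbank at dawn."
by_words, by_chars = estimate_tokens(sentence)
print(f"words-based: ~{by_words} tokens, chars-based: ~{by_chars} tokens")
```

When the two estimates disagree sharply (code, long compound words, non-English text), that disagreement is itself a hint that the text is token-dense.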
Guess the token count, then get the truth.
- Pick one sentence (use this if you have none:
The quick brown fox jumps over the lazy dog near the riverbank at dawn.).
- Ask Claude to guess:
How many tokens is this sentence, in your tokenizer? Give one number, no tool use: "<your sentence>".
Write the number down and call it GUESS.
- Get ground truth. If you have an API key, run the ~6-line snippet below (prints TRUE). No key? Open the usage/token tools in the Claude console or send the sentence as a one-line API call and read response.usage.input_tokens.
- Compute DELTA = GUESS − TRUE.
Stretch. Repeat with the Python snippet def hello(name): return f"Hi {name.upper()}!" and a non-English sentence. Confirm TRUE-per-word rises for both (code ~1.5× denser; many languages 2–5×).
# count_tokens.py — pip install anthropic ; export ANTHROPIC_API_KEY=sk-ant-...
import anthropic
client = anthropic.Anthropic()
sentence = "The quick brown fox jumps over the lazy dog near the riverbank at dawn."
resp = client.messages.count_tokens(
    model="claude-opus-4-8",
    messages=[{"role": "user", "content": sentence}],
)
print("TRUE tokens:", resp.input_tokens)
Embeddings: words become vectors.
Motto: the model doesn't see "cat" — it sees a list of 4,096 numbers that capture "cat-ness."
Every token gets mapped to a fixed-size vector (a list of numbers, typically a few thousand long; 4,096 is a common size, and the exact width varies by model). Tokens that mean similar things end up with vectors that point in similar directions. The model never reads characters; it reads these vectors.
What this lets you predict
- Synonyms behave the same. "Big" and "large" have very similar embeddings, so prompts using one work like prompts using the other.
- Typos often survive. "Embedding" and "embeding" have similar embeddings; the model usually understands.
- Wrong-but-related terms confuse predictably. "Java" and "JavaScript" embed in similar regions; the model needs context to know which you mean.
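"Point in similar directions" is usually measured with cosine similarity. The sketch below uses made-up 4-number vectors in a hypothetical `TOY_EMBEDDINGS` table purely to show the arithmetic; real embeddings are far wider and are not readable from a chat.

```python
import math


def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; not real model embeddings.
TOY_EMBEDDINGS = {
    "big": [0.9, 0.1, 0.0, 0.1],
    "large": [0.85, 0.15, 0.05, 0.1],
    "Java": [0.1, 0.9, 0.3, 0.0],
    "JavaScript": [0.1, 0.8, 0.5, 0.1],
    "banana": [0.0, 0.05, 0.1, 0.95],
}

for left, right in (("big", "large"), ("Java", "JavaScript"), ("big", "banana")):
    score = cosine_similarity(TOY_EMBEDDINGS[left], TOY_EMBEDDINGS[right])
    print(f"{left} vs {right}: {score:.2f}")
```

In this toy table the near-miss pair (Java, JavaScript) scores almost as high as the true synonyms, which is the pattern the third bullet warns about.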
The prompt that demonstrates embeddings
Rate how similar these word pairs are in the model's internal representation, on a 0-10 scale. Explain your reasoning.
1. dog & puppy
2. dog & cat
3. dog & banana
4. teacher & instructor
5. teacher & janitor
6. teacher & equation
7. JavaScript & TypeScript
8. JavaScript & Java
9. JavaScript & jaywalking
Now tell me: which pair would cause the most prompt-rewording problems in production, and why?
The model can't read out its real embedding distances either (same introspection gap as tokens) — treat its 0–10 scores as predictions to test, not measurements. The point isn't the exact numbers; it's that you can predict which pairs behave alike before you run them.
Predict the similarity ranking, then check.
- Before running anything, write your own answer: of pairs 5 (teacher & janitor), 8 (JavaScript & Java), 9 (JavaScript & jaywalking), rank which would cause the most production confusion.
- Run the prompt above. Note Claude’s pick.
- Now test it for real: open two fresh chats. In one, write a prompt using "JavaScript"; in the other, swap in "Java" and change nothing else. Compare the answers.
Stretch. Try the true-synonym pair (big/large) the same way; confirm the output barely changes — that’s why synonym swaps are safe and near-miss swaps aren’t.
Attention in one paragraph.
Motto: every token looks at every other token and weighs which ones matter for its own meaning.
"Attention" is the famous trick. Here it is in one sentence: for each token, the model computes a weighted average of all the other tokens' vectors, where the weights are learned to highlight which tokens are relevant.
Concretely: when the model is processing the token "it" in "The dog ran because it was hungry", attention lets the "it" token "look at" the "dog" token strongly (and ignore "ran", "because", etc). The vector for "it" after attention is updated to include "dog-ness." Pronoun resolution = high attention weight from "it" to "dog".
This happens not once but in many layers, with each layer's attention learning different relationships. Early layers pick up syntax; later layers pick up meaning, intent, even style.
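The "weighted average" can be made concrete with a stripped-down version of scaled dot-product attention. This is a sketch under simplifying assumptions: the 2-number vectors are invented so that "it" lines up with "dog", and the same vectors serve as keys and values, whereas a real Transformer uses learned projections, many heads, and many layers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weighted average of values, weighted by how well each key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, mixed


# Invented vectors: "dog" is built to match the query for "it".
tokens = ["The", "dog", "ran", "it"]
vectors = [[0.0, 0.5], [4.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
it_query = [3.0, 0.0]

weights, updated_it = attend(it_query, vectors, vectors)
for token, weight in zip(tokens, weights):
    print(f"{token:>4}: {weight:.3f}")
```

Almost all of the weight lands on "dog", so the updated vector for "it" is dominated by the "dog" vector, which is the pronoun-resolution story told above.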
Grounded in: Attention Is All You Need (Vaswani et al., 2017). Plain-English takeaway: every token can look at every other token and learn what matters. That Transformer idea is the architecture behind Claude, GPT, Gemini, and most modern LLMs.
What this lets you predict
- Order can matter a lot. Putting instructions at the end vs the start changes which tokens get attended to in the final layer.
- Long context dilutes attention. A 100k-token document drowns the few tokens that actually matter. This is why retrieval beats stuffing.
- Markdown structure helps. Headings give the model strong "section boundary" tokens to anchor attention on.
Make attention fail, then fix it.
- In a fresh chat, paste a short instruction buried in clutter: about 8 lines of filler text, then on one middle line "Reply only in French.", then 8 more filler lines, then your question "What is the capital of Japan?"
- Note whether the reply is in French.
- New chat. Move the instruction to the very top AND repeat it at the very bottom, same question. Note the language now.
Stretch. Replace the filler with Markdown headings (## Context, ## Instruction, ## Question) and confirm the structured version is obeyed even with the instruction in the middle.
The next-token-prediction loop.
Motto: the model only ever picks one token at a time; everything else is a loop.
Here is the surprising thing: the model does not "write a response." It picks one token, appends it to the input, and runs the model again. A 500-word response is roughly 650 separate forward passes (500 words × ~1.3 tokens per word), each one picking a token based on everything that came before.
Consequences:
- Once a token is picked, it's committed. The model can't go back and revise an earlier word.
- Streaming is real. The reason you see tokens stream out is because they ARE generated one at a time; the API just sends them as they're produced.
- The "think step-by-step" trick works because each thinking token is added to context, which the next prediction can attend to. Writing reasoning out loud literally improves the next answer.
- Stop tokens end the generation. The model emits a special end-of-message token when it decides it's done.
Prove reasoning tokens change the answer.
- New chat. Ask for the answer only:
A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. How much is the ball? Reply with only the dollar amount, no explanation.
- New chat. Same problem, but:
...Think step by step, then give the answer.
- Compare. (The answer is $0.05; the no-reasoning version often says $0.10.)
Stretch. Force the failure: Answer in one word, immediately, no reasoning: a 3-step word problem of your own. The terser you force it, the more the loop has no room to "think out loud."
Temperature, top-p, top-k.
Motto: at each step the model has a probability over every possible next token; these three knobs shape how it picks.
After the model computes probabilities for every next-token candidate, it has to pick one. The picking is controlled by:
- Temperature (0 to 1 on the Claude API; some other APIs allow up to ~2). Reshapes the probability distribution. 0 = always pick the highest-probability token (near-deterministic in practice). 1 = use the raw probabilities. Higher = flatter, more random.
- Top-p (0 to 1, default ~0.9). Only consider the smallest set of tokens whose probabilities sum to p. Filters out the unlikely tail.
- Top-k (an integer, often unset). Only consider the top k most likely tokens. Hard cutoff.
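The three knobs can be sketched on a tiny four-token distribution. This is one plausible ordering (temperature, then top-k, then top-p) chosen for illustration; real inference stacks differ in details, and every function name here is hypothetical.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out everything not in keep, then rescale to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Hard cutoff: keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose probabilities sum to at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the picked token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    probs = apply_temperature(logits, temperature)
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for four candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Printing the three rows shows the reshaping directly: the top token's share shrinks as temperature rises, which is what "flatter, more random" means.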
Current Claude caveat: these are the classic decoding knobs, but not every current model exposes them. Claude Opus 4.8 (and Opus 4.7) reject non-default temperature, top_p, and top_k — sending one returns a 400 error. For those models you steer with prompting plus adaptive thinking / effort. Models like Claude Sonnet 4.6 and Haiku 4.5 still accept the sampling knobs, so the API half of the lab below targets one of those.
Two ways to feel sampling at work
You can’t set temperature on Opus 4.8 in the chat UI — but the model still samples, so regenerating reveals its built-in variance. The optional API snippet pins temperature explicitly on a model that exposes it.
Same prompt, different outputs — measure the spread.
- In a Claude chat, send:
Write a one-sentence opening for a short story about a librarian who discovers a hidden room. Make it interesting.
- Regenerate (or re-send in a fresh chat) two more times. Save all three openings.
- Write one sentence: how different were they (near-identical, same idea / different wording, or totally different)?
- Optional, if you have an API key: run the snippet below. It calls a sampling-capable model at temperature=0 three times, then at temperature=1 three times. Note which set is identical and which varies. (Expect the temperature=0 trio to be identical or near-identical and the temperature=1 trio to differ; that contrast IS the knob.)
Stretch. Re-run the chat prompt but demand a canonical answer instead: Extract the city name as JSON: {"city": ...} from "I flew into Denver yesterday." three times. It won’t vary; low-entropy tasks barely move regardless of sampling.
# sampling.py — uses a model that still accepts temperature (Opus 4.8 rejects it)
import anthropic
client = anthropic.Anthropic()
prompt = "Write a one-sentence opening for a short story about a librarian who discovers a hidden room."
for temp in (0.0, 1.0):
    print(f"\n=== temperature={temp} ===")
    for _ in range(3):
        msg = client.messages.create(
            model="claude-sonnet-4-6",  # sampling-capable; or claude-haiku-4-5
            max_tokens=80,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}],
        )
        print("-", msg.content[0].text.strip())
The rule of thumb: where sampling knobs are supported, use low temperature for canonical answers (extraction, classification, code), medium for one good answer with some variation (writing, creative tasks), and high only if you want chaos. Where knobs are not supported (Opus 4.8), ask for the behavior directly in the prompt.
Context window economics.
Motto: context is finite, expensive, and the most-attended tokens are at the start and the end.
Every model has a maximum number of tokens it can hold in context at once (200K, 1M, etc). Three things to know:
- Bigger windows cost more — linearly on your bill. What you’re billed is tokens × rate, so doubling the input tokens roughly doubles the input cost. (Separately, the attention compute itself grows quadratically — O(n²) in token count — which is why latency climbs faster than price, and why providers cap context. Don’t conflate the two: billed cost is linear, attention compute is quadratic.)
- Beginning + end get the most attention. Tokens in the middle of a long context tend to be under-weighted. This is empirically observed across most LLMs ("lost in the middle"). Put your important instruction first AND repeat it last.
- The cache exists. Modern APIs (including Claude) let you cache stable prefixes of your prompt, so re-sending the same system prompt costs ~10% of original. This is the single biggest production cost lever.
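The linear-versus-quadratic distinction in the first bullet is easy to check with arithmetic. The sketch below assumes the approximate $3 per 1M input-token rate this practice uses for Sonnet 4.6, and it counts token pairs as a crude stand-in for attention compute; both function names are hypothetical.

```python
def billed_input_cost(tokens, rate_per_million=3.0):
    """Dollars billed for input tokens: linear in token count (assumed rate)."""
    return tokens / 1_000_000 * rate_per_million


def attention_pairs(tokens):
    """Crude proxy for attention compute: every token compared with every token."""
    return tokens * tokens


for n in (10_000, 20_000, 40_000):
    print(f"{n:>6} tokens: ${billed_input_cost(n):.2f} billed, {attention_pairs(n):,} pairs")
```

Each doubling of the context doubles the bill but quadruples the pair count, which is why latency climbs faster than price.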
A real "lost in the middle" demo — with a real haystack
We ship a ~2,900-word document with one fact planted in the middle. Download it, paste it into Claude, and ask only the question that fact answers. Grade against the key (don’t peek at the key first — it spoils the test).
- haystack.txt — the document (planted fact lives ~58% of the way through). Right-click → Save As, or open and copy all.
- haystack_key.txt — the answer key. Open this after you record your result.
Find the needle — middle vs edges.
- Open haystack.txt and copy the whole thing.
- In a fresh chat, paste it, then on a new line ask:
What does the Meridian contract renewal cost per quarter?
Record the answer (MIDDLE result): a number, a refusal, or "doesn’t say."
- New chat. This time paste the planted fact sentence as the first line, then the full document, then the same question. (You’ll know the sentence after step 2 if that run hit; if it missed, search the document for the Meridian renewal line.) Record the answer (EARLY result).
- Open haystack_key.txt. The correct answer is $4,250 per quarter. Mark MIDDLE and EARLY each as HIT or MISS.
Stretch. Add a LATE run (fact pasted as the last line before the question) for a three-point position curve, and note that "first AND last" beats either edge alone.
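To repeat the position experiment on your own documents, a small helper can plant a fact at a chosen depth. This is a sketch: `plant_needle` is a hypothetical name, the filler is synthetic, and the needle wording is invented for illustration (it is not the actual sentence in haystack.txt).

```python
def plant_needle(filler_sentences, needle, position):
    """Insert needle at a fractional depth: 0.0 = start, 1.0 = end."""
    index = round(len(filler_sentences) * position)
    return filler_sentences[:index] + [needle] + filler_sentences[index:]


filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The Meridian contract renewal costs $4,250 per quarter."  # invented wording

for label, position in (("EARLY", 0.0), ("MIDDLE", 0.58), ("LATE", 1.0)):
    haystack = plant_needle(filler, needle, position)
    print(label, "needle at sentence", haystack.index(needle), "of", len(haystack))
```

Join each haystack with spaces, paste it with the same question, and you get the three-point position curve from the stretch task.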
Why prompts cost what they cost.
Motto: cost = input tokens × input rate + output tokens × output rate; output is usually 5x more expensive per token than input.
The cost of a Claude API call is:
total_cost = (input_tokens × input_rate)
           + (output_tokens × output_rate)
           + (cached_input_tokens × cached_rate)   [when prompt caching is used]
The output rate is typically ~5x the input rate. This means:
- Long responses are more expensive than long prompts. A 4000-token response costs more than a 4000-token system prompt.
- Cache is your friend. If your system prompt is 4000 tokens and you call the API 100 times with the same prompt + different user inputs, caching saves you ~90% on the system-prompt cost.
- Asking for shorter responses cuts cost AND usually improves quality. "Answer in ≤3 sentences" is a cost knob, not just a style preference.
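A worked version of the three bullets, as a sketch: the rates are the approximate Sonnet 4.6 figures this practice uses ($3 input, $15 output, cached input at 10% of input, all per 1M tokens), the 200-token user message and 500-token reply are invented, and the small surcharge some providers apply when first writing the cache is ignored.

```python
# Assumed rates in dollars per 1M tokens; check the current pricing page.
INPUT_RATE = 3.0
OUTPUT_RATE = 15.0
CACHED_RATE = 0.3  # 10% of the input rate


def call_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Dollar cost of one call; cached tokens are billed at the cached rate."""
    total = (
        input_tokens * INPUT_RATE
        + output_tokens * OUTPUT_RATE
        + cached_input_tokens * CACHED_RATE
    )
    return total / 1_000_000


# 4000-token system prompt, 200-token user message, 500-token reply, 100 calls.
uncached = 100 * call_cost(input_tokens=4000 + 200, output_tokens=500)
cached = 100 * call_cost(input_tokens=200, output_tokens=500, cached_input_tokens=4000)

print(f"100 calls without caching: ${uncached:.2f}")
print(f"100 calls with caching:    ${cached:.2f}")
print(f"4000-token reply:  ${call_cost(0, 4000):.3f}")
print(f"4000-token prompt: ${call_cost(4000, 0):.3f}")
```

The system-prompt share of each call drops by 90%, but the total drops by less, because output tokens are untouched by caching and are the pricier side.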
The cost-estimation prompt
# Estimate before you build
Estimate the API cost for the following Claude API call. Assume Claude Sonnet 4.6 pricing (~$3 per 1M input tokens, ~$15 per 1M output tokens; cached input is 10% of normal input). Check the current pricing page before using this for a real budget.
System prompt: [paste yours, or use ~2000 tokens of made-up rules]
User message: [paste yours, or describe a typical user request]
Expected response length: [your estimate, e.g. 500 tokens]
Calls per day: [your estimate, e.g. 100]
% of calls reusing the same system prompt: [e.g. 95%]
Give me:
1. Cost per call without caching
2. Cost per call with caching
3. Cost per day at the volume above
4. Where the biggest savings would come from
Estimate your call’s cost, then check it against reality.
- Pick a typical request you send Claude. Estimate input tokens (words × 1.3 from Unit 02) and expected output tokens.
- Do the math by hand at Sonnet 4.6 rates:
est = input/1e6 × $3 + output/1e6 × $15
Write down EST.
- Get the truth. API: run the snippet below and read usage.input_tokens + usage.output_tokens, then plug into the same formula for ACTUAL. No key? Send the request in chat and ask "Roughly how many input and output tokens was that exchange?" as a rough stand-in (remember from Unit 02 that the model's own count is only a guess), or use the count_tokens snippet from Unit 02 for the input side.
- Compute the ratio EST / ACTUAL.
Stretch. Send the same system prompt twice and read usage.cache_read_input_tokens on the second call — watch the cached portion cost ~10% and your per-call estimate drop.
# cost_check.py — read the real usage field after a call
import anthropic
client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-sonnet-4-6",  # $3 / $15 per 1M tokens
    max_tokens=500,
    messages=[{"role": "user", "content": "<paste your typical request>"}],
)
u = msg.usage
cost = u.input_tokens/1e6 * 3 + u.output_tokens/1e6 * 15
print(f"in={u.input_tokens} out={u.output_tokens} ACTUAL=${cost:.5f}")
Why models hallucinate.
Motto: the model isn't lying — it's predicting the most likely next token, and "I don't know" is rarely the most likely next token.
Hallucinations have a structural cause. The model is trained on text where humans confidently state things. So when you ask a question, the most likely continuation pattern is "a confident answer," not "an admission of ignorance." Three failure modes:
- Fact hallucination — the model invents a citation, function name, person, or statistic that sounds right. Cause: the surrounding pattern is "[author/function/year]" and the model fills the slot with a plausible value.
- API hallucination — the model invents a method on a library that doesn't exist. Cause: the library has many real methods that resemble the made-up one, so the made-up one is locally probable.
- Reasoning hallucination — the model produces a logically-valid-looking chain that's actually wrong. Cause: each step is locally probable; the global validity isn't checked.
The prevention pattern
When answering this question, follow these rules:
1. If you don't know something with high confidence, say so explicitly. Use the format: "I'm not sure about X — please verify against [authoritative source]."
2. For any factual claim (name, number, date, function name), tell me your confidence: HIGH (would bet money), MEDIUM (probably right), LOW (could be wrong).
3. For any external reference (paper, function, library, API), give me a way to verify it (URL, command to run, doc to check).
4. If I ask for code, tell me which exact library version you're targeting and how I can confirm the function/method exists.
Now answer: [your real question]
Why this works: the rules become part of the context the model attends to when picking each next token. Asking for confidence labels forces the model to allocate probability across "high/medium/low" tokens rather than just confident assertion.
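Rule 4 asks how to confirm that a function exists. For Python, one cheap check is to try resolving the name in your own environment before trusting it. The helper below is a sketch: `symbol_exists` is a hypothetical name, it only proves the symbol is present in the installed version, and it imports the module, so run it only on packages you already trust.

```python
import importlib
import importlib.util


def symbol_exists(module_name, attribute_path):
    """True only if the module is installed and the dotted attribute resolves."""
    try:
        if importlib.util.find_spec(module_name) is None:
            return False
        obj = importlib.import_module(module_name)
    except (ImportError, ValueError):
        return False
    for part in attribute_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


print(symbol_exists("json", "dumps"))           # real module, real function
print(symbol_exists("json", "calibrate"))       # real module, made-up function
print(symbol_exists("no_such_package_xyz_123", "calibrate"))  # made-up module
```

A made-up method on a real library fails the second check, which is exactly the API-hallucination pattern described in this unit.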
Catch a hallucination, then defuse it.
- Bait it. In a fresh chat ask something obscure and citation-shaped:
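What is the exact API signature of the function `flux_capacitor.calibrate()` in the `delorean` Python package, and which version added it?
(The function is invented. A real date-time library named delorean does exist on PyPI, but it has no flux_capacitor module, so any signature Claude gives is fabricated.)
- Note whether Claude invents a signature/version, or says it can’t find one.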
- New chat. Paste the prevention prompt above, then ask the same question as the "real question." Note the confidence label and verification step it now gives.
Stretch. Classify which of the three failure modes (fact / API / reasoning) your bait triggered, and write the one-line cure from the unit next to it.
What you can debug from outputs alone.
Motto: three observations explain 80% of "why did it do that."
Most "the model is broken" reports turn out to be one of these:
| What you observe | Most likely cause | The fix |
|---|---|---|
| Answer drifts mid-response | Sampling too loose or prompt under-specified | Lower temperature where supported; otherwise tighten format, examples, and success criteria |
| Same prompt, different answers each call | Model sampling + inherent non-determinism | Use deterministic params where supported; otherwise constrain output and test across repeated calls |
| Long-context query gets wrong answer | "Lost in the middle" | Move key facts to start/end; consider retrieval instead |
| Made-up function name | Local-probability hallucination | Add "verify against package docs" rule; pin library version |
| Cost spiked overnight | Cache miss (system prompt changed) or longer responses | Diff your system prompt; check max_tokens cap |
| First call slow, subsequent fast | Cache warm-up (normal behavior) | Pre-warm cache in deploy script |
| Output stops mid-sentence | Hit max_tokens limit | Increase max_tokens OR ask for terser format |
| Refuses a benign request | Safety training, locally tripped by phrasing | Rephrase; explain intent in the system prompt |
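Two rows of the table can be checked mechanically from response metadata rather than by eye. The sketch below assumes the response reports a `stop_reason` (with `"max_tokens"` meaning the cap was hit) and a `cache_read_input_tokens` count, the fields this practice reads elsewhere; `diagnose` and its messages are hypothetical.

```python
def diagnose(stop_reason, cache_read_input_tokens=0, expect_cache_hit=False):
    """Map response metadata to the matching rows of the symptom table."""
    findings = []
    if stop_reason == "max_tokens":
        findings.append("Output was cut off: raise max_tokens or ask for a terser format.")
    if expect_cache_hit and cache_read_input_tokens == 0:
        findings.append("Cache miss: diff the system prompt against the last version.")
    if not findings:
        findings.append("No metadata red flag: check prompt structure and context placement.")
    return findings


for line in diagnose("max_tokens", cache_read_input_tokens=0, expect_cache_hit=True):
    print("-", line)
```

Logging these two fields on every call turns "output stops mid-sentence" and "cost spiked overnight" into things you can alert on.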
Diagnose three symptoms from the table.
- Cover the table. For each of these three symptoms, write the likely cause and the fix from memory: (a) "the output stopped mid-sentence," (b) "the same prompt gives different answers every call," (c) "cost spiked overnight."
- Uncover and check your three answers against the rows.
- Add one symptom you’ve actually hit with Claude and write its row (observe → cause → fix).
Stretch. For your real symptom, actually apply the fix in a chat or call and confirm the behavior changes — that closes the loop from diagnosis to repair.
One page on your desk.
The mental model from this practice on one page — with blanks you fill from your own numbers as you go. Print it; keep it next to your monitor for the first month after you start building seriously with Claude. The facts are corrected; the fill-ins make it yours.
Copy it, then fill each __ from the matching unit’s lab as you complete it. By the end you have a one-page model in your own measurements, not a generic copy.
# What Claude is actually doing: my one-page model

## The pipeline
Input text → tokens (~1.3 per word, ~4 chars each) → embeddings (4,096-dim vectors) → attention layers (each token reweighted by every other token) → probability over next token → pick one (per model sampling rules) → append → repeat → stop on end-token or max_tokens
[U01] Pipeline stages I recited cold, Claude marked CORRECT: __ /5
[U02] My test sentence: GUESS __ tokens, TRUE __ tokens, DELTA __
[U03] Java/JavaScript swap changed my answer via: __ (language / libraries / syntax)
[U04] Reply-in-French obeyed, buried: __ / first+last: __ (yes/no)
[U05] Bat-and-ball, terse: $__ / step-by-step: $__ (correct = 0.05)

## The knobs that change behavior
- temperature/top_p/top_k: sampling knobs where supported; Opus 4.8 (and 4.7) REJECT non-default values (400 error)
- adaptive thinking / effort: use these instead of sampling knobs on Opus 4.8
- max_tokens: hard cap on response length
- cache: re-using same system prompt cuts cost ~90%
[U06] Regenerating my creative prompt 3x varied: __ (none / wording / fully)

## The economics
- output tokens cost ~5x input tokens
- BILLED cost grows LINEARLY with tokens (tokens × rate); attention COMPUTE grows quadratically, O(n²); don't conflate them
- shorter responses = cheaper AND usually better quality
- cache stable prefixes; the savings are real
[U08] My typical call ≈ __ in + __ out tokens ≈ $__ /call ≈ $__ /day

## Why hallucinations happen
- the model picks the most LOCALLY probable next token
- "I don't know" is rarely the most probable continuation
- fact / API / reasoning hallucinations have different cures
- fix by forcing confidence labels and verification rules in the prompt
[U09] The fake-package bait gave me: __ (confident fake / honest 'unknown')

## Long-context gotchas
- "lost in the middle": model under-weights tokens in the middle
- put critical instructions first AND last
- prefer retrieval over context-stuffing at scale
[U07] haystack.txt recall, MIDDLE: __ (hit/miss), EARLY: __ (hit/miss)

## When something looks wrong, ask in order:
1. Are sampling settings too loose, or unsupported for this model?
2. Did I exceed max_tokens?
3. Is the important info buried in the middle of context?
4. Did the model invent a fact that sounds right? (check 1-2 with web)
5. Did the system prompt change (cache miss)?
6. Is this a safety refusal? (rephrase intent)
[U10] A real symptom I diagnosed from the table → cause: __

The model is not magic. The model is a token predictor in a loop. Everything else is engineering on top of that.
Fill the cheatsheet with your numbers.
- Copy the cheatsheet above into a note.
- Fill the ten [U..] __ lines from your lab results: U01 (stages /5), U02 (guess/true/delta), U03 (swap effect), U04 (French buried vs first+last), U05 (bat-and-ball answers), U06 (variance), U07 (haystack hit/miss), U08 (your call’s tokens + $/day), U09 (bait result), U10 (a real symptom).
- Keep it where you build. When a number changes (new model, new typical prompt), update the line.
You are done when no __ blanks remain on the ten fill-in lines, and the U08 line shows a real dollar-per-day figure within 2× of what U08’s lab measured.
Stretch. Pin it next to your monitor or in your team wiki and re-derive the U08 cost line after your next billing day; if your estimate held within 2×, the mental model is working.
This page replaces "the model is acting weird" with a checklist. Past the first week of building seriously, you'll diagnose 80% of issues using the table in Unit 10 + this cheatsheet, before you ever open the docs.