Pradhya Practice 17 · Under the Hood Beginner → Pro

The model is not magic.

Every working professional building with Claude eventually hits the same wall: why is it doing that? Why is the answer slow when the question is short? Why does the same prompt cost different amounts on different days? Why did it make up a function that doesn't exist? This practice gives you the mental model that turns "magic" into "engineering."

Not a math course. Not a how-to-build-an-LLM course. The 90% of intuition that explains 100% of behavior you'll see in production. Three hours. A short, verifiable lab on every concept — you run it, you check the result, you move on.

Want the source material first? Practice 00 · First Principles of Modern AI — the ten papers behind everything on this page, each with an animated deep-dive diagram.

For whom

Anyone who's used Claude for a month and wants to stop guessing

Length

2 sessions · ~90 min each

You'll walk away with

The mental model + a personal "what the model is actually doing" cheatsheet

Prereq

None — this is the practice that everything else assumes

What you’ll be able to do by the end

Explain to a non-technical colleague what an LLM actually does in one minute, without lying
Predict (within 2x) how many tokens your prompt + response will cost, before you send it
Know when decoding knobs apply, and when newer models require prompting or effort settings instead
Diagnose hallucinations — spot the three patterns before they ship
Read an API-side trace or model log and know what every field means

§ 17.01.01 · Unit 01

The whole pipeline in one diagram.

Motto: the model takes tokens in, predicts one token out, and does it again until it stops.

Here's everything that happens when you hit send on a Claude prompt:

Tokenize. Your text is split into tokens. English averages roughly 1.3 tokens per word (about 4 characters per token) — sometimes a token IS a word, sometimes part of one.
Embed. Each token becomes a vector: a list of a few thousand numbers (4,096 is a common size; the exact figure varies by model) that captures what the token "means" to the model.
Attend. Each token's vector is updated by looking at every other token's vector. The output: a refined vector that knows the context.
Predict. The model produces a probability over every possible next token. Picks one (with randomness controlled by temperature).
Loop. The picked token is appended to the input. Go back to step 1. Stop when the model emits a special end-token or hits your max-tokens cap.

That's it. Everything else — "agents", "memory", "tools", "thinking" — is built on top of this loop. The next nine units unpack each step, so by the end you'll see Claude's behavior as the natural consequence of this pipeline, not a mystery.
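The five stages can be sketched as a toy program. This is a minimal illustration, not how Claude is implemented: the six-word vocabulary, the `NEXT_LOGITS` score table, and the function names are all hypothetical, and the embed and attend stages are collapsed into a lookup that only sees the previous token. What it does show faithfully is the shape of the loop: predict a distribution, pick one token, append it, and stop on an end token or the cap.

```python
import math
import random

# Toy vocabulary (hypothetical). Real tokenizers have tens of thousands of entries.
VOCAB = {"<end>", "the", "cat", "sat", "on", "mat"}

# Hypothetical stand-in for the embed + attend stages: hand-written scores
# (logits) for the next token, keyed only by the previous token.
# A real model computes these scores from the whole context.
NEXT_LOGITS = {
    "the": {"cat": 2.0, "mat": 1.5},
    "cat": {"sat": 3.0},
    "sat": {"on": 3.0},
    "on": {"the": 3.0},
    "mat": {"<end>": 3.0},
}


def tokenize(text):
    """Stage 1 (toy version): whitespace split, keeping only known tokens."""
    return [tok for tok in text.lower().split() if tok in VOCAB]


def predict(context):
    """Stages 2-4 collapsed: return a probability for each candidate next token."""
    last = context[-1] if context else None
    logits = NEXT_LOGITS.get(last, {"<end>": 0.0})
    total = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / total for tok, v in logits.items()}


def generate(prompt, max_tokens=8, seed=0):
    """Stage 5: pick one token, append it, repeat until <end> or the cap."""
    rng = random.Random(seed)
    context = tokenize(prompt)
    output = []
    for _ in range(max_tokens):
        probs = predict(context)
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<end>":
            return output, "end_token"
        context.append(token)  # the picked token becomes part of the input
        output.append(token)
    return output, "max_tokens"


if __name__ == "__main__":
    print(generate("the cat"))
    print(generate("the cat", max_tokens=2))
```

The second return value mirrors the two ways a real generation ends: the model emits its end token, or it runs into your max-tokens cap.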

Recite the pipeline from memory.

You’ll do

Prove you own the five stages without looking, then have Claude grade you.

Steps

Cover this section. On paper or in a scratch note, write the five stages in order, one phrase each.
Paste this into any Claude chat: Here are the 5 stages of the LLM forward pass as I remember them: [paste your 5]. Grade each against the canonical pipeline (tokenize, embed, attend, predict, loop). For each, say CORRECT or what I got wrong, in one line.
Fix any stage Claude marks wrong; re-recite until all five are CORRECT.

Verify

You have five lines, in order, and Claude marks all five CORRECT — no peeking at the list above.

Stretch. Add a sixth line naming where "thinking" fits (answer: it’s extra tokens generated in the loop, step 5 — not a new stage).

§ 17.01.02 · Unit 02

Tokens (not words).

Motto: the model thinks in tokens, you write in words; the difference is where most surprises live.

A token is the unit the model reads and produces. It is NOT a word. Roughly:

Common short English words ("the", "and", "of") are usually one token.
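Longer or less common words split into pieces, for example "unconditionally" → ["un", "conditional", "ly"] (illustrative; the exact split depends on the tokenizer).
Numbers, punctuation, whitespace, and emoji rarely map one-to-one: a long number is usually split into several tokens, a leading space is typically folded into the word that follows it, and a single emoji can take several tokens.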
Programming code is roughly 1.5x denser in tokens than prose — lots of punctuation.
Non-English languages can be 2-5x denser than English for the same idea.

A trap to avoid: the model cannot reliably tokenize text for you. Ask Claude "split this into tokens" and it will guess — it has no introspective access to its own tokenizer, so its split and its count are often wrong. The only ground truth is the count_tokens API (or the usage field of a real call). That gap is exactly what the next lab measures.

Why this matters

Token count = cost + speed + context fit. You're not paying per character; you're paying per token. Code costs more than prose for the same intent. Long German compound words cost more than the English equivalent. This is the foundation for every cost decision you'll make — so it's worth seeing the real numbers, not the model's guesses.
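The two rules of thumb from this unit (about 1.3 tokens per word, about 4 characters per token) are easy to turn into a pre-flight estimate. The sketch below is only that: `estimate_tokens` is a hypothetical helper, it assumes English prose, and it knows nothing about the real tokenizer, so treat its output as a guess to be checked against the API.

```python
def estimate_tokens(text):
    """Return two rough token estimates for English prose."""
    words = len(text.split())
    chars = len(text)
    return {
        "words": words,
        "by_words": round(words * 1.3),  # rule of thumb: ~1.3 tokens per word
        "by_chars": round(chars / 4),    # rule of thumb: ~4 characters per token
    }


if __name__ == "__main__":
    sentence = "The quick brown fox jumps over the lazy dog near the riverbank at dawn."
    print(estimate_tokens(sentence))
```

When the two estimates disagree sharply (code, URLs, non-English text), that disagreement is itself a hint that the text is denser in tokens than ordinary prose.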

Guess the token count, then get the truth.

You’ll do

Pin down how far Claude’s self-reported token count drifts from the real number.

Steps

Pick one sentence (use this if you have none: The quick brown fox jumps over the lazy dog near the riverbank at dawn.).
Ask Claude to guess: How many tokens is this sentence, in your tokenizer? Give one number, no tool use: "<your sentence>". Write the number down — call it GUESS.
Get ground truth. If you have an API key, run the short count_tokens.py snippet at the end of this lab (prints TRUE), or send the sentence as a one-message API call and read response.usage.input_tokens. No key? Open the usage/token tools in the Claude console.
Compute DELTA = GUESS − TRUE.

Verify

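You hold three numbers written down: GUESS, TRUE, and DELTA. For the 14-word sample sentence the rule of thumb predicts about 18 tokens (14 × 1.3 ≈ 18). The API count also includes a few tokens of message formatting, so TRUE can land a little higher. DELTA is rarely zero, and that nonzero gap is the lesson.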

Stretch. Repeat with the Python snippet def hello(name): return f"Hi {name.upper()}!" and a non-English sentence. Confirm TRUE-per-word rises for both (code ~1.5× denser; many languages 2–5×).

# count_tokens.py  —  pip install anthropic ; export ANTHROPIC_API_KEY=sk-ant-...
import anthropic

client = anthropic.Anthropic()
sentence = "The quick brown fox jumps over the lazy dog near the riverbank at dawn."
resp = client.messages.count_tokens(
    model="claude-opus-4-8",
    messages=[{"role": "user", "content": sentence}],
)
print("TRUE tokens:", resp.input_tokens)

§ 17.01.03 · Unit 03

Embeddings: words become vectors.

Motto: the model doesn't see "cat"; it sees a list of thousands of numbers that capture "cat-ness."

Every token gets mapped to a fixed-size vector (a list of numbers, typically a few thousand long; 4,096 is a common size, and the exact figure varies by model). Tokens that mean similar things end up with vectors that point in similar directions. The model never reads characters; it reads these vectors.

What this lets you predict

Synonyms behave the same. "Big" and "large" have very similar embeddings, so prompts using one work like prompts using the other.
Typos often survive. "Embedding" and "embeding" have similar embeddings; the model usually understands.
Wrong-but-related terms confuse predictably. "Java" and "JavaScript" embed in similar regions; the model needs context to know which you mean.
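"Point in similar directions" has a standard measurement: cosine similarity. The sketch below uses hand-picked 3-number vectors that are purely hypothetical (real embeddings have thousands of dimensions and are learned, not chosen), so the only thing it demonstrates is the arithmetic of comparing directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", hand-picked for illustration only.
TOY_EMBEDDINGS = {
    "big": [0.90, 0.10, 0.00],
    "large": [0.85, 0.15, 0.05],
    "banana": [0.00, 0.20, 0.95],
}

if __name__ == "__main__":
    for other in ("large", "banana"):
        score = cosine_similarity(TOY_EMBEDDINGS["big"], TOY_EMBEDDINGS[other])
        print(f"big vs {other}: {score:.3f}")
```

With these made-up vectors the synonym pair scores close to 1 and the unrelated pair close to 0, which is the pattern the three predictions above rely on.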

The prompt that demonstrates embeddings

Rate how similar these word pairs are in the model's internal
representation, on a 0-10 scale. Explain your reasoning.

1. dog & puppy
2. dog & cat
3. dog & banana
4. teacher & instructor
5. teacher & janitor
6. teacher & equation
7. JavaScript & TypeScript
8. JavaScript & Java
9. JavaScript & jaywalking

Now tell me: which pair would cause the most prompt-rewording
problems in production, and why?

The model can't read out its real embedding distances either (same introspection gap as tokens) — treat its 0–10 scores as predictions to test, not measurements. The point isn't the exact numbers; it's that you can predict which pairs behave alike before you run them.

Predict the similarity ranking, then check.

You’ll do

Commit to a ranking, then see whether your "synonyms are interchangeable" intuition holds in practice.

Steps

Before running anything, write your own answer: of pairs 5 (teacher & janitor), 8 (JavaScript & Java), 9 (JavaScript & jaywalking), rank which would cause the most production confusion.
Run the prompt above. Note Claude’s pick.
Now test it for real: open two fresh chats. In one, write a prompt using "JavaScript"; in the other, swap in "Java" and change nothing else. Compare the answers.

Verify

You wrote a ranking before running, and you can name one concrete way the Java/JavaScript swap changed the second answer (different language, libraries, or syntax) — confirming near-but-not-equal embeddings.

Stretch. Try the true-synonym pair (big/large) the same way; confirm the output barely changes — that’s why synonym swaps are safe and near-miss swaps aren’t.

§ 17.01.04 · Unit 04

Attention in one paragraph.

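Motto: every token looks at every other token and weighs which ones matter for its own meaning.

"Attention" is the famous trick. Here it is in one sentence: for each token, the model computes a weighted average of the vectors of the tokens it can see, where the weights are learned to highlight which tokens are relevant. In a text-generating model, "the tokens it can see" means the token itself and everything before it; a token never looks ahead at text that has not been written yet.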

Concretely: when the model is processing the token "it" in "The dog ran because it was hungry", attention lets the "it" token "look at" the "dog" token strongly (and ignore "ran", "because", etc). The vector for "it" after attention is updated to include "dog-ness." Pronoun resolution = high attention weight from "it" to "dog".

This happens not once but in many layers, with each layer's attention learning different relationships. Early layers pick up syntax; later layers pick up meaning, intent, even style.
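The "weighted average" can be made concrete with a single attention step in plain Python. This is a sketch under heavy simplification: the 2-number vectors are hypothetical and hand-picked so that "it" matches "dog", the same vectors are reused as keys and values, and a real model runs many such steps in parallel with learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    weights = softmax(scores)
    mixed = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, mixed


TOKENS = ["The", "dog", "ran", "because", "it"]
# Hypothetical key vectors, hand-picked for illustration only.
KEYS = [[0.1, 0.0], [3.0, 0.2], [0.2, 1.0], [0.0, 0.3], [0.5, 0.1]]
VALUES = KEYS            # simplification: reuse the keys as values
QUERY_FOR_IT = [2.0, 0.0]  # hypothetical query vector for the token "it"

if __name__ == "__main__":
    weights, mixed = attend(QUERY_FOR_IT, KEYS, VALUES)
    for token, weight in zip(TOKENS, weights):
        print(f"{token:>8}: {weight:.3f}")
    print("updated vector for 'it':", [round(x, 3) for x in mixed])
```

Because the weight on "dog" dominates, the updated vector for "it" ends up close to the "dog" vector, which is the pronoun-resolution story above expressed as arithmetic.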

Paper trail

Grounded in: Attention Is All You Need (Vaswani et al., 2017). Plain-English takeaway: every token can look at every other token and learn what matters. That Transformer idea is the architecture behind Claude, GPT, Gemini, and most modern LLMs.

What this lets you predict

Order can matter a lot. Putting instructions at the end vs the start changes which tokens get attended to in the final layer.
Long context dilutes attention. A 100k-token document drowns the few tokens that actually matter. This is why retrieval beats stuffing.
Markdown structure helps. Headings give the model strong "section boundary" tokens to anchor attention on.

Make attention fail, then fix it.

You’ll do

Watch a buried instruction get ignored, then watch the same instruction obeyed once it’s anchored.

Steps

In a fresh chat, paste a short instruction buried in clutter: about 8 lines of filler text, then on one middle line Reply only in French., then 8 more filler lines, then your question What is the capital of Japan?
Note whether the reply is in French.
New chat. Move the instruction to the very top AND repeat it at the very bottom, same question. Note the language now.

Verify

You can state which version obeyed "in French." The first-and-last version obeys at least as reliably as the buried one — the placement, not the wording, moved the result.

Stretch. Replace the filler with Markdown headings (## Context, ## Instruction, ## Question) and confirm the structured version is obeyed even with the instruction in the middle.

§ 17.01.05 · Unit 05

The next-token-prediction loop.

Motto: the model only ever picks one token at a time; everything else is a loop.

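Here is the surprising thing: the model does not "write a response." It picks one token, appends it to the input, and runs the prediction step again. A 500-word response is roughly 650 tokens (500 × 1.3), so roughly 650 separate forward passes, each one picking a token based on everything that came before. (Implementations reuse the work already done for earlier tokens, so each pass is cheaper than starting from scratch, but the one-token-at-a-time structure is real.)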

Consequences:

Once a token is picked, it's committed. The model can't go back and revise an earlier word.
Streaming is real. You see tokens stream out because they ARE generated one at a time; the API just sends them as they're produced.
The "think step-by-step" trick works because each thinking token is added to context, which the next prediction can attend to. Writing reasoning out loud literally improves the next answer.
Stop tokens end the generation. The model emits a special end-of-message token when it decides it's done.

Prove reasoning tokens change the answer.

You’ll do

Run one tricky question two ways and watch "show your work" flip a wrong answer to right.

Steps

New chat. Ask for the answer only: A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. How much is the ball? Reply with only the dollar amount, no explanation.
New chat. Same problem, but: ...Think step by step, then give the answer.
Compare. (The answer is $0.05; the no-reasoning version often says $0.10.)

Verify

You have two saved replies. The step-by-step version reaches $0.05; if the terse version got it wrong, you’ve seen reasoning-as-tokens fix it in one shot. (If both are right, try the harder variant in Stretch.)

Stretch. Force the failure: Answer in one word, immediately, no reasoning: a 3-step word problem of your own. The terser you force it, the more the loop has no room to "think out loud."
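For reference, the bat-and-ball arithmetic itself takes two lines once it is written out, which is exactly the "room to think" the terse prompt removes. The sketch below works in integer cents to avoid floating-point noise; the function name is hypothetical.

```python
def solve_bat_and_ball(total_cents=110, difference_cents=100):
    """Solve ball + bat = total, bat = ball + difference, in whole cents."""
    # Substitute the second equation into the first:
    #   ball + (ball + difference) = total  =>  2 * ball = total - difference
    ball = (total_cents - difference_cents) // 2
    bat = ball + difference_cents
    return ball, bat


if __name__ == "__main__":
    ball, bat = solve_bat_and_ball()
    print(f"ball = ${ball / 100:.2f}, bat = ${bat / 100:.2f}")
    # The intuitive answer fails the check: a 10-cent ball implies a $1.10 bat.
    print("intuitive total:", 10 + (10 + 100), "cents")
```

The intuitive $0.10 answer satisfies only one of the two conditions, and checking the second one is the step a forced one-word reply never gets to write down.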

§ 17.02.01 · Unit 06

Temperature, top-p, top-k.

Motto: at each step the model has a probability over every possible next token; these three knobs shape how it picks.

After the model computes probabilities for every next-token candidate, it has to pick one. The picking is controlled by:

Temperature. Reshapes the probability distribution. 0 = always pick the highest-probability token (close to deterministic; small run-to-run differences can remain). 1 = use the raw probabilities. Higher = flatter, more random. The Claude API accepts 0 to 1; some other APIs go up to 2.
Top-p (0 to 1; defaults vary by provider). Only consider the smallest set of tokens whose probabilities sum to at least p. Filters out the unlikely tail.
Top-k (an integer, often unset). Only consider the top k most likely tokens. Hard cutoff.
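The three knobs compose in a fixed order that is easy to see in code. The following is a textbook-style sketch, not any provider's actual implementation: `sample_next` and the toy logits are hypothetical, and real APIs may apply the filters in a different order or with different tie-breaking.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a {token: logit} dict using the three knobs."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: always the top token

    # Temperature: divide logits before softmax (higher = flatter distribution).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: hard cutoff on the number of candidates.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose probabilities reach p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    weights = [prob for _, prob in ranked]
    return rng.choices(tokens, weights=weights)[0]  # choices() renormalizes


if __name__ == "__main__":
    toy_logits = {"Paris": 5.0, "London": 2.0, "banana": -1.0}  # hypothetical
    rng = random.Random(0)
    for temp in (0, 1.0, 100.0):
        picks = [sample_next(toy_logits, temperature=temp, rng=rng) for _ in range(10)]
        print(f"temperature={temp}: {picks}")
```

Running it shows the contrast the lab is after: at temperature 0 every pick is the same token, and as temperature rises the unlikely candidates start to appear.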

Current Claude caveat: these are the classic decoding knobs, but not every current model exposes them. Claude Opus 4.8 (and Opus 4.7) reject non-default temperature, top_p, and top_k — sending one returns a 400 error. For those models you steer with prompting plus adaptive thinking / effort. Models like Claude Sonnet 4.6 and Haiku 4.5 still accept the sampling knobs, so the API half of the lab below targets one of those.

Two ways to feel sampling at work

You can’t set temperature on Opus 4.8 in the chat UI — but the model still samples, so regenerating reveals its built-in variance. The optional API snippet pins temperature explicitly on a model that exposes it.

Same prompt, different outputs — measure the spread.

You’ll do

Generate one creative prompt three times and describe how much the outputs vary.

Steps

In a Claude chat, send: Write a one-sentence opening for a short story about a librarian who discovers a hidden room. Make it interesting.
Regenerate (or re-send in a fresh chat) two more times. Save all three openings.
Write one sentence: how different were they — near-identical, same idea / different wording, or totally different?
Optional, if you have an API key: run the snippet below — it calls a sampling-capable model at temperature=0 three times, then at temperature=1 three times. Note which set is identical and which varies.

Verify

You have three saved openings plus one line describing their variance. (Ran the snippet too? The temperature=0 trio is identical or near-identical; the temperature=1 trio differs — that contrast IS the knob.)

Stretch. Re-run the chat prompt but demand a canonical answer instead: Extract the city name as JSON: {"city": ...} from "I flew into Denver yesterday." three times. It won’t vary — low-entropy tasks barely move regardless of sampling.

# sampling.py  —  uses a model that still accepts temperature (Opus 4.8 rejects it)
import anthropic

client = anthropic.Anthropic()
prompt = "Write a one-sentence opening for a short story about a librarian who discovers a hidden room."

for temp in (0.0, 1.0):
    print(f"\n=== temperature={temp} ===")
    for _ in range(3):
        msg = client.messages.create(
            model="claude-sonnet-4-6",   # sampling-capable; or claude-haiku-4-5
            max_tokens=80,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}],
        )
        print("-", msg.content[0].text.strip())

The rule of thumb: where sampling knobs are supported, use low temperature for canonical answers (extraction, classification, code), medium for one good answer with some variation (writing, creative tasks), and high only if you want chaos. Where knobs are not supported (Opus 4.8), ask for the behavior directly in the prompt.

§ 17.02.02 · Unit 07

Context window economics.

Motto: context is finite, expensive, and the most-attended tokens are at the start and the end.

Every model has a maximum number of tokens it can hold in context at once (200K, 1M, etc). Three things to know:

Bigger windows cost more — linearly on your bill. What you’re billed is tokens × rate, so doubling the input tokens roughly doubles the input cost. (Separately, the attention compute itself grows quadratically — O(n²) in token count — which is why latency climbs faster than price, and why providers cap context. Don’t conflate the two: billed cost is linear, attention compute is quadratic.)
Beginning + end get the most attention. Tokens in the middle of a long context tend to be under-weighted. This is empirically observed across most LLMs ("lost in the middle"). Put your important instruction first AND repeat it last.
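The cache exists. Modern APIs (including Claude) let you cache stable prefixes of your prompt, so re-reading the same system prompt costs about 10% of the normal input rate. The first call that writes the cache pays a small premium, and cached entries expire after a short idle period, so the saving only shows up on repeated calls made close together. This is the single biggest production cost lever.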
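The linear-versus-quadratic distinction is easiest to see as numbers. This sketch assumes the $3 per 1M input tokens figure used later in this practice, and it counts token-to-token comparisons as n squared; generation-style models only compare each token with earlier ones, which roughly halves the count but is still quadratic. Both function names are hypothetical.

```python
def input_cost_usd(tokens, rate_per_million=3.0):
    """Billed input cost: tokens x rate, so it grows linearly."""
    return tokens / 1_000_000 * rate_per_million


def attention_pairs(tokens):
    """Token-to-token comparisons in one attention pass: grows as n squared."""
    return tokens * tokens


if __name__ == "__main__":
    base = 50_000
    for n in (50_000, 100_000, 200_000):
        cost_ratio = input_cost_usd(n) / input_cost_usd(base)
        work_ratio = attention_pairs(n) / attention_pairs(base)
        print(f"{n:>7} tokens: ${input_cost_usd(n):.2f} "
              f"(bill x{cost_ratio:.0f}, attention work x{work_ratio:.0f})")
```

Doubling the context doubles the bill but quadruples the attention work, which is why latency, not price, is usually the first thing to hurt on very long prompts.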

A real "lost in the middle" demo — with a real haystack

We ship a ~2,900-word document with one fact planted in the middle. Download it, paste it into Claude, and ask only the question that fact answers. Grade against the key (don’t peek at the key first — it spoils the test).

haystack.txt — the document (planted fact lives ~58% of the way through). Right-click → Save As, or open and copy all.
haystack_key.txt — the answer key. Open this after you record your result.

Find the needle — middle vs edges.

You’ll do

Test whether Claude recalls a buried fact, then whether moving it to the top makes recall easier.

Steps

Open haystack.txt and copy the whole thing.
In a fresh chat, paste it, then on a new line ask: What does the Meridian contract renewal cost per quarter? Record the answer (MIDDLE result) — a number, a refusal, or "doesn’t say."
New chat. This time paste the fact sentence as the first line (find it yourself, for example by searching haystack.txt for "Meridian", rather than relying on step 2 having surfaced it), then the full document, then the same question. Record the answer (EARLY result).
Open haystack_key.txt. The correct answer is $4,250 per quarter. Mark MIDDLE and EARLY each as HIT or MISS.

Verify

You have a recorded HIT/MISS for at least two positions (MIDDLE, EARLY), each checked against the key’s $4,250. A MIDDLE miss that turns into an EARLY hit is the lost-in-the-middle effect, caught live.

Stretch. Add a LATE run (fact pasted as the last line before the question) for a three-point position curve, then add a fourth run with the fact both first AND last and check whether it beats either edge alone.
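To repeat the experiment at other positions or with your own material, you can build a haystack programmatically. This is a minimal sketch: `build_haystack`, the filler lines, and the needle sentence are all hypothetical and have nothing to do with the contents of the shipped haystack.txt.

```python
def build_haystack(filler_lines, needle, position=0.5):
    """Insert `needle` at a fractional position (0.0 = first line, 1.0 = last)."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    index = round(position * len(filler_lines))
    lines = filler_lines[:index] + [needle] + filler_lines[index:]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical filler and needle, for illustration only.
    filler = [f"Filler sentence number {i} says nothing important." for i in range(100)]
    needle = "The Orion lease renews at $1,200 per month."
    for position in (0.0, 0.58, 1.0):
        document = build_haystack(filler, needle, position)
        line_number = document.splitlines().index(needle) + 1
        print(f"position={position}: needle is line {line_number} of 101")
```

Paste each variant into a fresh chat with the same question and you get the position curve from the Stretch without editing the document by hand.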

§ 17.02.03 · Unit 08

Why prompts cost what they cost.

Motto: cost = input tokens × input rate + output tokens × output rate; output is usually 5x more expensive per token than input.

The cost of a Claude API call is:

total_cost = (input_tokens × input_rate)                [uncached input]
           + (output_tokens × output_rate)
           + (cache_read_tokens × cached_rate)          [only if caching is on]
           + (cache_write_tokens × cache_write_rate)    [only if caching is on]

The output rate is typically ~5x the input rate. This means:

Long responses are more expensive than long prompts. A 4000-token response costs more than a 4000-token system prompt.
Cache is your friend. If your system prompt is 4000 tokens and you call the API 100 times with the same prompt + different user inputs, caching saves you close to 90% on the system-prompt cost (slightly less once you count the one-time cache write).
Asking for shorter responses cuts cost and often improves quality. "Answer in ≤3 sentences" is a cost knob, not just a style preference.
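The formula and the caching claim can be checked with a few lines of arithmetic. The sketch below uses the Sonnet 4.6 figures quoted in this unit ($3 and $15 per 1M tokens, cached reads at 10% of input). The 1.25x cache-write multiplier, the 200-token user message, the 500-token reply, and the function names are assumptions made for the example; it also assumes every call lands while the cache is still warm. Check the current pricing page before budgeting with it.

```python
INPUT_RATE = 3.00 / 1_000_000    # USD per input token (figure quoted in this unit)
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token (figure quoted in this unit)
CACHE_READ_MULT = 0.10           # cached reads at ~10% of the input rate (this unit)
CACHE_WRITE_MULT = 1.25          # ASSUMPTION: cache writes cost 25% more than input


def call_cost(input_tokens, output_tokens, cache_read_tokens=0, cache_write_tokens=0):
    """Cost of one call in USD; input_tokens means uncached input only."""
    return (
        input_tokens * INPUT_RATE
        + output_tokens * OUTPUT_RATE
        + cache_read_tokens * INPUT_RATE * CACHE_READ_MULT
        + cache_write_tokens * INPUT_RATE * CACHE_WRITE_MULT
    )


def batch_cost(calls, system_tokens, user_tokens, output_tokens, cached):
    """Cost of `calls` requests that share one system prompt."""
    if not cached:
        return calls * call_cost(system_tokens + user_tokens, output_tokens)
    first = call_cost(user_tokens, output_tokens, cache_write_tokens=system_tokens)
    rest = call_cost(user_tokens, output_tokens, cache_read_tokens=system_tokens)
    return first + (calls - 1) * rest


if __name__ == "__main__":
    plain = batch_cost(100, 4000, 200, 500, cached=False)
    warm = batch_cost(100, 4000, 200, 500, cached=True)
    print(f"100 calls without caching: ${plain:.4f}")
    print(f"100 calls with caching:    ${warm:.4f}")
    print(f"4000 output tokens: ${call_cost(0, 4000):.4f} "
          f"vs 4000 input tokens: ${call_cost(4000, 0):.4f}")
```

Under these assumptions the whole batch costs a little under half as much with caching, because the system prompt shrinks to roughly a tenth of its cost while the output tokens, which caching does not touch, stay the same.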

The cost-estimation prompt

# Estimate before you build
Estimate the API cost for the following Claude API call. Assume
Claude Sonnet 4.6 pricing (~$3 per 1M input tokens, ~$15 per 1M
output tokens; cached input is 10% of normal input). Check the
current pricing page before using this for a real budget.

System prompt: [paste yours, or use ~2000 tokens of made-up rules]
User message: [paste yours, or describe a typical user request]
Expected response length: [your estimate, e.g. 500 tokens]
Calls per day: [your estimate, e.g. 100]
% of calls reusing the same system prompt: [e.g. 95%]

Give me:
1. Cost per call without caching
2. Cost per call with caching
3. Cost per day at the volume above
4. Where the biggest savings would come from

Estimate your call’s cost, then check it against reality.

You’ll do

Predict one call’s cost from the formula, then read the true token usage and see if you were within 2×.

Steps

Pick a typical request you send Claude. Estimate input tokens (words × 1.3 from Unit 02) and expected output tokens.
Do the math by hand at Sonnet 4.6 rates: est = input/1e6 × $3 + output/1e6 × $15. Write down EST.
Get the truth. API: run the snippet below and read usage.input_tokens + usage.output_tokens, then plug into the same formula for ACTUAL. No key? You can only approximate: send the request in chat, count the words in your request and in the reply, and multiply each by 1.3. Don't ask Claude to report its own token counts; Unit 02 showed those numbers are unreliable.
Compute the ratio EST / ACTUAL.

Verify

Your EST is within 2× of ACTUAL (i.e. ratio between 0.5 and 2.0). If it’s wildly off, your token guess was off — that’s the skill this unit builds.

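Stretch. Send the same system prompt twice with prompt caching enabled (a cache_control marker on the system block, and a prompt long enough to be cacheable), then read usage.cache_read_input_tokens on the second call. Watch the cached portion cost ~10% and your per-call estimate drop.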

# cost_check.py  —  read the real usage field after a call
import anthropic

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-sonnet-4-6",           # $3 / $15 per 1M tokens
    max_tokens=500,
    messages=[{"role": "user", "content": "<paste your typical request>"}],
)
u = msg.usage
cost = u.input_tokens/1e6 * 3 + u.output_tokens/1e6 * 15
print(f"in={u.input_tokens} out={u.output_tokens}  ACTUAL=${cost:.5f}")

§ 17.02.04 · Unit 09

Why models hallucinate.

Motto: the model isn't lying — it's predicting the most likely next token, and "I don't know" is rarely the most likely next token.

Hallucinations have a structural cause. The model is trained on text where humans confidently state things. So when you ask a question, the most likely continuation pattern is "a confident answer," not "an admission of ignorance." Three failure modes:

Fact hallucination — the model invents a citation, function name, person, or statistic that sounds right. Cause: the surrounding pattern is "[author/function/year]" and the model fills the slot with a plausible value.
API hallucination — the model invents a method on a library that doesn't exist. Cause: the library has many real methods that resemble the made-up one, so the made-up one is locally probable.
Reasoning hallucination — the model produces a logically-valid-looking chain that's actually wrong. Cause: each step is locally probable; the global validity isn't checked.

The prevention pattern

When answering this question, follow these rules:

1. If you don't know something with high confidence, say so
   explicitly. Use the format: "I'm not sure about X — please
   verify against [authoritative source]."
2. For any factual claim (name, number, date, function name),
   tell me your confidence: HIGH (would bet money), MEDIUM
   (probably right), LOW (could be wrong).
3. For any external reference (paper, function, library, API),
   give me a way to verify it (URL, command to run, doc to check).
4. If I ask for code, tell me which exact library version you're
   targeting and how I can confirm the function/method exists.

Now answer: [your real question]

Why this works: the rules become part of the context the model attends to when picking each next token, which makes hedged continuations ("I'm not sure about X") far more probable than they would otherwise be. The confidence labels are themselves generated text, not calibrated measurements, so use them to decide what to verify, not as proof.
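Rule 4 asks for a way to confirm that a function exists, and for Python that check is mechanical. The sketch below is one possible approach, not a standard tool: `symbol_exists` is a hypothetical helper, it actually imports the module (so only point it at packages you have installed and trust), and it confirms that a name exists, not that its signature matches what the model claimed.

```python
import importlib


def symbol_exists(module_name, attr_path):
    """Return True if `module_name` imports and `attr_path` resolves on it."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False  # the package (or submodule) is not installed
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


if __name__ == "__main__":
    checks = [
        ("json", "dumps"),            # real function
        ("json", "dumps_fast"),       # plausible-sounding, does not exist
        ("os", "path.join"),          # nested attribute path
        ("no_such_package_xyz", "calibrate"),  # package not installed
    ]
    for module_name, attr_path in checks:
        print(f"{module_name}.{attr_path}: {symbol_exists(module_name, attr_path)}")
```

A ten-second check like this catches the API-hallucination pattern before the invented call reaches a code review.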

Catch a hallucination, then defuse it.

You’ll do

Bait a plausible-but-fake answer, then re-ask with the prevention pattern and watch a hedge appear.

Steps

Bait it. In a fresh chat ask something obscure and citation-shaped: What is the exact API signature of the function `flux_capacitor.calibrate()` in the `delorean` Python package, and which version added it? (The function is invented. Note that a datetime library named delorean does exist on PyPI, which only makes the bait more plausible.)
Note whether Claude invents a signature/version, or says it can’t find one.
New chat. Paste the prevention prompt above, then ask the same question as the "real question." Note the confidence label and verification step it now gives.

Verify

You can point to the difference: the plain ask produced (or risked) a confident fake; the prevention-prompt version flags LOW confidence or a "verify against the package docs" step for the same question.

Stretch. Classify which of the three failure modes (fact / API / reasoning) your bait triggered, and write the one-line cure from the unit next to it.

§ 17.02.05 · Unit 10

What you can debug from outputs alone.

Motto: eight observations explain 80% of "why did it do that."

Most "the model is broken" reports turn out to be one of these:

What you observe	Most likely cause	The fix
Answer drifts mid-response	Sampling too loose or prompt under-specified	Lower temperature where supported; otherwise tighten format, examples, and success criteria
Same prompt, different answers each call	Model sampling + inherent non-determinism	Use deterministic params where supported; otherwise constrain output and test across repeated calls
Long-context query gets wrong answer	"Lost in the middle"	Move key facts to start/end; consider retrieval instead
Made-up function name	Local-probability hallucination	Add "verify against package docs" rule; pin library version
Cost spiked overnight	Cache miss (system prompt changed) or longer responses	Diff your system prompt; check max_tokens cap
First call slow, subsequent fast	Cache warm-up (normal behavior)	Pre-warm cache in deploy script
Output stops mid-sentence	Hit max_tokens limit	Increase max_tokens OR ask for terser format
Refuses a benign request	Safety training, locally tripped by phrasing	Rephrase; explain intent in the system prompt
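Several rows of the table can be read straight off the fields of an API response. The sketch below works on plain values so it runs without a key: `stop_reason` and the usage field names (`input_tokens`, `output_tokens`, `cache_read_input_tokens`) follow the Claude Messages API, while the `diagnose` function, its thresholds, and its wording are hypothetical.

```python
def diagnose(stop_reason, usage, expected_cached_tokens=0):
    """Map response fields to likely causes from the table above.

    `usage` is a dict with the token counts from one response;
    `expected_cached_tokens` is how many prefix tokens you expected to be cached.
    """
    findings = []

    if stop_reason == "max_tokens":
        findings.append("truncated: hit max_tokens; raise the cap or ask for a terser format")

    cache_read = usage.get("cache_read_input_tokens") or 0
    if expected_cached_tokens > 0 and cache_read == 0:
        findings.append("cache miss: the cached prefix was not reused; diff the system prompt")

    input_tokens = usage.get("input_tokens") or 0
    output_tokens = usage.get("output_tokens") or 0
    # Output costs roughly 5x input per token, so compare weighted counts.
    if output_tokens * 5 > input_tokens:
        findings.append("output-heavy: response length dominates the cost of this call")

    return findings


if __name__ == "__main__":
    sample_usage = {"input_tokens": 4200, "output_tokens": 500, "cache_read_input_tokens": 0}
    for line in diagnose("max_tokens", sample_usage, expected_cached_tokens=4000):
        print("-", line)
```

To use it on a real call, pass `msg.stop_reason` and the usage counts from the response; an empty list means none of these three symptoms applies, not that nothing is wrong.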

Diagnose three symptoms from the table.

You’ll do

Turn the table into a reflex: given a symptom, name the cause and fix without re-reading.

Steps

Cover the table. For each of these three symptoms, write the likely cause and the fix from memory: (a) "the output stopped mid-sentence," (b) "the same prompt gives different answers every call," (c) "cost spiked overnight."
Uncover and check your three answers against the rows.
Add one symptom you’ve actually hit with Claude and write its row (observe → cause → fix).

Verify

Your three from-memory answers match the table’s cause+fix (max_tokens cap; sampling/non-determinism; cache miss or longer responses), and you’ve written one real fourth row.

Stretch. For your real symptom, actually apply the fix in a chat or call and confirm the behavior changes — that closes the loop from diagnosis to repair.

§ Walk-away · The "what the model is doing" cheatsheet

One page on your desk.

The mental model from this practice on one page, with blanks you fill from your own numbers as you go. Print it; keep it next to your monitor for the first month after you start building seriously with Claude. The facts are the same for everyone; the fill-ins make it yours.

Copy it, then fill each __ from the matching unit’s lab as you complete it. By the end you have a one-page model in your own measurements, not a generic copy.

# What Claude is actually doing — my one-page model

## The pipeline
Input text → tokens (~1.3 per word, ~4 chars each) → embeddings (vectors of a few thousand numbers)
→ attention layers (each token reweighted by every other token)
→ probability over next token → pick one (per model sampling rules)
→ append → repeat → stop on end-token or max_tokens
[U01] Pipeline stages I recited cold, Claude marked CORRECT: __ /5
[U02] My test sentence: GUESS __ tokens, TRUE __ tokens, DELTA __
[U03] Java/JavaScript swap changed my answer via: __ (language / libraries / syntax)
[U04] Reply-in-French obeyed — buried: __ / first+last: __ (yes/no)
[U05] Bat-and-ball — terse: $__ / step-by-step: $__ (correct = 0.05)

## The knobs that change behavior
- temperature/top_p/top_k: sampling knobs where supported; Opus 4.8 (and 4.7) REJECT non-default values (400 error)
- adaptive thinking / effort: use these instead of sampling knobs on Opus 4.8
- max_tokens: hard cap on response length
- cache: re-using the same system prompt cuts the cost of that prefix by ~90%
[U06] Regenerating my creative prompt 3x varied: __ (none / wording / fully)

## The economics
- output tokens cost ~5x input tokens
- BILLED cost grows LINEARLY with tokens (tokens × rate); attention COMPUTE grows quadratically O(n²) — don't conflate them
- shorter responses = cheaper AND usually better quality
- cache stable prefixes; the savings are real
[U08] My typical call ≈ __ in + __ out tokens ≈ $__ /call ≈ $__ /day

## Why hallucinations happen
- the model picks the most LOCALLY probable next token
- "I don't know" is rarely the most probable continuation
- fact / API / reasoning hallucinations have different cures
- fix by forcing confidence labels and verification rules in the prompt
[U09] The fake-package bait gave me: __ (confident fake / honest 'unknown')

## Long-context gotchas
- "lost in the middle": model under-weights tokens in the middle
- put critical instructions first AND last
- prefer retrieval over context-stuffing at scale
[U07] haystack.txt recall — MIDDLE: __ (hit/miss), EARLY: __ (hit/miss)

## When something looks wrong, ask in order:
1. Are sampling settings too loose, or unsupported for this model?
2. Did I exceed max_tokens?
3. Is the important info buried in the middle of context?
4. Did the model invent a fact that sounds right? (spot-check one or two claims against the web or the docs)
5. Did the system prompt change (cache miss)?
6. Is this a safety refusal? (rephrase intent)
[U10] A real symptom I diagnosed from the table → cause: __

The model is not magic. The model is a token predictor in a loop.
Everything else is engineering on top of that.

Fill the cheatsheet with your numbers.

You’ll do

Turn the generic page into your personal one by transcribing the ten fill-in lines from the labs you ran.

Steps

Copy the cheatsheet above into a note.
Fill the ten [U..] __ lines from your lab results: U01 (stages /5), U02 (guess/true/delta), U03 (swap effect), U04 (French buried vs first+last), U05 (bat-and-ball answers), U06 (variance), U07 (haystack hit/miss), U08 (your call’s tokens + $/day), U09 (bait result), U10 (a real symptom).
Keep it where you build. When a number changes (new model, new typical prompt), update the line.

Verify

No __ blanks remain on the ten fill-in lines, and the U08 line shows a real dollar-per-day figure within 2× of what U08’s lab measured.

Stretch. Pin it next to your monitor or in your team wiki and re-derive the U08 cost line after your next billing day — if your estimate held within 2×, the mental model is working.

This page replaces "the model is acting weird" with a checklist. Past the first week of building seriously, you'll diagnose 80% of issues using the table in Unit 10 + this cheatsheet, before you ever open the docs.
