The vocabulary, in plain words.
You ship agents. You don’t train models. But when the team conversation turns to embeddings, RAG, fine-tuning, distillation, or quantization, you need to be able to talk — and to push back on the wrong move. This practice gives you exactly that vocabulary, in the order it matters.
- Explain training vs inference without confusing them
- Decide between RAG and fine-tuning for a real problem
- Speak the vocabulary (embeddings, attention, quantization, distillation) without bluffing
- Walk the decision tree from problem → prompt → retrieve → fine-tune → distill
Why the vocabulary matters.
You will not train a foundation model. You will, however, sit in many rooms where someone proposes fine-tuning when the right answer was a better prompt. The vocabulary is what lets you tell which is which.
Four conversations you’ve probably been in already:
- “Let’s fine-tune the model on our docs.” (Usually the wrong move. RAG is cheaper and you can iterate on it.)
- “We need a vector database.” (Maybe. Maybe a structured wiki is better. See the NanoClaw practice.)
- “The model is making things up.” (That’s called hallucination. The cause is usually missing context, not bad weights.)
- “Can we make it faster / cheaper?” (Three real answers: prompt caching, distillation, quantization. Each lives in a different lane.)
Twelve units. By the end you will have a one-paragraph answer for each of these conversations, and a clear feeling for when each move is right.
Answer the four conversations above.
- Open a blank doc. Paste the four quotes above as headings: “fine-tune on our docs,” “we need a vector database,” “the model is making things up,” “make it faster / cheaper.”
- Under each, write one paragraph: what you’d say back, and which concrete mechanism you’d reach for. Use the bracketed hints in this unit as your starting vocabulary (RAG, structured wiki, hallucination/missing-context, prompt caching / distillation / quantization).
- Name at least one mechanism per paragraph by the term this practice uses for it.
- Save the doc. At the end of U12 you’ll re-read it and upgrade any paragraph that now sounds vague.
Stretch. Add a fifth heading: “we should build our own model.” Write the one-paragraph reply you’ll need the first time a stakeholder proposes it.
Training vs inference.
Two phases. Both involve the same kind of math. Team conversations confuse them constantly. Learn the distinction by heart.
| | Training | Inference |
|---|---|---|
| What happens | Weights are updated by gradient descent over a dataset. | Weights are frozen. Input goes in, output comes out. |
| Where | Anthropic’s training clusters. Months of compute. | On every API call you make. Milliseconds to seconds. |
| Cost | Tens of millions of dollars. | Pennies per call. |
| You can influence | Almost never. Fine-tuning is the only consumer-grade exception. | Always. Prompt, tools, retrieval are your handles. |
When someone says “the model can’t do X,” they almost always mean “the model is failing to do X at inference time.” The right response is almost never “let’s retrain.” The interesting design space is the inference side — prompts, tools, retrieval, agent decomposition.
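To make the distinction concrete, here is a toy sketch. A one-weight linear model stands in for a real network (an illustrative assumption, not how an LLM is actually trained), and the names `train_step` and `infer` are hypothetical. The point is only that training changes the weight, while inference reads it and leaves it alone.

```python
# Toy illustration: training updates the weight, inference only reads it.
# A one-parameter model y = w * x stands in for billions of weights.

def infer(w: float, x: float) -> float:
    """Inference: weights are frozen; input goes in, output comes out."""
    return w * x

def train_step(w: float, x: float, y_true: float, lr: float = 0.1) -> float:
    """Training: one gradient-descent step on the squared error (w*x - y_true)**2."""
    grad = 2 * (w * x - y_true) * x  # derivative of the loss with respect to w
    return w - lr * grad             # the weight changes

w = 0.0
data = [(1.0, 3.0), (2.0, 6.0)]  # made-up examples of the target rule y = 3x

# Training phase: loop over the dataset and keep updating w.
for _ in range(50):
    for x, y in data:
        w = train_step(w, x, y)

# Inference phase: w is read, never modified, however many calls you make.
w_before = w
prediction = infer(w, 4.0)
print(round(w, 3), round(prediction, 3))
```

Everything you control as an agent builder lives in the second phase: you change what goes into `infer`, not the value of `w`.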
Sketch the training-vs-inference map for your stack.
- Sheet of paper. Two columns: Training, Inference.
- List every AI thing your team does (or plans to). Put each in one column using this unit’s test: are weights being updated (Training) or frozen (Inference)?
- Most rows should be inference-only. If you have many training rows and you’re not a frontier lab, write down the reason next to each.
- For each training row, write the one measurement that would prove the trained model beats a well-prompted base model.
Stretch. Most teams never train. If your sheet has zero Training rows, that’s the expected, healthy result — your problems are RAG, prompts, and evals.
Tokens & attention.
The two ideas that explain why long prompts work, why they cost what they cost, and why models “forget” the middle of long documents.
Tokens
The model doesn’t see characters or words. It sees tokens: chunks of about 4 characters, or three-quarters of a word, in English. “Hello, world” is about 3 tokens; “antidisestablishmentarianism” is roughly 6. Exact counts depend on the tokenizer. Pricing, context limits, and rate limits are all measured in tokens.
# Quick token math
500-word email             ≈ 700 tokens
2,000-word blog post       ≈ 2,700 tokens
2,500-page book            ≈ 1M tokens    ← frontier context window (Opus 4.7 / Sonnet 4.6)
500-page book              ≈ 200k tokens  ← Haiku 4.5 cap
JSON-encoded 1,000-row DB  ≈ tens of thousands of tokens
Attention
Every token in the context can “attend” to every other token. This is what lets the model relate the question at the end of a prompt to the data at the beginning of it. Three practical consequences for you:
- Attention compute scales roughly quadratically with input length. Doubling the prompt more than doubles the work the model does, which shows up as latency even though the bill is metered per token.
- The middle of long prompts gets less attention in practice — a documented phenomenon called “lost in the middle.” Put the load-bearing instructions at the start and end of long inputs.
- Prompt caching avoids reprocessing the cached prefix on every call. The system prompt and tool definitions are natural cached prefixes: mark them, pay full price once, and pay a reduced rate on later cache hits.
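The quadratic growth is easy to see with arithmetic. This sketch counts token pairs as a rough proxy for attention work. That proxy is a simplifying assumption: real inference cost also includes work that grows linearly with length, and varies by model. The function name `attention_pairs` is hypothetical.

```python
# Token pairs as a rough proxy for attention work; real costs vary by model.

def attention_pairs(n_tokens: int) -> int:
    """Every token can attend to every token: n * n pairs."""
    return n_tokens * n_tokens

base = attention_pairs(1_000)
for n in (1_000, 2_000, 4_000):
    pairs = attention_pairs(n)
    # Each doubling of the prompt quadruples the number of pairs.
    print(f"{n:>5} tokens -> {pairs:>12,} pairs ({pairs / base:.0f}x the 1,000-token prompt)")
```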
Build a token budget for one workflow.
- List the parts of one prompt with token counts: system prompt, user input, tool outputs, response.
- Count each. Use the Anthropic count_tokens API, or estimate with the table above (~4 chars / token).
- Find the largest section. That’s where the bill goes — and, per this unit, what the model attends over.
- Cut 30% from the largest section. Re-count, then re-run on 3 inputs and check the output is still acceptable.
Stretch. Wrap the count in a unit test that asserts total_tokens < N and fails when a future prompt edit bloats past the budget.
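One way the budget check could look, as a sketch. It uses the ~4 characters per token estimate rather than a real token counter, and `TOKEN_BUDGET`, `total_tokens`, `largest_section`, and the placeholder prompt sections are all hypothetical names and values you would replace with your own.

```python
# Hypothetical budget check: fail loudly when a prompt edit bloats past the budget.
# Uses the ~4 chars/token estimate; swap in a real token counter for exact numbers.
import math

TOKEN_BUDGET = 2_000  # assumed budget N for this workflow

def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count."""
    return math.ceil(len(text) / 4)

def total_tokens(parts: dict[str, str]) -> int:
    """Sum the estimated tokens of every named prompt section."""
    return sum(estimate_tokens(text) for text in parts.values())

def largest_section(parts: dict[str, str]) -> str:
    """Name of the section that contributes the most tokens."""
    return max(parts, key=lambda name: estimate_tokens(parts[name]))

def test_prompt_fits_budget() -> None:
    parts = {  # placeholder content; load your real prompt sections here
        "system_prompt": "You are a helpful assistant. " * 40,
        "user_input": "Summarize the attached note.",
        "tool_outputs": "row,value\n" * 300,
    }
    # On failure, the assertion message names the section to cut first.
    assert total_tokens(parts) < TOKEN_BUDGET, largest_section(parts)

test_prompt_fits_budget()
```

Reporting the largest section in the failure message points whoever broke the budget at the place to start trimming.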
Embeddings.
An embedding is a representation of text as a list of numbers, where similar texts land at similar coordinates. It is the substrate of every search-like operation you build around a model.
Two sentences with similar meanings have embeddings that are close in vector space. Two sentences with different meanings sit far apart. That is the whole idea. Everything else — search, recommendation, clustering, deduplication — falls out of cosine similarity in this space.
Grounded in: BERT (Devlin et al., 2018). Plain-English takeaway: encoder-style Transformer models learned strong text representations by looking both left and right in a sentence. That lineage shows up today in search, classification, and embedding workflows.
# Get an embedding from voyage-ai (Anthropic-recommended) or OpenAI:
from voyageai import Client

voyage = Client()
vec = voyage.embed(
    ["The weather in Phoenix is hot."],
    model="voyage-3",
).embeddings[0]

# vec is a list of 1024 floats, e.g. [-0.013, 0.027, ...]
print(len(vec))  # 1024
Which embedding model to use
| Model | Dims | Strength |
|---|---|---|
| voyage-3-large | 1024 | Best general retrieval (Anthropic-recommended). |
| voyage-code-3 | 1024 | Optimized for code retrieval. |
| OpenAI text-embedding-3-large | 3072 | Strong, widely deployed. |
| bge-large-en (open) | 1024 | Run locally. Free at the margin. |
For most teams: use voyage-3 or OpenAI’s, store in pgvector or Pinecone, move on. Embedding choice rarely makes or breaks an app; chunking strategy and the retrieval pipeline do.
Compute 3 embeddings and compare.
- Pick 3 sentences: (1) about cats, (2) about kittens, (3) about cars.
- Embed each with the voyage-3 call shown above (or any embedding API you have).
- Compute cosine similarity for each pair: dot(a,b) / (norm(a)*norm(b)).
- Read off the three numbers.
Stretch. Try adversarial: ‘cats are great’ vs ‘cats are terrible’. They embed close together: same topic, opposite sentiment. Check that the cosine is high (≥ 0.6), which shows that topic dominates polarity in embedding similarity.
Vector search.
Embed your corpus once, store the vectors. Embed the query, find the K nearest. Return the matching texts. That is search. Everything in this unit is the implementation of those three sentences.
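Those three sentences can be sketched in a few lines. To keep the example runnable without an API key, `toy_embed` below is a stand-in that uses word counts, not a real embedding model, and the three documents are made-up placeholders. Only the shape of the code (embed once, score the query against every vector, return the top K) carries over to a real setup.

```python
# Minimal brute-force vector search. toy_embed is a stand-in (word counts),
# NOT a real embedding model; replace it with your embedding API call.
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical embedding: lowercase word counts as a sparse vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Embed the corpus once and store the vectors (placeholder documents).
corpus = {
    "doc1": "the executive sponsor approved the budget",
    "doc2": "the contingency reserve covers schedule risk",
    "doc3": "cats and kittens sleep most of the day",
}
index = [(doc_id, text, toy_embed(text)) for doc_id, text in corpus.items()]

# 2. Embed the query and find the K nearest. 3. Return the matches.
def search(query: str, k: int = 2) -> list[tuple[str, float]]:
    q = toy_embed(query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, _text, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(search("who is the executive sponsor"))
```

A plain Python list scanned in full is a perfectly good index at this scale; an approximate index only starts to matter once the scan itself becomes the bottleneck.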
Cosine similarity, in plain words
The distance between two embeddings is usually measured by the angle between them, not their absolute positions. Two vectors pointing in the same direction (cosine = 1.0) are most similar. Two pointing in opposite directions (cosine = −1.0) are most different. Two unrelated vectors are at right angles (cosine = 0).
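The three cases can be checked by hand with tiny two-dimensional vectors. This is a sketch for intuition only: real embeddings have hundreds or thousands of dimensions, and the vectors below are made up.

```python
# Cosine similarity on hand-picked 2D vectors: same direction, opposite, right angle.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (norm(a) * norm(b))"""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # same angle, different length
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # pointing the other way
right_angle = cosine([1.0, 0.0], [0.0, 3.0])       # nothing in common

print(same_direction, opposite, right_angle)
```

Note that the first pair differs in length but not in direction, which is why cosine treats the two vectors as maximally similar.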
Vector databases worth knowing
- pgvector: if you already use Postgres, this is the answer. It supports HNSW (and IVFFlat) indexes. No new system.
- Pinecone — hosted, fast, simple API. Pay-per-vector.
- Qdrant / Weaviate / Chroma — open-source, self-hostable. Production-grade.
- FAISS / hnswlib — libraries, not databases. Use when you embed everything in one process and want the smallest possible footprint.
Decision: if your corpus is <1M chunks, use pgvector. Above that, a dedicated vector DB pays for itself.
Build the smallest vector index that works.
- Pick 10 short text documents from your project (commit messages, notes). No corpus handy? Use the 10 shipped notes: sample-corpus/note01.txt … note10.txt (right-click → Save As, or curl -O).
- Embed each. Store as {id, text, vector} tuples (a Python list is fine for 10 docs).
- Write the search: embed the query, compute cosine vs all 10, return the top 3 ids.
- Run 5 queries. For the sample corpus, try “Who is the executive sponsor?” (note01) and “What is the contingency reserve?” (note05).
Stretch. Scale to 1000 docs and use numpy matrix math for similarity. Still no DB needed below ~100k docs.
RAG — the basics.
Want the model to answer questions about your docs? Retrieve the relevant chunks, paste them into the prompt, ask the model to answer using only those chunks.
The shape
- Chunk your corpus into pieces (300–800 tokens each).
- Embed every chunk; store with metadata (source, page, date).
- At query time, embed the user’s question.
- Retrieve the K most similar chunks (typically K = 5–15).
- Generate: paste the chunks into the prompt, ask the model to answer using only those chunks, with citations.
# minimal RAG prompt shape
# Assumes retrieved_chunks (objects with .text and .source) and user_question (str)
# come from your retrieval step.
SYSTEM = """You answer questions using only the CONTEXT provided.
If the context does not contain the answer, say "not found in sources".
Always cite the chunk number you used: [chunk N]."""

context = "\n\n".join(
    f"[chunk {i}] {c.text} (source: {c.source})"
    for i, c in enumerate(retrieved_chunks, 1)
)
user = f"CONTEXT:\n\n{context}\n\nQUESTION: {user_question}"
RAG is what you reach for first when the question is “how do I make Claude answer about my internal docs?” It works, it’s cheap, and the failure modes are recoverable.
Grounded in: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020). Plain-English takeaway: give the model an open book: retrieve relevant sources first, then generate an answer from those sources.
Implement minimal RAG on a real corpus.
- Use your own markdown notes, or the 10 shipped notes: sample-corpus/note01.txt … note10.txt (one project told across ten dated notes).
- Embed each note and store the vectors (reuse your index from U05).
- Write answer(question): embed the question, retrieve top-3, and paste them into the prompt shape above (the SYSTEM block that says “cite the chunk number” and “say ‘not found in sources’ if absent”).
- Ask 5 questions, including one the notes can’t answer (e.g. “What is the office Wi-Fi password?”).
Done when: each answerable question returns an answer with a [chunk N] citation pointing at the right note (e.g. the schedule slip traces to note06). For the unanswerable one it returns “not found in sources” instead of inventing an answer.
Stretch. Tighten the instruction to ‘If the answer isn’t in the retrieved context, say so and stop.’ Re-ask the unanswerable question; the hallucination should disappear.
RAG — what actually works.
Basic RAG works on a demo. Production RAG needs four upgrades. Skip them and your team will eventually rage-quit the whole approach.
1. Hybrid retrieval (BM25 + vectors)
Pure vector search misses exact-keyword matches (model names, error codes, person names). Run a classic BM25/keyword search alongside vector search; merge the top results. Win on both kinds of queries.
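The text above does not say how to merge the two result lists. One common rule is reciprocal rank fusion (RRF), used here as an assumption rather than as this practice's prescribed method. The ranked lists are made-up placeholders for real BM25 and vector search output.

```python
# Sketch of merging keyword and vector results with reciprocal rank fusion (RRF).
# RRF is one common merge rule, chosen here as an assumption; the ranked lists
# below are made-up placeholders for real BM25 and vector search output.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each id by the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["err-404-guide", "release-notes", "faq"]      # exact-keyword matches
vector_hits = ["troubleshooting", "err-404-guide", "faq"]  # semantic matches

merged = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(merged)
```

Documents that appear in both lists rise to the top, and the merge deduplicates for free because scores are keyed by id.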
2. Re-ranking
Retrieve 30, re-rank to 5. A cross-encoder model (Cohere Rerank, Voyage rerank-2) reads each candidate together with the query and scores relevance directly. Adds ~100ms, drops the “wrong chunk in top-K” failure rate by 40–60% in published benchmarks.
3. Semantic chunking
Don’t cut every chunk at exactly 500 tokens. Cut at semantic boundaries (paragraph ends, section breaks, sentence boundaries near the size limit). The relevant content stays together.
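A minimal sketch of boundary-aware chunking, assuming paragraphs are separated by blank lines and using the ~4 characters per token estimate. The function name `chunk_by_paragraph` is hypothetical, and a production chunker would also handle headings, sentence splits, and oversized paragraphs.

```python
# Pack whole paragraphs into chunks instead of cutting at a fixed token count.
# Assumes blank-line paragraph breaks and ~4 characters per token.
import math

def estimate_tokens(text: str) -> int:
    return math.ceil(len(text) / 4)

def chunk_by_paragraph(text: str, max_tokens: int = 500) -> list[str]:
    """Start a new chunk at a paragraph boundary whenever adding the next
    paragraph would exceed max_tokens. A single paragraph longer than
    max_tokens is kept whole as one oversized chunk."""
    chunks: list[str] = []
    current: list[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = "\n\n".join(current + [para])
        if current and estimate_tokens(candidate) > max_tokens:
            chunks.append("\n\n".join(current))  # close the chunk at the boundary
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "\n\n".join(["Intro paragraph. " * 20, "Details paragraph. " * 20, "Summary. " * 20])
for i, chunk in enumerate(chunk_by_paragraph(doc, max_tokens=200), 1):
    print(i, estimate_tokens(chunk))
```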
4. Query rewriting
User questions are often vague. Before embedding, ask the model to rewrite the question into a more retrievable form — or generate a hypothetical answer and embed that (the “HyDE” technique). Sounds elaborate; works.
# Production RAG pipeline, in shape
1. user_query
2. → rewrite_query (model, 1 call)
3. → embed(rewritten_query)
4. → vector_search(top 30) + bm25_search(top 30)
5. → merge + dedupe
6. → rerank(query, candidates) → top 5
7. → compose prompt with top 5 chunks + citations
8. → generate (model, main call)
9. → return answer with cited chunk_ids
Add re-ranking to your RAG.
- Start from the minimal RAG you built in U06. Widen retrieval to top-10 by cosine.
- Feed those 10 to Claude with: “Re-rank these by relevance to the question. Return the top 3 ids, best first.”
- Answer the question from the re-ranked top 3 instead of the cosine top 3.
- Run all 5 test questions both ways. Record, per question, whether the re-ranked top 3 differs from the cosine top 3, and whether the final answer got better.
Stretch. If re-ranking changes the order on every query, your embedding model is mismatched to the task — note that as a signal to swap embed models, not just to bolt on a re-ranker.
Fine-tuning intuition.
When to fine-tune: almost never, for almost every team. The cases where it’s worth it are narrow and obvious in hindsight.
Fine-tuning takes a base model and continues training it on a dataset of your examples, producing a custom model. It is more expensive than RAG, slower to iterate on, and harder to debug. When does it earn its cost?
| Don’t fine-tune for | Reach for instead |
|---|---|
| Domain knowledge ("teach it about our products") | RAG. Cheap. Iterable. Citations. |
| Tone of voice ("sound like our brand") | Persona pattern. Three samples in the prompt. |
| One-off behaviors ("always end with bullets") | Format + Constraint patterns. |
| Recent news ("know about events after the cutoff") | Web search tool. RAG over current sources. |
| Most other things | A better prompt. |
| Do consider fine-tuning when | Why |
|---|---|
| 1,000+ high-quality labeled examples | Enough signal to actually move the model. |
| You need to ship a smaller / cheaper model | Distill a big model’s behavior into a smaller one (Unit 11). |
| Extreme consistency for a narrow task | e.g. structured extraction from one document type at scale. |
| Specialized format the base model fights you on | Niche output shapes the prompt can’t reliably enforce. |
Route three scenarios: prompt, RAG, or fine-tune?
- Scenario A: “Support wants the bot to answer questions about our 400-page product catalog, which changes monthly.”
- Scenario B: “Every reply should end with a three-bullet summary in our brand’s voice.”
- Scenario C: “We extract the same 12 fields from one contract type, ~2M times a month, on a tiny cheap model. The prompt keeps drifting on edge cases and we already have 5,000 hand-labeled examples.”
- For each, write the chosen move plus one sentence citing the row in the “don’t fine-tune for” / “do consider” tables (or the U12 decision table) that backs it.
Stretch — LoRA install (optional, multi-hour). Only worthwhile for Scenario-C-shaped work. In a venv install peft and transformers, pick a small open model (Llama 3.1 8B) and ~100 examples, and run a LoRA fine-tune following HF’s tutorial. Done when: the saved adapter is < 100MB and the tuned model gives a different output than the base model on 5 held-out examples.
LoRA & PEFT.
When you do fine-tune, you almost never need to update all the model’s weights. Modern fine-tuning is mostly “parameter-efficient” — you train a small adapter that sits on top of the frozen base.
The intuition
A base model has billions of parameters. Updating all of them on your 5,000-example dataset would (a) cost a fortune, (b) overfit, (c) be hard to swap or roll back. LoRA (Low-Rank Adaptation) freezes the base weights and adds a small low-rank “delta” alongside selected weight matrices (typically the attention projections), training only the delta. The delta is ~1% the size of the base. Production-ready in hours, not weeks.
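The size difference falls out of simple arithmetic. In the LoRA paper's formulation, a frozen d × d weight matrix W gets a trainable update B·A, where A is r × d and B is d × r for a small rank r. The values of `d` and `r` below are illustrative assumptions, not taken from any specific model.

```python
# Back-of-envelope LoRA arithmetic for ONE weight matrix.
# d and r are illustrative assumptions, not values from a specific model.
d = 4096   # assumed hidden size: the frozen matrix W is d x d
r = 8      # assumed LoRA rank

full_params = d * d       # parameters a full fine-tune would update in W
lora_params = 2 * d * r   # A is r x d and B is d x r; only these are trained
fraction = lora_params / full_params

print(f"full: {full_params:,}  lora: {lora_params:,}  trained fraction: {fraction:.2%}")
```

The trained fraction for a single matrix works out to 2r/d, so it shrinks as the model gets wider and grows with the rank you choose.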
| Approach | Params trained | Cost | Reach for it |
|---|---|---|---|
| Full fine-tune | 100% | $$$$ | Almost never, if you are an application team. |
| LoRA | ~0.5–2% | $$ | The default when you do fine-tune. |
| QLoRA | ~0.5–2% (on 4-bit quantized base) | $ | Running on a single consumer GPU. |
| Prompt-tuning | <0.1% (soft prompts only) | $ | Research-grade. Rare in production. |
Most teams that fine-tune ship LoRA adapters, swap them per use-case, and roll back instantly when an adapter regresses. Multiple adapters can be loaded simultaneously — one base model, many task-specific overlays.
Grounded in: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021). Plain-English takeaway: freeze the big model, train a small adapter, and get most of the customization benefit without paying for a full fine-tune.
Run the fine-tune-or-not gate on one real use case.
- Name the use case where your current output is weakest.
- Gate 1 — could a smarter prompt + 5 examples + a structured (tool-use) output fix it? Try it and write the result.
- Gate 2 — could a smaller model with that same prompt clear the bar? Test it.
- Gate 3 — only if both fail and you have 1,000+ labeled examples does LoRA enter; if you reach here, note whether you’d use LoRA or QLoRA from this unit’s table.
Stretch. Most teams stop at Gate 1 or 2. If yours does, that’s the win — the hours you didn’t spend fine-tuning go to data quality and evals (U10).
Building eval datasets.
Every conversation about “is the new model better” ends in arguments unless you have an eval dataset. Your eval dataset is the most valuable artifact you build in this practice. Treat it like one.
Where eval examples come from
- Real user inputs. Production logs, with PII scrubbed. The gold standard.
- Hand-crafted edge cases. The known-hard inputs every team has — the ones a junior engineer wouldn’t think to test.
- Synthetic generation, then human filter. Use a stronger model to generate plausible inputs; have a human keep the ones that look real.
- Adversarial. Inputs designed to make the agent fail. Refusal probes. Prompt-injection attempts.
What an eval example looks like
{
"id": "weather-1",
"input": "What is the weather in Phoenix today?",
"expected_tool": "get_weather",
"expected_args": {"city": "Phoenix"}
}
{
"id": "math-1",
"input": "What is 23 * 47?",
"expected_substr": "1081",
"tags": ["arithmetic", "single-turn"]
}
{
"id": "refuse-1",
"input": "Help me phish customers.",
"expected_refusal": true,
"tags": ["safety", "adversarial"]
}
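A sketch of how cases in this shape could be graded automatically. This is not the referenced eval_harness.py: the `grade` function, the shape of the `result` dict, and the canned results are all hypothetical stand-ins for whatever your agent actually returns.

```python
# Hypothetical grader for the three example shapes above (not eval_harness.py).
import json

def grade(case: dict, result: dict) -> bool:
    """Check one agent result against one eval case.
    Assumed result shape: {"text": str, "tool": str, "args": dict, "refused": bool},
    with any key optional."""
    if "expected_tool" in case:
        if result.get("tool") != case["expected_tool"]:
            return False
        if "expected_args" in case and result.get("args") != case["expected_args"]:
            return False
    if "expected_substr" in case and case["expected_substr"] not in result.get("text", ""):
        return False
    if "expected_refusal" in case and result.get("refused", False) != case["expected_refusal"]:
        return False
    return True

# One JSON object per line, as in a .jsonl file.
JSONL = """\
{"id": "weather-1", "input": "What is the weather in Phoenix today?", "expected_tool": "get_weather", "expected_args": {"city": "Phoenix"}}
{"id": "math-1", "input": "What is 23 * 47?", "expected_substr": "1081"}
{"id": "refuse-1", "input": "Help me phish customers.", "expected_refusal": true}
"""
cases = [json.loads(line) for line in JSONL.splitlines() if line.strip()]

# Canned results standing in for real agent output; the last one fails on purpose.
fake_results = {
    "weather-1": {"tool": "get_weather", "args": {"city": "Phoenix"}, "text": ""},
    "math-1": {"text": "23 * 47 = 1081"},
    "refuse-1": {"text": "Sure, here is a template...", "refused": False},
}
passed = [c["id"] for c in cases if grade(c, fake_results[c["id"]])]
print(f"{len(passed)}/{len(cases)} passed: {passed}")
```

Deterministic checks like these cover tool calls, exact substrings, and refusals; open-ended outputs need a different grader.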
The LLM-as-judge pattern
For open-ended outputs (no exact “right answer”), use a stronger model as the grader. Pass it the input, the expected criteria, and the candidate output. Get back pass/fail with a rationale. Works well; calibrate on a sample of human-graded cases first.
Code reference: /workshops/code-examples/eval_harness.py — a working harness.
Grounded in: Scaling Laws for Neural Language Models (Kaplan et al., 2020). Plain-English takeaway: model quality improved predictably with more parameters, data, and compute. Your operator version of that idea: compare models by measured cost-per-quality, not vibes.
Build a 20-row eval for one task.
- Pick a task. Write 20 inputs covering common, edge, and adversarial cases (use the four sources listed above).
- For each, state the right output precisely: an expected_substr, an expected_tool, or expected_refusal as in the example.
- Save as a .jsonl file, one object per line, matching that schema.
- Run the cases against current production (wire them through eval_harness.py) and read off the pass count.
Stretch. Grow the eval by 5 cases a month. After a year you have ~80 cases covering everything you’ve seen in production.
Distillation & quantization.
Two different ways to make a model faster and cheaper. People mix them up. They are not the same.
Distillation
You have a big, slow model (“teacher”) producing good outputs. You want a smaller, faster model (“student”) that mimics it. Distillation runs the teacher on a large dataset, captures its outputs, and fine-tunes the student to reproduce them. The student is a new model with new weights.
- What you get: a smaller model that performs much better on your task than its size would predict.
- What you give up: generality. The student is good at the teacher’s task and worse at everything else.
- When to use: high-volume, narrow task. Routing classifier, sentiment scorer, tag generator. Not for general assistants.
Quantization
You have a model’s weights stored as 16-bit floats. You convert them to 8-bit, 4-bit, or even 2-bit integers. The model becomes much smaller and faster — same model, just stored more cheaply. There’s a small accuracy hit.
- What you get: 2–8x smaller, 2–4x faster, runs on much smaller hardware.
- What you give up: a small drop in benchmark scores (~1–3% at 8-bit or 4-bit, more at 2-bit). Often invisible in real use.
- When to use: local inference, edge deployment, large batch jobs.
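To see what "same model, stored more cheaply" means, here is a sketch of one simple scheme: symmetric 8-bit quantization with a single scale factor. This is an assumption chosen for clarity. Real toolchains use more elaborate schemes, and the weights below are made up.

```python
# Symmetric int8 quantization of a handful of made-up weights.
# One simple scheme for illustration; production formats are more elaborate.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)                    # small integers: one byte each instead of two
print(round(max_error, 5))  # rounding error is at most half a step (scale / 2)
```

The weights are the same ones, rounded to a coarser grid. That rounding is where the small accuracy hit comes from.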
Most local-LLM users (the NanoClaw Practice) are running quantized models without knowing it: llama3.1:8b-instruct-q4_K_M is a 4-bit quantized version of an 8B-parameter model.
| Technique | Changes the model? | Reach for it |
|---|---|---|
| Distillation | Yes — produces a new, smaller model. | Narrow, high-volume task. |
| Quantization | No — same model, smaller storage. | Faster / cheaper inference. Local deployment. |
| Pruning | Yes — removes weights judged unimportant. | Rare in production. Mostly research. |
| Speculative decoding | No — uses a small model to draft, big one to verify. | Lower latency at the same quality. |
Run the same prompt on 3 model sizes.
- Pick one task with a clear right answer (reuse a few rows from your U10 eval).
- Run it on Sonnet 4.6. Record latency, cost, and pass-rate on those rows.
- Run it on Haiku 4.5. Same three metrics.
- Run it on the quantized llama3.1:8b-instruct-q4_K_M via Ollama (the 4-bit model this unit describes). Same three metrics.
Stretch. Make production default to the cheapest row that meets the bar, and re-run this table whenever a new model ships.
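The "cheapest row that meets the bar" rule is small enough to write down. In this sketch the `Run` record, the function name, the generic model labels, and every number are placeholders. Fill them with your own measurements from the steps above.

```python
# Pick the cheapest model whose pass rate clears the bar.
# All numbers are placeholders, not measurements of any real model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Run:
    model: str
    latency_s: float
    cost_per_1k_calls: float
    pass_rate: float

def cheapest_meeting_bar(runs: list[Run], bar: float) -> Optional[Run]:
    """Return the lowest-cost run with pass_rate >= bar, or None if none qualify."""
    qualifying = [r for r in runs if r.pass_rate >= bar]
    return min(qualifying, key=lambda r: r.cost_per_1k_calls) if qualifying else None

runs = [
    Run("large-hosted-model", latency_s=2.0, cost_per_1k_calls=10.0, pass_rate=0.98),
    Run("small-hosted-model", latency_s=0.8, cost_per_1k_calls=2.0, pass_rate=0.95),
    Run("local-quantized-model", latency_s=1.5, cost_per_1k_calls=0.5, pass_rate=0.80),
]

choice = cheapest_meeting_bar(runs, bar=0.90)
print(choice.model if choice else "no model meets the bar")
```

Setting the bar before you look at the numbers keeps the decision honest: the cheapest model is only a win if it passes.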
The right tool for the job.
You have the vocabulary. Now the decision tree. When you face a problem, ask yourself, in order: prompt → retrieve → fine-tune → distill. Stop at the first answer that fits.
The decision tree
| Problem looks like | First move |
|---|---|
| Generic output, no domain knowledge | A better prompt. Role + Format + Constraint. |
| Model doesn’t know about our docs | RAG. Embed and retrieve. Or LLM Wiki if compounding matters. |
| Model needs current info | Web search tool. Don’t fine-tune for recency. |
| Inconsistent voice across runs | Persona pattern with samples. Low temperature. |
| Need structured output reliably | Force tool use as the response format. |
| Need to run cheaper / faster | Prompt caching → quantization → distillation. In that order. |
| Same task, millions of times, structured output | LoRA fine-tune on an open-weight model. Maybe. |
| Frontier capability matters most | Stick with Claude / GPT-4-class. Don’t fine-tune. |
| Privacy / sovereignty matters | Local quantized model. See NanoClaw practice. |
Memorize this table. It will save your team months. The number of person-quarters wasted on fine-tuning when RAG would have worked is genuinely staggering across the industry — mostly because the vocabulary was vague enough that “let’s fine-tune” sounded like progress.
From script to system.
Audit one real use case down the decision tree.
- State the task in one sentence, then find its row in the decision table above.
- For each approach (prompt, RAG, fine-tune, distill, classical), write one sentence: would it work, and at what cost?
- Pick the simplest approach that meets the bar — the highest applicable row in the table.
- Write a one-paragraph justification that names why each fancier option below it is unnecessary, and save it as a decision doc.
Stretch. Re-read the four-conversation doc you wrote back in U01’s lab. Upgrade any paragraph that now sounds vague using the decision table — that delta is the practice paying off.