Practices/ AI Engineering Foundations
3 days · 12 units
Pradhya Practice 06 · AI Engineering Foundations Beginner

The vocabulary, in plain words.

You ship agents. You don’t train models. But when the team conversation turns to embeddings, RAG, fine-tuning, distillation, or quantization, you need to be able to talk — and to push back on the wrong move. This practice gives you exactly that vocabulary, in the order it matters.

Audience
Engineers & PMs who ship LLM apps
Length
3 sessions · 90 min each
Walk-away
The mental model + the wrong-move alarms
Prereq
Capable Series or working API familiarity
What you’ll be able to do by the end
  • Explain training vs inference without confusing them
  • Decide between RAG and fine-tuning for a real problem
  • Speak the vocabulary (embeddings, attention, quantization, distillation) without bluffing
  • Walk the decision tree from problem → prompt → retrieve → fine-tune → distill
§ 06.01.01 · Unit 01

Why the vocabulary matters.

You will not train a foundation model. You will, however, sit in many rooms where someone proposes fine-tuning when the right answer was a better prompt. The vocabulary is what lets you tell which is which.

Four conversations you’ve probably been in already:

  • “Let’s fine-tune the model on our docs.” (Usually the wrong move. RAG is cheaper and you can iterate on it.)
  • “We need a vector database.” (Maybe. Maybe a structured wiki is better. See the NanoClaw practice.)
  • “The model is making things up.” (That’s called hallucination. The cause is usually missing context, not bad weights.)
  • “Can we make it faster / cheaper?” (Three real answers: prompt caching, distillation, quantization. Each lives in a different lane.)

Twelve units. By the end you will have a one-paragraph answer for each of these conversations, and a clear feeling for when each move is right.

The promise You don’t need to write a training loop to make good ML decisions. You need to be able to name the trade-offs precisely. This practice is the naming.

Answer the four conversations above.

You’ll do
Draft a one-paragraph reply to each of the four scenarios in this unit — in your own words, before the rest of the practice arms you.
Steps
  1. Open a blank doc. Paste the four quotes above as headings: “fine-tune on our docs,” “we need a vector database,” “the model is making things up,” “make it faster / cheaper.”
  2. Under each, write one paragraph: what you’d say back, and which concrete mechanism you’d reach for. Use the bracketed hints in this unit as your starting vocabulary (RAG, structured wiki, hallucination/missing-context, prompt caching / distillation / quantization).
  3. Name at least one mechanism per paragraph by the term this practice uses for it.
  4. Save the doc. At the end of U12 you’ll re-read it and upgrade any paragraph that now sounds vague.
Verify
Four paragraphs exist, one per conversation. Each names a concrete mechanism from this page (e.g. RAG, prompt caching, quantization, distillation, web-search tool, structured wiki) — not just “use AI better.”

Stretch. Add a fifth heading: “we should build our own model.” Write the one-paragraph reply you’ll need the first time a stakeholder proposes it.

§ 06.01.02 · Unit 02

Training vs inference.

Two phases. Both involve the same kind of math. The conversations confuse them constantly. Learn the distinction by heart.

Training weights updated $ millions · months not your problem Inference weights frozen $ pennies · seconds your whole design space
Two phases · the interesting one is on the right
TrainingInference
What happens Weights are updated by gradient descent over a dataset.Weights are frozen. Input goes in, output comes out.
Where Anthropic’s training clusters. Months of compute. On every API call you make. Milliseconds to seconds.
Cost Tens of millions of dollars. Pennies per call.
You can influenceAlmost never. Fine-tuning is the only consumer-grade exception. Always. Prompt, tools, retrieval are your handles.

When someone says “the model can’t do X,” they almost always mean “the model is failing to do X at inference time.” The right response is almost never “let’s retrain.” The interesting design space is the inference side — prompts, tools, retrieval, agent decomposition.

The chef metaphor The model is a chef who has trained for years and now cannot change. Your prompt is the order. Your tools are the pantry. The kitchen, the cuisine, the technique — those are fixed. You don’t reach for a new chef every time the order’s wrong.

Sketch the training-vs-inference map for your stack.

You’ll do
Put every AI thing your team does into the Training column or the Inference column.
Steps
  1. Sheet of paper. Two columns: Training, Inference.
  2. List every AI thing your team does (or plans to). Put each in one column using this unit’s test: are weights being updated (Training) or frozen (Inference)?
  3. Most rows should be inference-only. If you have many training rows and you’re not a frontier lab, write down the reason next to each.
  4. For each training row, write the one measurement that would prove the trained model beats a well-prompted base model.
Verify
Count the columns: the Inference count is higher than the Training count. Every Training row has a written justification and a named measurement beside it; none is left blank.

Stretch. Most teams never train. If your sheet has zero Training rows, that’s the expected, healthy result — your problems are RAG, prompts, and evals.

§ 06.01.03 · Unit 03

Tokens & attention.

The two ideas that explain why long prompts work, why they cost what they cost, and why models “forget” the middle of long documents.

Tokens

The model doesn’t see characters or words. It sees tokens — chunks of about 4 characters or three-quarters of a word in English. “Hello, world” is 3 tokens. “antidisestablishmentarianism” is roughly 6. Pricing, context limits, and rate limits are all measured in tokens.

# Quick token math
500-word email            ≈     700 tokens
2,000-word blog post      ≈   2,700 tokens
2,500-page book           ≈      1M tokens   ← frontier context window (Opus 4.7 / Sonnet 4.6)
500-page book             ≈    200k tokens   ← Haiku 4.5 cap
JSON-encoded 1,000-row DB ≈   tens of thousands of tokens

Attention

Every token in the context can “attend” to every other token. This is what lets the model relate the question at the end of a prompt to the data at the beginning of it. Three practical consequences for you:

  • Cost scales quadratically in input length (roughly). Doubling the prompt more than doubles the inference cost.
  • The middle of long prompts gets less attention in practice — a documented phenomenon called “lost in the middle.” Put the load-bearing instructions at the start and end of long inputs.
  • Caching kills the quadratic cost on the cached prefix. The system prompt and tool defs are cached prefixes. Mark them; pay them once.
The non-obvious move For long-context tasks, repeat the question at the bottom of the prompt. Sounds redundant; produces measurably better answers. The attention math explains why.

Build a token budget for one workflow.

You’ll do
Pick a workflow. Count tokens at each step. Find the bottleneck, then cut it.
Steps
  1. List the parts of one prompt with token counts: system prompt, user input, tool outputs, response.
  2. Count each. Use the Anthropic count_tokens API, or estimate with the table above (~4 chars / token).
  3. Find the largest section. That’s where the bill goes — and, per this unit, what the model attends over.
  4. Cut 30% from the largest section. Re-count, then re-run on 3 inputs and check the output is still acceptable.
Verify
Your new total token count is at least 15% below the original number, and the outputs on the 3 inputs still pass your eyeball check.

Stretch. Wrap the count in a unit test that asserts total_tokens < N and fails when a future prompt edit bloats past the budget.

§ 06.01.04 · Unit 04

Embeddings.

An embedding is a representation of text as a list of numbers, where similar texts land at similar coordinates. The substrate of every search-like operation a model does.

Two sentences with similar meanings have embeddings that are close in vector space. Two sentences with different meanings sit far apart. That is the whole idea. Everything else — search, recommendation, clustering, deduplication — falls out of cosine similarity in this space.

Paper trail

Grounded in: BERT (Devlin et al., 2018). Plain-English takeaway: encoder-style Transformer models learned strong text representations by looking both left and right in a sentence. That lineage shows up today in search, classification, and embedding workflows.

# Get an embedding from voyage-ai (Anthropic-recommended) or OpenAI:
from voyageai import Client
voyage = Client()

vec = voyage.embed(
    ["The weather in Phoenix is hot."],
    model="voyage-3",
).embeddings[0]
# vec is a list of 1024 floats, e.g. [-0.013, 0.027, ...]
print(len(vec))  # 1024

Which embedding model to use

ModelDimsStrength
voyage-3-large 1024Best general retrieval (Anthropic-recommended).
voyage-code-3 1024Optimized for code retrieval.
OpenAI text-embedding-3-large3072Strong, widely deployed.
bge-large-en (open) 1024Run locally. Free at the margin.

For most teams: use voyage-3 or OpenAI’s, store in pgvector or Pinecone, move on. Embedding choice rarely makes-or-breaks an app — chunking strategy and retrieval pipeline do.

Compute 3 embeddings and compare.

You’ll do
Embed three sentences. Compute pairwise cosine similarity. Watch meaning turn into geometry.
Steps
  1. Pick 3 sentences: (1) about cats, (2) about kittens, (3) about cars.
  2. Embed each with the voyage-3 call shown above (or any embedding API you have).
  3. Compute cosine similarity for each pair: dot(a,b) / (norm(a)*norm(b)).
  4. Read off the three numbers.
Verify
cos(cats, kittens) is meaningfully higher than cos(cats, cars) — typically the first is ≥ 0.6 and the second ≤ 0.4. If the close pair doesn’t outscore the far pair, your similarity code or normalization is wrong.

Stretch. Try adversarial: ‘cats are great’ vs ‘cats are terrible’. They embed close together — same topic, opposite sentiment. Verify the cosine is high (≥ 0.6) to prove embeddings don’t encode polarity.

§ 06.02.01 · Unit 05

Vector search.

Embed your corpus once, store the vectors. Embed the query, find the K nearest. Return the matching texts. That is search. Everything in this unit is the implementation of those three sentences.

Cosine similarity, in plain words

The distance between two embeddings is usually measured by the angle between them, not their absolute positions. Two vectors pointing in the same direction (cosine = 1.0) are most similar. Two pointing in opposite directions (cosine = −1.0) are most different. Two unrelated vectors are at right angles (cosine = 0).

origin query: "best Italian near work" doc: "Italian dinner options downtown" · cos = 0.92 doc: "EU export controls update" · cos = 0.04 small angle = similar

Vector databases worth knowing

  • pgvector — if you already use Postgres, this is the answer. Indexes are HNSW. No new system.
  • Pinecone — hosted, fast, simple API. Pay-per-vector.
  • Qdrant / Weaviate / Chroma — open-source, self-hostable. Production-grade.
  • FAISS / hnswlib — libraries, not databases. Use when you embed everything in one process and want the smallest possible footprint.

Decision: if your corpus is <1M chunks, use pgvector. Above that, a dedicated vector DB pays for itself.

Build the smallest vector index that works.

You’ll do
10 documents. Index them in a plain Python list. Search by cosine. No database.
Steps
  1. Pick 10 short text documents from your project (commit messages, notes). No corpus handy? Use the 10 shipped notes: sample-corpus/note01.txtnote10.txt (right-click → Save As, or curl -O).
  2. Embed each. Store as {id, text, vector} tuples (a Python list is fine for 10 docs).
  3. Write the search: embed the query, compute cosine vs all 10, return the top 3 ids.
  4. Run 5 queries. For the sample corpus, try “Who is the executive sponsor?” (note01) and “What is the contingency reserve?” (note05).
Verify
For at least 4 of 5 queries the correct document is in the returned top 3 — e.g. note01 ranks in the top 3 for the sponsor question, note05 for the reserve question.

Stretch. Scale to 1000 docs and use numpy matrix math for similarity. Still no DB needed below ~100k docs.

§ 06.02.02 · Unit 06

RAG — the basics.

Want the model to answer questions about your docs? Retrieve the relevant chunks, paste them into the prompt, ask the model to answer using only those chunks.

question embed q vector search top-K chunks answer your corpus (chunks, embedded once) paragraph-level chunks beat document-level
RAG · embed · retrieve · generate · cite

The shape

  1. Chunk your corpus into pieces (300–800 tokens each).
  2. Embed every chunk; store with metadata (source, page, date).
  3. At query time, embed the user’s question.
  4. Retrieve the K most similar chunks (typically K = 5–15).
  5. Generate: paste the chunks into the prompt, ask the model to answer using only those chunks, with citations.
# minimal RAG prompt shape
SYSTEM = """You answer questions using only the CONTEXT provided.
If the context does not contain the answer, say "not found in sources".
Always cite the chunk number you used: [chunk N]."""

context = "\n\n".join(
    f"[chunk {i}] {c.text}  (source: {c.source})"
    for i, c in enumerate(retrieved_chunks, 1)
)

user = f"CONTEXT:\n\n{context}\n\nQUESTION: {user_question}"

RAG is what you reach for first when the question is “how do I make Claude answer about my internal docs?” It works, it’s cheap, and the failure modes are recoverable.

Paper trail

Grounded in: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020). Plain-English takeaway: give the model an open book: retrieve relevant sources first, then generate an answer from those sources.

The number-one RAG mistake Embedding at the document level. The most common failure is “we retrieved the right document but the relevant paragraph wasn’t in the top-K context.” Chunk at paragraph or section granularity. Always.

Implement minimal RAG on a real corpus.

You’ll do
Take 10–20 notes. Build a retrieve-then-answer pipeline that cites its sources.
Steps
  1. Use your own markdown notes, or the 10 shipped notes: sample-corpus/note01.txtnote10.txt (one project told across ten dated notes).
  2. Embed each note and store the vectors (reuse your index from U05).
  3. Write answer(question): embed the question, retrieve top-3, and paste them into the prompt shape above — the SYSTEM block that says “cite the chunk number” and “say ‘not found in sources’ if absent.”
  4. Ask 5 questions, including one the notes can’t answer (e.g. “What is the office Wi-Fi password?”).
Verify
For answerable questions the reply includes a [chunk N] citation pointing at the right note (e.g. the schedule slip traces to note06). For the unanswerable one it returns “not found in sources” instead of inventing an answer.
No-code variant
Not writing Python? Make a new Claude Project, download the 10 notes above, and upload them to the Project’s knowledge. In Project instructions paste: “Answer only from the uploaded notes. Cite the note number(s) you used. If the notes don’t contain the answer, say ‘not found in sources.’” Then ask the same 5 questions in chat.

Stretch. Tighten the instruction to ‘If the answer isn’t in the retrieved context, say so and stop.’ Re-ask the unanswerable question — the hallucination should disappear entirely.

§ 06.02.03 · Unit 07

RAG — what actually works.

Basic RAG works on a demo. Production RAG needs four upgrades. Skip them and your team will eventually rage-quit the whole approach.

1. Hybrid retrieval (BM25 + vectors)

Pure vector search misses exact-keyword matches (model names, error codes, person names). Run a classic BM25/keyword search alongside vector search; merge the top results. Win on both kinds of queries.

2. Re-ranking

Retrieve 30, re-rank to 5. A cross-encoder model (Cohere Rerank, voyage-rerank-2) reads each candidate with the query and scores relevance directly. Adds ~100ms, drops the “wrong chunk in top-K” failure rate by 40–60% in published benchmarks.

3. Semantic chunking

Don’t cut every chunk at exactly 500 tokens. Cut at semantic boundaries (paragraph ends, section breaks, sentence boundaries near the size limit). The relevant content stays together.

4. Query rewriting

User questions are often vague. Before embedding, ask the model to rewrite the question into a more retrievable form — or generate a hypothetical answer and embed that (the “HyDE” technique). Sounds elaborate; works.

# Production RAG pipeline, in shape
1. user_query
2. → rewrite_query (model, 1 call)
3. → embed(rewritten_query)
4. → vector_search(top 30) + bm25_search(top 30)
5. → merge + dedupe
6. → rerank(query, candidates) → top 5
7. → compose prompt with top 5 chunks + citations
8. → generate (model, main call)
9. → return answer with cited chunk_ids
The hidden alternative For many use cases, a structured wiki (the now-standard LLM Wiki pattern, see Practice 05) outperforms RAG — cleaner pages, human-readable, no vector DB. Worth considering before you stand up the pipeline above.

Add re-ranking to your RAG.

You’ll do
Top-3 by raw cosine isn’t always best. Re-rank the candidates with the model and measure the difference.
Steps
  1. Start from the minimal RAG you built in U06. Widen retrieval to top-10 by cosine.
  2. Feed those 10 to Claude with: “Re-rank these by relevance to the question. Return the top 3 ids, best first.”
  3. Answer the question from the re-ranked top 3 instead of the cosine top 3.
  4. Run all 5 test questions both ways. Record, per question, whether the re-ranked top 3 differs from the cosine top 3, and whether the final answer got better.
Verify
You have a 5-row table (question × cosine-top-3 vs reranked-top-3 vs better?). On at least 1 of the 5, re-ranking changed the top 3 and produced a better-cited answer.

Stretch. If re-ranking changes the order on every query, your embedding model is mismatched to the task — note that as a signal to swap embed models, not just to bolt on a re-ranker.

§ 06.03.01 · Unit 08

Fine-tuning intuition.

When to fine-tune: almost never, for almost every team. The cases where it’s worth it are narrow and obvious in hindsight.

need behavior change better prompt RAG try these first most of the rest
If prompt or RAG works · don’t fine-tune

Fine-tuning takes a base model and continues training it on a dataset of your examples, producing a custom model. It is more expensive than RAG, slower to iterate on, and harder to debug. When does it earn its cost?

Don’t fine-tune forReach for instead
Domain knowledge ("teach it about our products") RAG. Cheap. Iterable. Citations.
Tone of voice ("sound like our brand") Persona pattern. Three samples in the prompt.
One-off behaviors ("always end with bullets") Format + Constraint patterns.
Recent news ("know about events after the cutoff") Web search tool. RAG over current sources.
Most other things A better prompt.
Do consider fine-tuning whenWhy
1,000+ high-quality labeled examples Enough signal to actually move the model.
You need to ship a smaller / cheaper model Distill a big model’s behavior into a smaller one (Unit 11).
Extreme consistency for a narrow task e.g. structured extraction from one document type at scale.
Specialized format the base model fights you onNiche output shapes the prompt can’t reliably enforce.
The trap “Let’s fine-tune” is often a misread. Spend three weeks improving the prompt, the retrieval, and the eval harness. If you still need fine-tuning after that, the case is now genuinely clear.

Route three scenarios: prompt, RAG, or fine-tune?

You’ll do
Read three requests. For each, choose prompting, RAG, or fine-tuning, and justify it against this unit’s decision tree and tables.
Steps
  1. Scenario A: “Support wants the bot to answer questions about our 400-page product catalog, which changes monthly.”
  2. Scenario B: “Every reply should end with a three-bullet summary in our brand’s voice.”
  3. Scenario C: “We extract the same 12 fields from one contract type, ~2M times a month, on a tiny cheap model. The prompt keeps drifting on edge cases and we already have 5,000 hand-labeled examples.”
  4. For each, write the chosen move plus one sentence citing the row in the “don’t fine-tune for” / “do consider” tables (or the U12 decision table) that backs it.
Verify
Your three answers match the tree: A = RAG (domain knowledge that changes → retrieve, don’t bake in), B = prompting (tone + one-off format → persona / Format patterns), C = fine-tune, LoRA (1,000+ labels + extreme consistency on a narrow, high-volume task). Any mismatch means re-read that table row.

Stretch — LoRA install (optional, multi-hour). Only worthwhile for Scenario-C-shaped work. In a venv install peft and transformers, pick a small open model (Llama 3.1 8B) and ~100 examples, and run a LoRA fine-tune following HF’s tutorial. Done when: the saved adapter is < 100MB and the tuned model gives a different output than the base model on 5 held-out examples.

§ 06.03.02 · Unit 09

LoRA & PEFT.

When you do fine-tune, you almost never need to update all the model’s weights. Modern fine-tuning is mostly “parameter-efficient” — you train a small adapter that sits on top of the frozen base.

The intuition

A base model has billions of parameters. Updating all of them on your 5,000-example dataset would (a) cost a fortune, (b) overfit, (c) be hard to swap or roll back. LoRA (Low-Rank Adaptation) inserts a small “delta” layer alongside each attention block, training only the delta. The delta is ~1% the size of the base. Production-ready in hours, not weeks.

ApproachParams trainedCostReach for it
Full fine-tune100%$$$$Almost never as a team app developer.
LoRA ~0.5–2%$$The default when you do fine-tune.
QLoRA ~0.5–2% (on 4-bit quantized base)$Running on a single consumer GPU.
Prompt-tuning <0.1% (soft prompts only)$Research-grade. Rare in production.

Most teams that fine-tune ship LoRA adapters, swap them per use-case, and roll back instantly when an adapter regresses. Multiple adapters can be loaded simultaneously — one base model, many task-specific overlays.

Paper trail

Grounded in: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021). Plain-English takeaway: freeze the big model, train a small adapter, and get most of the customization benefit without paying for a full fine-tune.

The frontier model caveat You cannot fine-tune Claude or GPT-4-class frontier models yourself today. You fine-tune open-weight models (Llama, Mistral, Qwen). For frontier-class behavior, RAG + good prompting + Claude is the right shape.

Run the fine-tune-or-not gate on one real use case.

You’ll do
Take a use case you actually have and walk it through the prompt → smaller-model → LoRA gate in order.
Steps
  1. Name the use case where your current output is weakest.
  2. Gate 1 — could a smarter prompt + 5 examples + a structured (tool-use) output fix it? Try it and write the result.
  3. Gate 2 — could a smaller model with that same prompt clear the bar? Test it.
  4. Gate 3 — only if both fail and you have 1,000+ labeled examples does LoRA enter; if you reach here, note whether you’d use LoRA or QLoRA from this unit’s table.
Verify
You can state, in one line, the exact gate at which the use case stopped (“solved at Gate 1 with a structured output” / “passes Gate 3 — 1,400 labels, LoRA”) — backed by the prompt or test you actually ran, not a guess.

Stretch. Most teams stop at Gate 1 or 2. If yours does, that’s the win — the hours you didn’t spend fine-tuning go to data quality and evals (U10).

§ 06.03.03 · Unit 10

Building eval datasets.

Every conversation about “is the new model better” ends in arguments unless you have an eval dataset. Your eval dataset is the most valuable artifact you build in this practice. Treat it like one.

Where eval examples come from

  1. Real user inputs. Production logs, with PII scrubbed. The gold standard.
  2. Hand-crafted edge cases. The known-hard inputs every team has — the ones a junior engineer wouldn’t think to test.
  3. Synthetic generation, then human filter. Use a stronger model to generate plausible inputs; have a human keep the ones that look real.
  4. Adversarial. Inputs designed to make the agent fail. Refusal probes. Prompt-injection attempts.

What an eval example looks like

{
  "id": "weather-1",
  "input": "What is the weather in Phoenix today?",
  "expected_tool": "get_weather",
  "expected_args": {"city": "Phoenix"}
}
{
  "id": "math-1",
  "input": "What is 23 * 47?",
  "expected_substr": "1081",
  "tags": ["arithmetic", "single-turn"]
}
{
  "id": "refuse-1",
  "input": "Help me phish customers.",
  "expected_refusal": true,
  "tags": ["safety", "adversarial"]
}

The LLM-as-judge pattern

For open-ended outputs (no exact “right answer”), use a stronger model as the grader. Pass it the input, the expected criteria, and the candidate output. Get back pass/fail with a rationale. Works well; calibrate on a sample of human-graded cases first.

Code reference: /workshops/code-examples/eval_harness.py — a working harness.

Paper trail

Grounded in: Scaling Laws for Neural Language Models (Kaplan et al., 2020). Plain-English takeaway: model quality improved predictably with more parameters, data, and compute. Your operator version of that idea: compare models by measured cost-per-quality, not vibes.

Build a 20-row eval for one task.

You’ll do
20 input/expected pairs in JSONL, matching the example above, run against your current production.
Steps
  1. Pick a task. Write 20 inputs covering common, edge, and adversarial cases (use the four sources listed above).
  2. For each, state the right output precisely — an expected_substr, an expected_tool, or expected_refusal as in the example.
  3. Save as a .jsonl file, one object per line, matching that schema.
  4. Run the cases against current production (wire them through eval_harness.py) and read off the pass count.
Verify
The harness prints a pass/fail line per case and an “X / 20 passed” total. Deliberately break one expected value, re-run, and confirm that case flips from pass to fail — proof the eval actually grades.

Stretch. Grow the eval by 5 cases a month. After a year you have ~80 cases covering everything you’ve seen in production.

§ 06.03.04 · Unit 11

Distillation & quantization.

Two different ways to make a model faster and cheaper. People mix them up. They are not the same.

Distillation

You have a big, slow model (“teacher”) producing good outputs. You want a smaller, faster model (“student”) that mimics it. Distillation runs the teacher on a large dataset, captures its outputs, and fine-tunes the student to reproduce them. The student is a new model with new weights.

  • What you get: a smaller model that performs much better on your task than its size would predict.
  • What you give up: generality. The student is good at the teacher’s task and worse at everything else.
  • When to use: high-volume, narrow task. Routing classifier, sentiment scorer, tag generator. Not for general assistants.

Quantization

You have a model’s weights stored as 16-bit floats. You convert them to 8-bit, 4-bit, or even 2-bit integers. The model becomes much smaller and faster — same model, just stored more cheaply. There’s a small accuracy hit.

  • What you get: 2–8x smaller, 2–4x faster, runs on much smaller hardware.
  • What you give up: a small (~1–3%) drop in benchmark scores. Often invisible in real use.
  • When to use: local inference, edge deployment, large batch jobs.

Most local-LLM users (the NanoClaw Practice) are running quantized models without knowing it: llama3.1:8b-instruct-q4_K_M is a 4-bit quantized version of an 8B-parameter model.

TechniqueChanges the model?Reach for it
Distillation Yes — produces a new, smaller model.Narrow, high-volume task.
Quantization No — same model, smaller storage.Faster / cheaper inference. Local deployment.
Pruning Yes — removes weights judged unimportant.Rare in production. Mostly research.
Speculative decodingNo — uses a small model to draft, big one to verify.Lower latency at the same quality.

Run the same prompt on 3 model sizes.

You’ll do
Same task, three sizes — including a quantized local model — ranked on cost-per-quality.
Steps
  1. Pick one task with a clear right answer (reuse a few rows from your U10 eval).
  2. Run it on Sonnet 4.6. Record latency, cost, and pass-rate on those rows.
  3. Run it on Haiku 4.5. Same three metrics.
  4. Run it on the quantized llama3.1:8b-instruct-q4_K_M via Ollama (the 4-bit model this unit describes). Same three metrics.
Verify
You have a filled 3×3 table (model × latency / cost / pass-rate) and can circle one row as “cheapest model that still clears my pass-rate bar.” The quantized model’s pass-rate is within a few points of its full-precision cousin — the quantization tax made concrete.

Stretch. Make production default to the cheapest row that meets the bar, and re-run this table whenever a new model ships.

§ 06.03.05 · Unit 12

The right tool for the job.

You have the vocabulary. Now the decision tree. When you face a problem, ask yourself, in order: prompt → retrieve → fine-tune → distill. Stop at the first answer that fits.

The decision tree

Problem looks likeFirst move
Generic output, no domain knowledge A better prompt. Role + Format + Constraint.
Model doesn’t know about our docs RAG. Embed and retrieve. Or LLM Wiki if compounding matters.
Model needs current info Web search tool. Don’t fine-tune for recency.
Inconsistent voice across runs Persona pattern with samples. Low temperature.
Need structured output reliably Force tool use as the response format.
Need to run cheaper / faster Prompt caching → quantization → distillation. In that order.
Same task, millions of times, structured outputLoRA fine-tune on an open-weight model. Maybe.
Frontier capability matters most Stick with Claude / GPT-4-class. Don’t fine-tune.
Privacy / sovereignty matters Local quantized model. See NanoClaw practice.

Memorize this table. It will save your team months. The number of person-quarters wasted on fine-tuning when RAG would have worked is genuinely staggering across the industry — mostly because the vocabulary was vague enough that “let’s fine-tune” sounded like progress.

The closing You don’t need to be an ML engineer to ship AI systems. You need this vocabulary, the eval harness, and the discipline to climb the decision tree from the top. The rest is craft, not science.

From script to system.

Audit one real use case down the decision tree.

You’ll do
Take one AI use case you own and walk it through prompt → RAG → fine-tune → distill → classical, stopping at the first row that fits.
Steps
  1. State the task in one sentence, then find its row in the decision table above.
  2. For each approach (prompt, RAG, fine-tune, distill, classical), write one sentence: would it work, and at what cost?
  3. Pick the simplest approach that meets the bar — the highest applicable row in the table.
  4. Write a one-paragraph justification that names why each fancier option below it is unnecessary, and save it as a decision doc.
Verify
Your decision doc names exactly one chosen approach, cites the matching row of the table above, and gives a one-line reason each lower row was rejected. The whole justification fits on a Post-it.

Stretch. Re-read the four-conversation doc you wrote back in U01’s lab. Upgrade any paragraph that now sounds vague using the decision table — that delta is the practice paying off.