Pradhya Practice 12 · Context Engineering Deep Dive Builder

The discipline that took over from prompt engineering.

Context engineering is what you do when prompts aren’t enough. It is the deliberate management of everything the model sees across many turns — system prompts, tool defs, history, files, retrieved chunks, and the cache hints that keep it all cheap.

Sources: Anthropic · Effective context engineering for AI agents, Anthropic · Harnesses for long-running agents, Galileo · Context engineering deep dive, Weaviate · Context engineering & memory.

Audience

Builders running long-horizon agents

Length

3 sessions · 90 min each

Walk-away

The context budget audit checklist

Prereq

Prompt Engineering or Agents

What you’ll be able to do by the end

Diagnose context rot vs context clash from agent traces
Apply the 4 moves (offload, retrieve, isolate, reduce) to a real agent
Implement compaction with a PreCompact hook that persists what’s about to be lost
Cut your agent’s input cost by 90% via prompt caching

Run-along: the context budget audit checklist at the bottom is your scorecard. Fill in one section as you finish each day — OFFLOAD/RETRIEVE after Day 1’s moves, ISOLATE after Day 2’s sub-agents, REDUCE after Day 2’s compaction — so by the close you have a filled-out audit for one real agent, not a blank template.

§ 12.01.01 · Unit 01

From prompt engineering to context engineering.

Prompt engineering: craft the perfect instruction. Context engineering: optimize everything the model sees on every turn. The field shifted because long-running agents stress the second much more than the first.

The shift · from one instruction to the whole window

The Anthropic line: “Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.”

Sort one prompt into prompt vs context problems.

You’ll do

Read a long system prompt and label every line as a wording problem (fix with prompt engineering) or a window problem (fix with the four moves) — the distinction this unit draws.

Steps

Open bloated_prompt.txt (no production agent yet? use ours — a real 1,100-word IT-support system prompt). Right-click → Save As, or open it and read.
Go line by line. Tag each line P (prompt problem: vague instruction, missing example, wrong tone) or C (context problem: duplicated, stale, belongs in a file/tool, eats tokens with no signal).
Count each tag. In this prompt the C’s dominate — the same persona is restated five-plus times.
Write one sentence: which discipline — prompt or context engineering — would fix this prompt faster, and why.

Verify

You have a count of P vs C lines, and C > P. Your one-sentence verdict names context engineering as the bigger lever — that’s the shift this unit is about.

Stretch. Do the same pass on a real prompt you ran today. Tally P vs C; notice which discipline your own work actually needs more of.

§ 12.01.02 · Unit 02

The window, anatomy.

Five slots, each one paid for in tokens, each one the responsibility of a different layer of the harness.

Five slots in the window · each one paid for · two are cacheable

Map one prompt onto the five slots.

You’ll do

Take a real prompt’s window and assign every part to one of the five slots above, then size each slot as a % of the total.

Steps

Use a prompt you ran today (system + tools + messages + last response). No production agent yet? Use bloated_prompt.txt as the system slot and invent a one-line user turn.
Color-code into the five slots from the diagram: system, tools, messages, files, current turn.
Estimate tokens per slot (words × 1.3 ≈ tokens, or paste into the count_tokens endpoint). Compute each as a % of the total.
Mark which two slots are cacheable (system, tools) and which one grows unboundedly (messages).

Verify

Every token in the window lands in exactly one of the five slots, the five percentages sum to ~100%, and you can name the single largest slot. Anything that fits no slot is dead weight to cut.

Stretch. Repeat on your 3 most-used prompts. The slot that’s consistently largest is your first compression target.

§ 12.01.03 · Unit 03

Context rot & context clash.

Two failure modes most teams haven’t named. Once you have names, you can diagnose them.

Rot · graceful degradation · Clash · sudden derailment

The Galileo blog formalized these terms; Anthropic's posts validate them. The fix is the four moves in the next unit.

Mark where a transcript rots and clashes.

You’ll do

Read a degrading 20-turn transcript and pinpoint, turn by turn, where it shows rot vs clash — the two failure modes this unit names.

Steps

Open rotted_transcript.txt (no degraded agent of your own? use ours — a planning agent that sets four constraints C1–C4 up front, then loses them). Right-click → Save As, or just read it.
Track the four constraints (PostgreSQL, $5k cap, summaries <80 words, March 14 launch). Note the first turn each one is violated.
For every violation, label it rot (the agent quietly forgets / blurs as the window grows) or clash (a later turn contradicts an earlier instruction and the agent picks the wrong one).
Write the smallest fix for each: compact at turn N, drop a stale tool result, or re-pin the constraint.

Verify

You named a specific turn number for each of the four broken constraints and tagged each as rot or clash. Cross-check: at least one is clash (a direct contradiction) and at least one is rot (a silent forget) — the transcript contains both.

Stretch. Build a rot detector: a separate prompt that re-reads the trace and flags any C1–C4 violation automatically. Run it on the file and check it catches the turns you found by hand.

§ 12.01.04 · Unit 04 · The framework

The four moves.

Good context engineering comes down to four moves. Every advanced technique is one of these four.

Four moves · every advanced technique is one of these

Pick the right move for one bottleneck.

You’ll do

Find the worst section of a too-long prompt and apply exactly one of the four moves — the one that fits the bottleneck.

Steps

Pick a prompt that’s too long. No production agent yet? Use bloated_prompt.txt.
Identify the bottleneck section: longest, least-changing, or most-repeated.
Match it to one move — stable content → offload to cache; on-demand content → retrieve; multi-task → isolate; bloated formatting/repetition → reduce. Apply just that one. Re-count tokens.
Re-run the prompt on 5 historical inputs (or 5 sample IT tickets if using ours). Compare quality.

Verify

Tokens drop ≥30% and quality holds within 5% on your 5 inputs. You can name which of the four moves you used and why it was the right fit.

Stretch. Apply a second move on top. The four moves stack — offload then reduce often compounds.

§ 12.02.01 · Unit 05 · Move 1

Offload to external memory.

Anthropic calls it “structured note-taking.” The agent writes notes to disk (or to a database) and retrieves them on demand — instead of keeping everything in the window.

Working memory ↔ external memory · the agent picks what to carry

Concrete implementations:

NOTES.md file — simplest. The agent reads/writes a markdown file. Used by Claude Code subagents.
The LLM Wiki pattern — structured pages with cross-links. Used by Practice 05 (NanoClaw).
A SQLite / Postgres database — when notes need queries, joins, indexes.

Offload 3 stable facts to tool calls.

You’ll do

Pull 3 things out of a system prompt and make them lookup-on-demand — the offload move in its simplest form.

Steps

Pick 3 facts in the system prompt that change rarely but eat tokens (user prefs, schema, reference tables). No production agent yet? Use bloated_prompt.txt — its # ESCALATION QUEUES (REFERENCE) block and KB-article list are exactly this kind of offloadable reference data.
Wrap each in a tool: get_user_prefs(), get_schema(), get_escalation_queues(), etc.
Remove them from the system prompt; let the tool serve them.
Run the agent. If it doesn’t call a new tool when it should, the description needs keywords matching when the model would reach for it.

Verify

System-prompt token count drops by the size of the 3 offloaded blocks (measure before/after). The tools fire only on turns that need them, and answer quality on 5 inputs is unchanged.

Stretch. If a tool ends up called every single turn, that fact is hot — put it back in the system prompt. Offload is for content used some of the time.

§ 12.02.02 · Unit 06 · Move 2

Retrieve, don’t front-load.

If you have a wiki of 800 pages, don’t paste 800 pages into the prompt “just in case.” Give the agent a search tool. Let it pull what it needs, when it needs it.

Dynamic retrieval beats “just in case” every time

Implementation note: pair retrieval with the wiki pattern. The agent calls search_wiki(query); the tool reads index.md first to pick which page, then reads that page. Two-stage retrieval saves you from vector-DB infrastructure for most personal-scale projects.

Replace a front-loaded corpus with a search tool.

You’ll do

Stop pasting a whole corpus into the prompt; give the model a search tool and let it pull only the chunk it needs.

Steps

Pick a long document you currently paste in full. No corpus handy? Use the 10-note sample-corpus (note01–note10) plus haystack.txt — a ~2,900-word doc with one planted fact.
Baseline (front-loaded): paste the whole haystack into context and ask the question only the planted sentence answers (see haystack_key.txt for the exact question and answer — don’t show the key to the model). Record the input token count.
Retrieved: add a tool search_doc(query) that keyword-greps the corpus and returns the top-3 chunks. Replace the full paste with the tool; ask the same question.
Compare: input tokens, and whether both runs return the correct answer (“$4,250 per quarter”).

Verify

The retrieved run returns the same correct answer as the front-loaded run while sending ≥70% fewer input tokens (it loads 3 chunks, not the whole doc). Confirm the answer string matches the key.

Stretch. Add re-ranking. Move the planted fact to a different note and confirm search_doc still surfaces it in the top-3 — the first hit isn’t always the right one.

§ 12.02.03 · Unit 07 · Move 3

Isolate via sub-agents.

Specialized agents handle focused tasks and return condensed summaries (typically 1,000-2,000 tokens) to a coordinating agent. The parent never sees the detail.

Detail stays in sub-agents · only summaries reach the parent

Split one agent into two isolated contexts.

You’ll do

Take one agent doing multiple jobs and break it into focused sub-agents, each carrying only the context its job needs.

Steps

Pick an agent that does several jobs. No production agent yet? Use support_bot_system_prompt.txt — one bot that triages, looks up orders, AND moves money.
List the jobs. For the support bot: read-only triage (lookup_order, track_shipment) vs money actions (issue_refund, issue_store_credit, start_return, escalate).
Write two trimmed system prompts, each with ONLY its job’s tools and rules. Count tokens for each — both should be well under the monolith.
Add a one-line dispatcher: triage runs first; it hands off to the money sub-agent only when an action is needed.

Verify

Each sub-agent’s system prompt is smaller than the monolith’s (measure tokens), and each exposes a strict subset of tools — the triage agent literally cannot issue a refund. The two prompts together cover every original job with no tool dropped.

Stretch. If both sub-agents would need 80% of the same context, you actually have one agent — merge back. Isolation only pays when the contexts genuinely differ.

§ 12.02.04 · Unit 08 · Move 4

Reduce via compaction.

When the window fills, summarize the past. The advice that pays off: “Maximize recall to ensure your compaction prompt captures every relevant piece of information from the trace.”

The trigger

Watch a context-utilization metric. When you cross ~70% of the cap, compact. Don’t wait for 100% — you need headroom to write the new response.

The prompt

You are about to summarize the conversation trace below.

Preserve everything that future turns might need:
- All facts and decisions
- All commitments by either side
- Any user preferences expressed
- Any open questions

Drop everything else: pleasantries, false starts, repeated explanations.

Output: a single message under 500 tokens, structured as
<summary>facts and decisions</summary>
<commitments>...</commitments>
<preferences>...</preferences>
<open>...</open>

The trace:
[paste trace]

The safeguard

Before the harness compacts, fire a PreCompact hook that persists everything to disk. If the summary loses a fact, you can retrieve it. Compaction is irreversible in context; not on disk.

Compact a 20-turn trace without losing a constraint.

You’ll do

Run the compaction prompt above on a real degraded transcript and prove it preserves the load-bearing facts — the whole point of “maximize recall.”

Steps

Get a long trace. No agent of your own? Use rotted_transcript.txt — 20 turns that establish four hard constraints (C1 PostgreSQL, C2 $5k cap, C3 summaries <80 words, C4 March 14 launch).
Copy the compaction prompt above; paste the full transcript where it says [paste trace]. Run it.
Read the <summary> / <commitments> / <open> output. Check it against the four constraints.
Count the output tokens — the prompt asks for under 500.

Verify

All four constraints (C1–C4) survive verbatim in the summary, and the summary is under 500 tokens. If any constraint is dropped, your compaction prompt under-recalls — add “preserve every numbered constraint” and re-run until 4/4 survive.

Stretch. Add the PreCompact safeguard: write the full trace to disk before compacting, so a dropped fact is still recoverable. Confirm the on-disk copy still has all four constraints even if the summary missed one.

§ 12.03.01 · Unit 09

Tiered memory architecture.

Production agents combine all four moves into a single memory architecture with multiple tiers. Each tier has a different cost, latency, and capacity profile.

Four tiers · cost goes up as you move down · capacity goes up too

Sort one agent’s context into the four tiers.

You’ll do

Place every piece of context an agent uses into L1–L4 of the architecture above, then fix anything sitting in the wrong tier.

Steps

List everything the agent carries: system prompt, recent turns, tool results, retrieved docs, persistent memory. No production agent yet? Inventory the support bot — persona/tools/rules, the live chat, order lookups, the returns policy doc.
Tag each into the diagram’s tiers: L1 context window (always in), L2 compacted summary (this session), L3 wiki/notes (structured, on demand), L4 DB/vector store (queried).
Flag anything in L1 used <50% of the time — it should drop to L3/L4 and be retrieved.
Flag anything in L4 that’s never queried — it can be archived or deleted.

Verify

Every item sits in exactly one tier, your always-in L1 set is <30% of the window’s budget, and you named at least one item to demote from L1. The four tiers each have a clear retrieval path.

Stretch. Schedule a monthly re-tiering review. Usage patterns drift — last quarter’s hot doc is this quarter’s cold archive.

§ 12.03.02 · Unit 10

Prompt caching for cost.

The single most under-used feature for production agents. Mark the system prompt and tool defs as cached; subsequent calls pay 10% of the cost on the cached prefix.

Cache the stable prefix · reads cost 0.1x base input

Validated against official Claude API docs: prompt caching references `tools`, `system`, and `messages` in that order; 5-minute cache writes are 1.25x base input and cache reads are 0.1x. Source: Prompt caching.

client.messages.create(
    model="claude-sonnet-4-6",
    system=[{
        "type": "text",
        "text": LARGE_STABLE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}   # ← cache this part
    }],
    tools=tools,                                  # also cached automatically when stable
    messages=messages,
)

Turn on caching and watch the read tokens.

You’ll do

Add cache_control to a stable prefix and confirm from the usage fields that repeat calls actually hit the cache.

Steps

Take a prompt that runs >5x/day. No production prompt? Use bloated_prompt.txt (or the support bot) as the stable system prefix — it’s long enough to clear the cache minimum.
Add cache_control: {type: "ephemeral"} at the system/tools boundary, exactly as the code block above shows.
Run the prompt twice with the same prefix.
On call 2, read resp.usage.cache_read_input_tokens.

Verify

cache_read_input_tokens > 0 on call 2 (it’s 0 on call 1, the write). Across 10 calls the cached prefix gives a hit rate >80% — reads bill at 0.1x base input.

Stretch. If hit rate is <50%, the prefix isn’t actually stable — something per-request (a timestamp, a user id) crept above the cache boundary. Find it and move it below.

§ 12.03.03 · Unit 11

Measuring context efficiency.

You can’t improve what you don’t measure. Four numbers worth tracking per agent.

Metric	Why
Average context utilization	Are you running close to the cap? When and why?
Cache hit rate	Should be 80%+ on the system prompt. If it isn’t, your “stable” prefix isn’t.
Compaction frequency	How often does the harness compact per session? Cluster shows where context grows fastest.
Sub-agent dispatch rate	Are you isolating expensive work? If not, your main agent is carrying too much.

These plug into the eval harness from Practice 04 Unit 02 — the shipped eval_harness.py already reads input_tokens / output_tokens per run and prints cost, latency, and pass-rate. Add the four numbers above as columns on the dashboard.

Capture a token baseline for one agent.

You’ll do

Log per-call token usage across a run so a future prompt edit that bloats context trips a threshold — the “average context utilization” metric, made real.

Steps

Wrap every Anthropic call to record resp.usage.input_tokens, output_tokens, and cache_read_input_tokens. No agent of your own? Run eval_harness.py over test_cases.jsonl — it logs input/output tokens and cost per case to results.jsonl (add the cache_read_input_tokens field yourself when you wrap your own calls).
Append each call’s numbers to a file (or your existing telemetry: Datadog, Prometheus).
Compute p50, p90, p99 input tokens across the run.
Pick a threshold: p99 input > 2× your p50 = a regression worth a Slack ping.

Verify

You can read three numbers off the log — p50, p90, p99 input tokens — and you have one written threshold. Sanity check: deliberately pad the prompt by 2× and confirm the next run crosses your threshold.

Stretch. Add a token-per-task metric. A task that burns 10× the baseline tokens is a regression even when the output still looks fine.

§ 12.03.04 · Unit 12 · The close

The closing maxim.

When an agent “hallucinates” or “forgets”, nine times out of ten the failure is in the context window assembly, not in the model itself.

The whole discipline · in one sentence

Master this and your agents will outperform agents twice their size. Skip it and bigger models won’t save you.

Run a 4-move audit on one production prompt.

You’ll do

Pick the prompt that runs most often. Apply offload + retrieve + isolate + reduce in sequence.

Steps

Baseline: measure input tokens and quality on 10 representative inputs.
Apply offload: pull stable content into a cached prefix or tool. Re-measure.
Apply retrieve: replace any pasted documents with on-demand search. Re-measure.
Apply isolate: if the prompt does multiple jobs, split it. Re-measure.
Apply reduce: trim any remaining bloat. Re-measure.

Verify

By the end, input tokens drop ≥40% with quality within 5% of baseline on your 10 inputs. You can state which single move bought the most reduction.

Stretch. Turn the maxim into a standing rule: add a 3-question check to your PR template — (1) Is this the smallest prompt? (2) Was it benchmarked? (3) What was removed and re-tested? Verify it bites by running it on the next prompt change. And if quality drops >5% during the audit, the last move went too far — revert it. Diminishing returns is normal.

§ Walk-away · The context budget audit

The four moves on one page.

Run this checklist on every new agent BEFORE writing prompts. Most context bloat is preventable upstream — by the time you're token-pruning, you've already designed yourself into a corner. This audit catches that.

# CONTEXT BUDGET AUDIT — run before building any agent

## 1. OFFLOAD — what lives outside the context window?
- [ ] Long documents → file system, not the prompt
- [ ] Historical conversations → external memory store
- [ ] Reference data → DB / search, looked up only when needed
- [ ] Code / config → repo files Claude reads with the file tool
- [ ] User profile / preferences → CLAUDE.md, not every prompt

## 2. RETRIEVE — what loads only when relevant?
- [ ] RAG over docs: which retrieval, which embed model, top-K
- [ ] Tools that fetch: lookup, search, query — not paste-in-bulk
- [ ] Conditional context loading by user intent
- [ ] Slow-changing reference data: cached, not re-loaded

## 3. ISOLATE — which steps run in fresh sub-agent contexts?
- [ ] Long-running tasks split into sub-agents (each fresh)
- [ ] Read-heavy steps separated from synthesis steps
- [ ] Tool-calling loops bounded — sub-agent per N calls
- [ ] Each sub-agent gets ONLY the context it needs, nothing else

## 4. REDUCE — what gets compacted when context fills?
- [ ] Compaction prompt written + tested before you need it
- [ ] Trigger: at 70% capacity, not 95%
- [ ] What to keep: decisions, open questions, the spine
- [ ] What to drop: tool-call logs, scratch reasoning, fixed bugs

## RED FLAGS
- Prompt > 4k tokens before any tool call → you skipped offload
- Same data appears in multiple sub-agent contexts → fix isolation
- Context fills before the task completes → compaction missing
- Agent forgets the goal mid-task → ISOLATE step lost the spine

## OUTPUT
After running this audit, write 2 sentences:
1. The biggest move you'll make (OFFLOAD/RETRIEVE/ISOLATE/REDUCE)
2. The token budget you're targeting per typical run

← Previous Prompt Engineering Next practice → Claude Memory