The discipline that took over from prompt engineering.
Context engineering is what you do when prompts aren’t enough. It is the deliberate management of everything the model sees across many turns — system prompts, tool defs, history, files, retrieved chunks, and the cache hints that keep it all cheap.
Sources: Anthropic · Effective context engineering for AI agents, Anthropic · Harnesses for long-running agents, Galileo · Context engineering deep dive, Weaviate · Context engineering & memory.
- Diagnose context rot vs context clash from agent traces
- Apply the 4 moves (offload, retrieve, isolate, reduce) to a real agent
- Implement compaction with a PreCompact hook that persists what’s about to be lost
- Cut your agent’s input cost by 90% via prompt caching
Run-along: the context budget audit checklist at the bottom is your scorecard. Fill in one section as you finish each day — OFFLOAD/RETRIEVE after Day 1’s moves, ISOLATE after Day 2’s sub-agents, REDUCE after Day 2’s compaction — so by the close you have a filled-out audit for one real agent, not a blank template.
From prompt engineering to context engineering.
Prompt engineering: craft the perfect instruction. Context engineering: optimize everything the model sees on every turn. The field shifted because long-running agents stress the second much more than the first.
The Anthropic line: “Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.”
Sort one prompt into prompt vs context problems.
- Open bloated_prompt.txt (no production agent yet? use ours — a real 1,100-word IT-support system prompt). Right-click → Save As, or open it and read.
- Go line by line. Tag each line P (prompt problem: vague instruction, missing example, wrong tone) or C (context problem: duplicated, stale, belongs in a file/tool, eats tokens with no signal).
- Count each tag. In this prompt the C’s dominate — the same persona is restated five-plus times.
- Write one sentence: which discipline — prompt or context engineering — would fix this prompt faster, and why.
Stretch. Do the same pass on a real prompt you ran today. Tally P vs C; notice which discipline your own work actually needs more of.
The window, anatomy.
Five slots, each one paid for in tokens, each one the responsibility of a different layer of the harness.
Map one prompt onto the five slots.
- Use a prompt you ran today (system + tools + messages + last response). No production agent yet? Use bloated_prompt.txt as the system slot and invent a one-line user turn.
- Color-code into the five slots from the diagram: system, tools, messages, files, current turn.
- Estimate tokens per slot (words × 1.3 ≈ tokens, or paste into the
count_tokensendpoint). Compute each as a % of the total. - Mark which two slots are cacheable (system, tools) and which one grows unboundedly (messages).
Stretch. Repeat on your 3 most-used prompts. The slot that’s consistently largest is your first compression target.
Context rot & context clash.
Two failure modes most teams haven’t named. Once you have names, you can diagnose them.
The Galileo blog formalized these terms; Anthropic's posts validate them. The fix is the four moves in the next unit.
Mark where a transcript rots and clashes.
- Open rotted_transcript.txt (no degraded agent of your own? use ours — a planning agent that sets four constraints C1–C4 up front, then loses them). Right-click → Save As, or just read it.
- Track the four constraints (PostgreSQL, $5k cap, summaries <80 words, March 14 launch). Note the first turn each one is violated.
- For every violation, label it rot (the agent quietly forgets / blurs as the window grows) or clash (a later turn contradicts an earlier instruction and the agent picks the wrong one).
- Write the smallest fix for each: compact at turn N, drop a stale tool result, or re-pin the constraint.
Stretch. Build a rot detector: a separate prompt that re-reads the trace and flags any C1–C4 violation automatically. Run it on the file and check it catches the turns you found by hand.
The four moves.
Good context engineering comes down to four moves. Every advanced technique is one of these four.
Pick the right move for one bottleneck.
- Pick a prompt that’s too long. No production agent yet? Use bloated_prompt.txt.
- Identify the bottleneck section: longest, least-changing, or most-repeated.
- Match it to one move — stable content → offload to cache; on-demand content → retrieve; multi-task → isolate; bloated formatting/repetition → reduce. Apply just that one. Re-count tokens.
- Re-run the prompt on 5 historical inputs (or 5 sample IT tickets if using ours). Compare quality.
Stretch. Apply a second move on top. The four moves stack — offload then reduce often compounds.
Offload to external memory.
Anthropic calls it “structured note-taking.” The agent writes notes to disk (or to a database) and retrieves them on demand — instead of keeping everything in the window.
Concrete implementations:
- NOTES.md file — simplest. The agent reads/writes a markdown file. Used by Claude Code subagents.
- The LLM Wiki pattern — structured pages with cross-links. Used by Practice 05 (NanoClaw).
- A SQLite / Postgres database — when notes need queries, joins, indexes.
Offload 3 stable facts to tool calls.
- Pick 3 facts in the system prompt that change rarely but eat tokens (user prefs, schema, reference tables). No production agent yet? Use bloated_prompt.txt — its
# ESCALATION QUEUES (REFERENCE)block and KB-article list are exactly this kind of offloadable reference data. - Wrap each in a tool:
get_user_prefs(),get_schema(),get_escalation_queues(), etc. - Remove them from the system prompt; let the tool serve them.
- Run the agent. If it doesn’t call a new tool when it should, the description needs keywords matching when the model would reach for it.
Stretch. If a tool ends up called every single turn, that fact is hot — put it back in the system prompt. Offload is for content used some of the time.
Retrieve, don’t front-load.
If you have a wiki of 800 pages, don’t paste 800 pages into the prompt “just in case.” Give the agent a search tool. Let it pull what it needs, when it needs it.
Implementation note: pair retrieval with the wiki pattern. The agent calls search_wiki(query); the tool reads index.md first to pick which page, then reads that page. Two-stage retrieval saves you from vector-DB infrastructure for most personal-scale projects.
Replace a front-loaded corpus with a search tool.
- Pick a long document you currently paste in full. No corpus handy? Use the 10-note sample-corpus (note01–note10) plus haystack.txt — a ~2,900-word doc with one planted fact.
- Baseline (front-loaded): paste the whole haystack into context and ask the question only the planted sentence answers (see haystack_key.txt for the exact question and answer — don’t show the key to the model). Record the input token count.
- Retrieved: add a tool
search_doc(query)that keyword-greps the corpus and returns the top-3 chunks. Replace the full paste with the tool; ask the same question. - Compare: input tokens, and whether both runs return the correct answer (“$4,250 per quarter”).
Stretch. Add re-ranking. Move the planted fact to a different note and confirm search_doc still surfaces it in the top-3 — the first hit isn’t always the right one.
Isolate via sub-agents.
Specialized agents handle focused tasks and return condensed summaries (typically 1,000-2,000 tokens) to a coordinating agent. The parent never sees the detail.
Split one agent into two isolated contexts.
- Pick an agent that does several jobs. No production agent yet? Use support_bot_system_prompt.txt — one bot that triages, looks up orders, AND moves money.
- List the jobs. For the support bot: read-only triage (
lookup_order,track_shipment) vs money actions (issue_refund,issue_store_credit,start_return,escalate). - Write two trimmed system prompts, each with ONLY its job’s tools and rules. Count tokens for each — both should be well under the monolith.
- Add a one-line dispatcher: triage runs first; it hands off to the money sub-agent only when an action is needed.
Stretch. If both sub-agents would need 80% of the same context, you actually have one agent — merge back. Isolation only pays when the contexts genuinely differ.
Reduce via compaction.
When the window fills, summarize the past. The advice that pays off: “Maximize recall to ensure your compaction prompt captures every relevant piece of information from the trace.”
The trigger
Watch a context-utilization metric. When you cross ~70% of the cap, compact. Don’t wait for 100% — you need headroom to write the new response.
The prompt
You are about to summarize the conversation trace below. Preserve everything that future turns might need: - All facts and decisions - All commitments by either side - Any user preferences expressed - Any open questions Drop everything else: pleasantries, false starts, repeated explanations. Output: a single message under 500 tokens, structured as <summary>facts and decisions</summary> <commitments>...</commitments> <preferences>...</preferences> <open>...</open> The trace: [paste trace]
The safeguard
Before the harness compacts, fire a PreCompact hook that persists everything to disk. If the summary loses a fact, you can retrieve it. Compaction is irreversible in context; not on disk.
Compact a 20-turn trace without losing a constraint.
- Get a long trace. No agent of your own? Use rotted_transcript.txt — 20 turns that establish four hard constraints (C1 PostgreSQL, C2 $5k cap, C3 summaries <80 words, C4 March 14 launch).
- Copy the compaction prompt above; paste the full transcript where it says
[paste trace]. Run it. - Read the
<summary>/<commitments>/<open>output. Check it against the four constraints. - Count the output tokens — the prompt asks for under 500.
Stretch. Add the PreCompact safeguard: write the full trace to disk before compacting, so a dropped fact is still recoverable. Confirm the on-disk copy still has all four constraints even if the summary missed one.
Tiered memory architecture.
Production agents combine all four moves into a single memory architecture with multiple tiers. Each tier has a different cost, latency, and capacity profile.
Sort one agent’s context into the four tiers.
- List everything the agent carries: system prompt, recent turns, tool results, retrieved docs, persistent memory. No production agent yet? Inventory the support bot — persona/tools/rules, the live chat, order lookups, the returns policy doc.
- Tag each into the diagram’s tiers: L1 context window (always in), L2 compacted summary (this session), L3 wiki/notes (structured, on demand), L4 DB/vector store (queried).
- Flag anything in L1 used <50% of the time — it should drop to L3/L4 and be retrieved.
- Flag anything in L4 that’s never queried — it can be archived or deleted.
Stretch. Schedule a monthly re-tiering review. Usage patterns drift — last quarter’s hot doc is this quarter’s cold archive.
Prompt caching for cost.
The single most under-used feature for production agents. Mark the system prompt and tool defs as cached; subsequent calls pay 10% of the cost on the cached prefix.
Validated against official Claude API docs: prompt caching references `tools`, `system`, and `messages` in that order; 5-minute cache writes are 1.25x base input and cache reads are 0.1x. Source: Prompt caching.
client.messages.create( model="claude-sonnet-4-6", system=[{ "type": "text", "text": LARGE_STABLE_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"} # ← cache this part }], tools=tools, # also cached automatically when stable messages=messages, )
Turn on caching and watch the read tokens.
cache_control to a stable prefix and confirm from the usage fields that repeat calls actually hit the cache.- Take a prompt that runs >5x/day. No production prompt? Use bloated_prompt.txt (or the support bot) as the stable system prefix — it’s long enough to clear the cache minimum.
- Add
cache_control: {type: "ephemeral"}at the system/tools boundary, exactly as the code block above shows. - Run the prompt twice with the same prefix.
- On call 2, read
resp.usage.cache_read_input_tokens.
cache_read_input_tokens > 0 on call 2 (it’s 0 on call 1, the write). Across 10 calls the cached prefix gives a hit rate >80% — reads bill at 0.1x base input.Stretch. If hit rate is <50%, the prefix isn’t actually stable — something per-request (a timestamp, a user id) crept above the cache boundary. Find it and move it below.
Measuring context efficiency.
You can’t improve what you don’t measure. Four numbers worth tracking per agent.
| Metric | Why |
|---|---|
| Average context utilization | Are you running close to the cap? When and why? |
| Cache hit rate | Should be 80%+ on the system prompt. If it isn’t, your “stable” prefix isn’t. |
| Compaction frequency | How often does the harness compact per session? Cluster shows where context grows fastest. |
| Sub-agent dispatch rate | Are you isolating expensive work? If not, your main agent is carrying too much. |
These plug into the eval harness from Practice 04 Unit 02 — the shipped eval_harness.py already reads input_tokens / output_tokens per run and prints cost, latency, and pass-rate. Add the four numbers above as columns on the dashboard.
Capture a token baseline for one agent.
- Wrap every Anthropic call to record
resp.usage.input_tokens,output_tokens, andcache_read_input_tokens. No agent of your own? Run eval_harness.py over test_cases.jsonl — it logs input/output tokens and cost per case toresults.jsonl(add thecache_read_input_tokensfield yourself when you wrap your own calls). - Append each call’s numbers to a file (or your existing telemetry: Datadog, Prometheus).
- Compute p50, p90, p99 input tokens across the run.
- Pick a threshold: p99 input > 2× your p50 = a regression worth a Slack ping.
Stretch. Add a token-per-task metric. A task that burns 10× the baseline tokens is a regression even when the output still looks fine.
The closing maxim.
When an agent “hallucinates” or “forgets”, nine times out of ten the failure is in the context window assembly, not in the model itself.
Master this and your agents will outperform agents twice their size. Skip it and bigger models won’t save you.
Run a 4-move audit on one production prompt.
- Baseline: measure input tokens and quality on 10 representative inputs.
- Apply offload: pull stable content into a cached prefix or tool. Re-measure.
- Apply retrieve: replace any pasted documents with on-demand search. Re-measure.
- Apply isolate: if the prompt does multiple jobs, split it. Re-measure.
- Apply reduce: trim any remaining bloat. Re-measure.
Stretch. Turn the maxim into a standing rule: add a 3-question check to your PR template — (1) Is this the smallest prompt? (2) Was it benchmarked? (3) What was removed and re-tested? Verify it bites by running it on the next prompt change. And if quality drops >5% during the audit, the last move went too far — revert it. Diminishing returns is normal.
The four moves on one page.
Run this checklist on every new agent BEFORE writing prompts. Most context bloat is preventable upstream — by the time you're token-pruning, you've already designed yourself into a corner. This audit catches that.
# CONTEXT BUDGET AUDIT — run before building any agent ## 1. OFFLOAD — what lives outside the context window? - [ ] Long documents → file system, not the prompt - [ ] Historical conversations → external memory store - [ ] Reference data → DB / search, looked up only when needed - [ ] Code / config → repo files Claude reads with the file tool - [ ] User profile / preferences → CLAUDE.md, not every prompt ## 2. RETRIEVE — what loads only when relevant? - [ ] RAG over docs: which retrieval, which embed model, top-K - [ ] Tools that fetch: lookup, search, query — not paste-in-bulk - [ ] Conditional context loading by user intent - [ ] Slow-changing reference data: cached, not re-loaded ## 3. ISOLATE — which steps run in fresh sub-agent contexts? - [ ] Long-running tasks split into sub-agents (each fresh) - [ ] Read-heavy steps separated from synthesis steps - [ ] Tool-calling loops bounded — sub-agent per N calls - [ ] Each sub-agent gets ONLY the context it needs, nothing else ## 4. REDUCE — what gets compacted when context fills? - [ ] Compaction prompt written + tested before you need it - [ ] Trigger: at 70% capacity, not 95% - [ ] What to keep: decisions, open questions, the spine - [ ] What to drop: tool-call logs, scratch reasoning, fixed bugs ## RED FLAGS - Prompt > 4k tokens before any tool call → you skipped offload - Same data appears in multiple sub-agent contexts → fix isolation - Context fills before the task completes → compaction missing - Agent forgets the goal mid-task → ISOLATE step lost the spine ## OUTPUT After running this audit, write 2 sentences: 1. The biggest move you'll make (OFFLOAD/RETRIEVE/ISOLATE/REDUCE) 2. The token budget you're targeting per typical run