Build the agents that last.
A working playbook for engineers shipping agents in production. Twelve units across three days: the five workflow patterns and when each one wins, context as a finite resource (offload, retrieve, isolate, reduce), sub-agent architectures, and the tool-design decisions that pay back in token cost every single request.
This is the practice where AI engineering stops being "let's see what works" and starts being a craft with a vocabulary. Every unit ends with a concrete decision you'll re-use in your next agent build.
- Pick the right workflow pattern for any new agent in under 5 minutes
- Apply the 4 context-engineering moves (offload, retrieve, isolate, reduce) to halve your token cost without losing quality
- Write tool descriptions agents actually use correctly — not API parity, agent affordances
- Decide between a single-context agent and a sub-agent architecture using the cost/coordination trade-off, not gut feel
Workflows vs agents.
The clearest framing: “Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage.”
The line matters because almost every “agent” failure in industry is actually a workflow that was prematurely promoted. If you can write down the steps in advance, build a workflow. Reach for an agent only when the steps cannot be predicted — the model has to decide.
Classify three real use cases as workflow or agent.
- Write each task as a one-sentence goal (“email summary every Friday”, “triage support inbox”, “close month-end books”).
- For each, list the steps you’d hand to a junior engineer. If you reach 3+ steps without saying “it depends”, it’s a workflow.
- If you hit “it depends” or “the model decides” in the first 2 steps, it’s an agent.
- Tag each task:
Wfor workflow,Afor agent.
W — most production wins are workflows, not agents.Stretch. For each A you tagged, sketch how you’d demote it to a workflow by pre-defining the branches. Most agents survive this reframing only when truly unpredictable.
The five workflow patterns.
Five composable building blocks. Memorize them; almost every production system is a combination of these.
| Pattern | Use when |
|---|---|
| Prompt chaining | Steps are sequential and known. Each LLM call refines the prior output. |
| Routing | Classify the input, send it to the specialist. Triage, intent detection. |
| Parallelization | Independent sub-tasks. Run them concurrently; aggregate. |
| Orchestrator-workers | Sub-tasks emerge dynamically. The orchestrator decides per input. |
| Evaluator-optimizer | Iterative improvement. One LLM critiques another’s output. |
Match real tasks to the five patterns.
- List 5 things you ask Claude to do at work.
- For each, pick the pattern that best fits. If multiple fit, prefer the simplest.
- For 2 of them, sketch the data flow on paper (input → steps → output).
- If a task doesn’t fit any pattern cleanly, it’s usually too vague — refine the goal.
Stretch. Pick the most-used pattern across your 5 tasks. That’s your team’s default pattern — build a reusable template for it before building anything new.
The orchestrator-workers pattern.
The pattern most people reach for, often wrong. The orchestrator decides the sub-tasks on every input. Workers execute. The orchestrator synthesizes.
Coding agents are the canonical example: when fixing a complex bug across multiple files, the orchestrator reads the symptom, decides which files to touch, dispatches a worker per file, then synthesizes the diff. The sub-tasks aren’t predefined — they emerge from reading the symptom.
Why this is harder than it looks
- The orchestrator has to be good at decomposition, not just retrieval. Few-shot examples of good plans are gold.
- Workers each need their own context. The orchestrator’s prompt should give each worker only what that worker needs.
- Synthesis is its own skill. Combining N worker outputs into one coherent response is where teams under-invest.
Distinguish from parallelization
Parallelization runs pre-defined sub-tasks. Orchestrator-workers generates the sub-tasks per input. The first is a static fan-out. The second is a dynamic delegation. Different cost, different complexity, different debugging story.
The orchestrator prompt template
The prompt structure that consistently produces well-shaped sub-task lists instead of "let me think about this":
# Drop in as the orchestrator's system prompt
You are the orchestrator. You do not solve the user's problem directly —
you decide how to decompose it and dispatch workers.
For this input: [user's request]
Step 1 — Decompose
List 2-5 sub-tasks, each scoped narrowly enough that one specialized
worker can handle it alone. Each sub-task should be:
- Independently completable (no cross-worker dependency)
- One sentence describing the worker's job
- One sentence describing what "done" looks like
Step 2 — Dispatch
For each sub-task, draft the exact prompt you would send to a worker.
The worker has no context other than this prompt.
Step 3 — Synthesis plan
Describe how you'll merge the workers' outputs into a single answer.
State the merge rule BEFORE you see the outputs:
- Union (combine everything)
- Vote (majority wins)
- Prioritize (worker N's output takes precedence)
- Reduce (extract a single field from each)
DO NOT produce the user's answer in this step. Only the plan and
the worker prompts.
Wait for me to approve before dispatching.
Why the “wait to approve” step matters: Most orchestrator failures aren’t bad workers — they’re bad decomposition. Forcing the orchestrator to expose its sub-task list before dispatching saves you from running 5 workers on the wrong 5 questions.
Sketch a 3-worker orchestrator for one of your tasks.
- Write the orchestrator’s system prompt: its job is to split, dispatch, and combine — never produce the substantive answer itself.
- Write 3 worker prompts. Each has a single, narrowly-scoped job (e.g. “read this file, output JSON”).
- Decide the merge rule: union, vote, prioritize.
Stretch. Swap one worker for a different model (Haiku) and see if quality holds. Routing by cost is the next pattern up.
The simplicity-first principle.
The principle that prevents most "agent" disasters: “Find the simplest solution possible, and only increase complexity when needed.”
Walk up the ladder, never the lift:
The line you will save your team: “Workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale.”
Audit one of your agents for over-engineering.
- Print out the full system prompt + tool list.
- Cross out everything that doesn’t directly enable the core job.
- If the agent still works in your head with crossed-out parts removed, delete them.
- Re-run the agent on 3 representative inputs. Compare quality.
Stretch. Repeat monthly. Prompts grow features the way old code grows TODOs.
Context as a finite resource.
The framing: “Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.”
Three problems that scale together as agents run longer:
- The context window fills up. Requests start getting rejected.
- Token costs scale with context size. Every additional turn pays the cumulative price.
- Model performance degrades as context grows. Empirically measured. Not a small effect.
Long-running agents need explicit strategies for keeping the context lean. Three of them are the rest of this day.
The cheapest win once your stable prefix is identified is prompt caching. Rather than re-teach the mechanics here, do the runnable drill in Practice 04 · “make the cache fire” — it walks the cache_control placement and has you watch cache_read_input_tokens jump on the second call. This unit’s lab below uses that same usage field as its verify.
Measure tokens in your real prompt.
- Paste the prompt into the tokenizer tool or use
anthropic.Anthropic().beta.messages.count_tokens(...). - Note the count. Compare to your typical max_tokens output.
- If prompt > 5x your output, you’re wasting cache budget on stale context.
- Identify which sections rarely change vs change every turn. Move stable sections behind
cache_control.
resp.usage.cache_read_input_tokens.Stretch. Wrap the count_tokens call in a unit test so future prompt edits fail loudly when they bloat.
Compaction.
Summarize the conversation history when approaching context limits. Distil critical information; discard the redundant. The advice that pays off: “Maximize recall to ensure your compaction prompt captures every relevant piece of information from the trace.”
The implementation
- Trigger. When context utilization crosses a threshold (e.g. 70%), kick off compaction.
- Prompt. “Summarize the trace so far. Preserve every fact, decision, and open commitment. Drop everything else.”
- Replace. Substitute the new summary message in place of the compacted turns. Continue.
- Hook before, save after. Claude Code’s
PreCompacthook lets you persist anything about to be lost.
Write a compactor prompt for one agent.
- Define what facts MUST survive (decisions made, open commitments, error states).
- Define what to drop (verbatim tool outputs, intermediate scratch, repeated context).
- Write the prompt: ‘Summarize this trace into <500 words preserving X, Y, Z. Drop A, B, C.’
- Run it on 3 different transcripts. Check the summaries still contain the load-bearing facts.
No long, out-of-context agent transcript on hand? Grab the shipped rotted_transcript.txt (a deliberately bloated multi-turn trace) and compact that — right-click → Save As. Or generate fresh traces by running Practice 04’s eval harness (eval_harness.py) and compacting the verbose rows it leaves in results.jsonl.
Stretch. Track compaction loss: how much does behavior degrade after 5 compactions? That’s your context-half-life.
Structured note-taking.
Instead of keeping everything in context, the agent writes notes to external storage and retrieves them on demand. A NOTES.md file. A wiki. A database.
This is exactly the LLM Wiki pattern (Practice 05) at a more general level. The conversation history grows linearly with turns; the notes file grows logarithmically with new knowledge. The model attends to what it needs, when it needs it.
save_note(key, value) and read_note(key). Watch what it chooses to save. The agent’s save decisions are often more interesting than its outputs — they reveal what it considers load-bearing.
Add structured notes to one agent.
- Add a
notesfield to your agent’s output schema (JSON or markdown frontmatter). - Have the agent write 3 lines per turn: what I learned, what I tried, what’s next.
- Persist the notes to a file or sidecar.
- On next invocation, load the most recent notes back into the system prompt.
Stretch. Track which notes the agent actually re-reads. The unread ones are noise; tighten the format.
Sub-agent architectures.
Specialized agents handle focused tasks and return condensed summaries (typically 1,000-2,000 tokens) to a coordinating agent. The main agent doesn’t carry the full detail; only the conclusions.
The key insight: each sub-agent has its own context window. The exploration cost stays inside the sub-agent. The parent only ever sees the synthesized output. This is why Claude Code uses subagent dispatch for parallel investigation — the main session’s context stays clean.
Sketch a 2-level architecture for your use case.
- Identify the ‘always run’ logic → lead agent.
- Identify 2-3 deep-but-narrow sub-tasks → specialist agents.
- Define the hand-off contract: what the lead passes to each specialist, what each returns.
- On paper, run one input through the architecture and check no specialist needs another’s output.
Stretch. Test: can the lead survive if you swap any specialist for a smaller model? If yes, the boundary is well-drawn.
Agent affordances, not API parity.
The biggest mistake in tool design: thin wrappers around your existing API. The correction: “Build tools matching how agents think.”
An example from the post: implement search_contacts instead of list_contacts, because agents have limited context and shouldn’t waste tokens reading irrelevant data.
Consolidate, don’t fragment
Instead of separate list_users, list_events, create_event tools, build a unified schedule_event tool that handles multiple steps internally. The agent thinks in jobs, not endpoints.
Inventory the affordances of one tool you use.
- Read the tool’s input_schema and description.
- List every distinct action the description claims it can do.
- Cross-reference against actual tool calls in 30 days of logs.
- Anything in the description that has never been called in 30 days — remove from the description.
No 30 days of logs yet? Generate a trace set in ~5 minutes by running Practice 04’s eval harness — eval_harness.py against the shipped agent writes a results.jsonl with one graded row per case, each recording the tool_calls it made. Run it on 10 cases, then count tool calls per action across those 10 rows instead of 30 days of production logs.
Stretch. Pair: when usage drops to zero, the action is probably done by a smarter tool. Find it.
Namespacing.
As agents gain access to dozens of MCP servers and hundreds of tools, namespacing is what keeps them selecting the right one.
# Pick a convention and stick with it. # By service (asana_*, jira_*, slack_*): asana_search · asana_create_task · asana_get_user jira_search · jira_create_issue · jira_get_user # Or by resource within a service: asana_projects_search · asana_users_search · asana_tasks_create
Don’t mix conventions across servers. Once you have 50+ tools, namespace consistency is what separates an agent that picks the right tool from one that picks the closest-sounding tool.
Audit your codebase for namespacing.
- List every tool name your agent exposes. The fastest way: grep your repo for the tool-definition key and pull the names out —
grep -rhoE '"name"[[:space:]]*:[[:space:]]*"[^"]+"' . \ | sed -E 's/.*"name"[[:space:]]*:[[:space:]]*"([^"]+)"/\1/' \ | sort -u > /tmp/toolnames.txt cat /tmp/toolnames.txt
- Pick a namespace scheme:
<domain>_<action>(e.g.github_create_pr,github_read_issue). - Flag the violators — any name without a
<domain>_prefix:grep -vE '^[a-z]+_[a-z_]+$' /tmp/toolnames.txt - Rename every line that prints. Re-run the grep. Add the regex to your style guide / CI.
grep -vE '^[a-z]+_[a-z_]+$') prints nothing — zero un-namespaced tools remain. Every name in /tmp/toolnames.txt matches <domain>_<action>.Stretch. Wire that same grep -vE into CI as a failing check — a new tool whose name breaks the namespace regex fails the build.
Return meaningful context, not raw data.
The principle: “Agents reason better with human-readable fields and simplified outputs than with raw technical IDs.”
Skip the IDs
UUIDs, mime types, internal flags — the model can’t reason about them. Return name, not uuid. Return file_type: "pdf", not mime_type: "application/pdf". Save the technical fields for the next-call inputs.
Offer two response modes
- Concise — high-signal fields only (~72 tokens in a well-tuned tool).
- Detailed — includes IDs for chaining (~206 tokens in their example).
Errors should teach, not just signal failure
We recommend: “Provide constructive feedback rather than opaque codes.” An error message like “Try a more specific search filter” is worth ten of “ERR_TOO_MANY_RESULTS”.
Rewrite a tool to return summaries, not raw data.
- Find a tool that returns > 1000 tokens regularly (likely a search, fetch, or list).
- Add a small summarization step inside the tool itself: top-N, sorted, structured.
- Return both:
{ summary: ..., raw_url: ... }— agent can fetch raw on demand. - Re-run 10 historical agent traces. Compare quality and token usage.
No 10 historical traces to re-run? Make them. Practice 04’s eval harness (eval_harness.py) runs the shipped agent over its 10-case set and records the real answer + token usage per case. Run it once with your tool returning raw data, once with the digest version, and diff the two results.jsonl files — that’s your before/after on the same 10 traces.
Stretch. The pattern works for retrieval too — never return all hits; return the top 5 with snippets.
Prototype · Evaluate · Collaborate.
An iterative three-phase process for tool design, in this order.
The result on internal benchmarks: “Testing on internal Slack and Asana tools, Claude-optimized implementations significantly outperformed human-written versions on held-out test sets.”
The implication: paste your tool descriptions into Claude, ask it to make them more agent-friendly, run the new versions through your eval suite, keep the wins. Most of your tool-quality improvements live one prompt away.
Run the prototype · evaluate · collaborate loop end-to-end.
- Prototype (30 min): get a single end-to-end run working on 1 input. Skip evals, skip polish.
- Evaluate (30 min): write 10 test cases. Run them. Note the failure modes.
- Collaborate (30 min): share the prototype + eval results with one teammate. Adjust based on their first question.
- Repeat the loop. Each cycle should take half as long as the previous.
Stretch. The loop is the meta-skill. The artifacts are bonus.
One page you keep on your desk.
By the end of this practice, you should be able to look at a new agent task and pick the right pattern in under 5 minutes. This cheatsheet is the working artifact — copy it into a doc, pin it, refer to it on every new build.
## AGENT PATTERN SELECTION — pick the right one in 5 minutes
## STEP 1 — Is this a workflow or an agent?
- Predefined steps, deterministic order? → WORKFLOW (cheaper, more debuggable)
- Steps need to be decided per input, model-driven? → AGENT
- When in doubt: start as workflow. Promote to agent only if
step-determination is itself the hard problem.
## STEP 2 — Pick the workflow pattern (if workflow)
| Pattern | Pick when… |
|-----------------------|------------------------------------------------------------------|
| Prompt chaining | Output of step N is input to step N+1; clear handoff |
| Routing | Input falls into one of K categories; each gets a different path |
| Parallelization | Sub-tasks are PREDEFINED and independent |
| Orchestrator-workers | Sub-tasks must be DECIDED per input by an LLM |
| Evaluator-optimizer | Quality of output matters more than latency; iterate to improve |
## STEP 3 — Context budget check (every pattern)
For every new agent, run these 4 checks BEFORE writing prompts:
- [ ] OFFLOAD: what can live outside the context window? (files, DB, search)
- [ ] RETRIEVE: what's loaded only when relevant? (RAG, lookup)
- [ ] ISOLATE: which steps run in fresh sub-agent contexts?
- [ ] REDUCE: what gets compacted/summarized when context fills?
## STEP 4 — Tool design audit
For every tool the agent uses:
- [ ] Match agent affordances, not API parity. (Don't expose 50 raw endpoints;
expose 5 verbs the agent thinks in.)
- [ ] Return MEANINGFUL fields, not raw IDs.
- [ ] Namespace clearly: `gh.pr.merge`, not `merge`.
- [ ] Errors teach the agent what to do next, not just signal failure.
- [ ] Tool description starts with WHEN to use it, not HOW.
## STEP 5 — Eval & ship
- [ ] One test case per intended use, written before coding.
- [ ] One adversarial test case (input that should fail safely).
- [ ] Cost estimate per typical run (tokens × rate × steps).
- [ ] Decide simplicity-first: if a single prompt to a strong model gets
80% of the value, ship that first.
## RED FLAGS — stop and reconsider
- Your orchestrator is doing the work itself (not delegating)
- Your workers need to talk to each other (couple them or merge them)
- Your context is over 80% full mid-task (compaction time)
- You picked agent because it "felt" right — workflow probably wins
- You have 7+ tools per agent (consolidate)
Use it on your next real build.
- Copy the cheatsheet into a doc you'll have open while coding.
- For the next agent you build, run through steps 1-5 BEFORE writing any prompts.
- Note which red flags you triggered. Adjust the design before code.
- After shipping, mark which steps you actually used vs skipped. The skipped ones are where bugs hide.