Practices/ The Agent Engineering Playbook
3 days · 12 units
Pradhya Practice 07 · The Agent Engineering Playbook Builder

Build the agents that last.

A working playbook for engineers shipping agents in production. Twelve units across three days: the five workflow patterns and when each one wins, context as a finite resource (offload, retrieve, isolate, reduce), sub-agent architectures, and the tool-design decisions that pay back in token cost every single request.

This is the practice where AI engineering stops being "let's see what works" and starts being a craft with a vocabulary. Every unit ends with a concrete decision you'll re-use in your next agent build.

Audience
Engineers shipping agents in production
Length
3 sessions · 90 min each
Walk-away
The five patterns + tool-design checklist
Prereq
The Agents Practice or equivalent
What you’ll be able to do by the end
  • Pick the right workflow pattern for any new agent in under 5 minutes
  • Apply the 4 context-engineering moves (offload, retrieve, isolate, reduce) to halve your token cost without losing quality
  • Write tool descriptions agents actually use correctly — not API parity, agent affordances
  • Decide between a single-context agent and a sub-agent architecture using the cost/coordination trade-off, not gut feel
§ 07.01.01 · Unit 01

Workflows vs agents.

The clearest framing: “Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage.”

Workflow predefined code paths step 1 step 2 step 3 Agent model directs itself model search? read file? ask user? commit?
Two systems · same model · different control flow

The line matters because almost every “agent” failure in industry is actually a workflow that was prematurely promoted. If you can write down the steps in advance, build a workflow. Reach for an agent only when the steps cannot be predicted — the model has to decide.

The test Can a junior engineer write the steps as code, in order, with no “it depends”? Build the workflow. Otherwise, you need the model in the loop.

Classify three real use cases as workflow or agent.

You’ll do
Pick 3 jobs you do at work this week. For each, decide whether the steps are knowable in advance.
Steps
  1. Write each task as a one-sentence goal (“email summary every Friday”, “triage support inbox”, “close month-end books”).
  2. For each, list the steps you’d hand to a junior engineer. If you reach 3+ steps without saying “it depends”, it’s a workflow.
  3. If you hit “it depends” or “the model decides” in the first 2 steps, it’s an agent.
  4. Tag each task: W for workflow, A for agent.
Verify
You should have at least one W — most production wins are workflows, not agents.

Stretch. For each A you tagged, sketch how you’d demote it to a workflow by pre-defining the branches. Most agents survive this reframing only when truly unpredictable.

§ 07.01.02 · Unit 02

The five workflow patterns.

Five composable building blocks. Memorize them; almost every production system is a combination of these.

1. Prompt chaining LLM LLM LLM → done 2. Routing router spec A spec B 3. Parallelization split aggregate 4. Orchestrator-workers orchestrator worker A worker B worker C 5. Evaluator-optimizer generator evaluator → ok / fail retry with critique The point Most production systems are composed from these five. Master the parts; assemble.
The five composable building blocks · practice reference
PatternUse when
Prompt chaining Steps are sequential and known. Each LLM call refines the prior output.
Routing Classify the input, send it to the specialist. Triage, intent detection.
Parallelization Independent sub-tasks. Run them concurrently; aggregate.
Orchestrator-workersSub-tasks emerge dynamically. The orchestrator decides per input.
Evaluator-optimizer Iterative improvement. One LLM critiques another’s output.

Match real tasks to the five patterns.

You’ll do
Take 5 tasks from your week and assign each to one of: prompt chaining, routing, parallelization, orchestrator-workers, or evaluator-optimizer.
Steps
  1. List 5 things you ask Claude to do at work.
  2. For each, pick the pattern that best fits. If multiple fit, prefer the simplest.
  3. For 2 of them, sketch the data flow on paper (input → steps → output).
  4. If a task doesn’t fit any pattern cleanly, it’s usually too vague — refine the goal.
Verify
You should be able to name the inputs and outputs of each step within 30 seconds. If you can’t, the pattern is wrong.

Stretch. Pick the most-used pattern across your 5 tasks. That’s your team’s default pattern — build a reusable template for it before building anything new.

§ 07.01.03 · Unit 03

The orchestrator-workers pattern.

The pattern most people reach for, often wrong. The orchestrator decides the sub-tasks on every input. Workers execute. The orchestrator synthesizes.

Coding agents are the canonical example: when fixing a complex bug across multiple files, the orchestrator reads the symptom, decides which files to touch, dispatches a worker per file, then synthesizes the diff. The sub-tasks aren’t predefined — they emerge from reading the symptom.

Why this is harder than it looks

  • The orchestrator has to be good at decomposition, not just retrieval. Few-shot examples of good plans are gold.
  • Workers each need their own context. The orchestrator’s prompt should give each worker only what that worker needs.
  • Synthesis is its own skill. Combining N worker outputs into one coherent response is where teams under-invest.

Distinguish from parallelization

Parallelization runs pre-defined sub-tasks. Orchestrator-workers generates the sub-tasks per input. The first is a static fan-out. The second is a dynamic delegation. Different cost, different complexity, different debugging story.

The orchestrator prompt template

The prompt structure that consistently produces well-shaped sub-task lists instead of "let me think about this":

# Drop in as the orchestrator's system prompt
You are the orchestrator. You do not solve the user's problem directly —
you decide how to decompose it and dispatch workers.

For this input: [user's request]

Step 1 — Decompose
List 2-5 sub-tasks, each scoped narrowly enough that one specialized
worker can handle it alone. Each sub-task should be:
- Independently completable (no cross-worker dependency)
- One sentence describing the worker's job
- One sentence describing what "done" looks like

Step 2 — Dispatch
For each sub-task, draft the exact prompt you would send to a worker.
The worker has no context other than this prompt.

Step 3 — Synthesis plan
Describe how you'll merge the workers' outputs into a single answer.
State the merge rule BEFORE you see the outputs:
- Union (combine everything)
- Vote (majority wins)
- Prioritize (worker N's output takes precedence)
- Reduce (extract a single field from each)

DO NOT produce the user's answer in this step. Only the plan and
the worker prompts.

Wait for me to approve before dispatching.

Why the “wait to approve” step matters: Most orchestrator failures aren’t bad workers — they’re bad decomposition. Forcing the orchestrator to expose its sub-task list before dispatching saves you from running 5 workers on the wrong 5 questions.

Sketch a 3-worker orchestrator for one of your tasks.

You’ll do
Pick a task that needs parallel sub-jobs (research a topic, audit a doc, review a PR). Design the orchestrator + 3 worker prompts.
Steps
  1. Write the orchestrator’s system prompt: its job is to split, dispatch, and combine — never produce the substantive answer itself.
  2. Write 3 worker prompts. Each has a single, narrowly-scoped job (e.g. “read this file, output JSON”).
  3. Decide the merge rule: union, vote, prioritize.
Verify
Run it end-to-end on one input. The orchestrator should never need to know the worker’s implementation; the workers should never need to coordinate.

Stretch. Swap one worker for a different model (Haiku) and see if quality holds. Routing by cost is the next pattern up.

§ 07.01.04 · Unit 04

The simplicity-first principle.

The principle that prevents most "agent" disasters: “Find the simplest solution possible, and only increase complexity when needed.”

Walk up the ladder, never the lift:

prompt + tools workflow agent multi-agent stop at the first rung that solves the problem
Complexity ladder · climb only as high as the task requires

The line you will save your team: “Workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale.”

Audit one of your agents for over-engineering.

You’ll do
Pick the most complex prompt or agent you’ve built. Find the parts you could remove.
Steps
  1. Print out the full system prompt + tool list.
  2. Cross out everything that doesn’t directly enable the core job.
  3. If the agent still works in your head with crossed-out parts removed, delete them.
  4. Re-run the agent on 3 representative inputs. Compare quality.
Verify
Quality should be ≥95% of original with ≤70% of the prompt length. If not, restore the cut and try again.

Stretch. Repeat monthly. Prompts grow features the way old code grows TODOs.

§ 07.02.01 · Unit 05

Context as a finite resource.

The framing: “Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.”

Three problems that scale together as agents run longer:

  1. The context window fills up. Requests start getting rejected.
  2. Token costs scale with context size. Every additional turn pays the cumulative price.
  3. Model performance degrades as context grows. Empirically measured. Not a small effect.

Long-running agents need explicit strategies for keeping the context lean. Three of them are the rest of this day.

The maxim The context engineer’s job is not including things. It is excluding things. What does the agent need to see for this turn? Less is the answer almost every time.

The cheapest win once your stable prefix is identified is prompt caching. Rather than re-teach the mechanics here, do the runnable drill in Practice 04 · “make the cache fire” — it walks the cache_control placement and has you watch cache_read_input_tokens jump on the second call. This unit’s lab below uses that same usage field as its verify.

Measure tokens in your real prompt.

You’ll do
Take a system prompt you use weekly. Count its tokens. Decide what to cut.
Steps
  1. Paste the prompt into the tokenizer tool or use anthropic.Anthropic().beta.messages.count_tokens(...).
  2. Note the count. Compare to your typical max_tokens output.
  3. If prompt > 5x your output, you’re wasting cache budget on stale context.
  4. Identify which sections rarely change vs change every turn. Move stable sections behind cache_control.
Verify
Cache hit rate should be ≥80% on subsequent runs. Check resp.usage.cache_read_input_tokens.

Stretch. Wrap the count_tokens call in a unit test so future prompt edits fail loudly when they bloat.

§ 07.02.02 · Unit 06

Compaction.

Summarize the conversation history when approaching context limits. Distil critical information; discard the redundant. The advice that pays off: “Maximize recall to ensure your compaction prompt captures every relevant piece of information from the trace.”

Before 50 turns of conversation compact After one summarized message summary of last 50 turns facts · decisions · open commitments
Compact when approaching the cap · preserve recall, drop the noise

The implementation

  • Trigger. When context utilization crosses a threshold (e.g. 70%), kick off compaction.
  • Prompt. “Summarize the trace so far. Preserve every fact, decision, and open commitment. Drop everything else.”
  • Replace. Substitute the new summary message in place of the compacted turns. Continue.
  • Hook before, save after. Claude Code’s PreCompact hook lets you persist anything about to be lost.

Write a compactor prompt for one agent.

You’ll do
Take an agent that ran out of context. Write the compactor that turns its trace into a summary.
Steps
  1. Define what facts MUST survive (decisions made, open commitments, error states).
  2. Define what to drop (verbatim tool outputs, intermediate scratch, repeated context).
  3. Write the prompt: ‘Summarize this trace into <500 words preserving X, Y, Z. Drop A, B, C.’
  4. Run it on 3 different transcripts. Check the summaries still contain the load-bearing facts.
Verify
A new agent given the summary should be able to continue without asking clarifying questions about prior context.

No long, out-of-context agent transcript on hand? Grab the shipped rotted_transcript.txt (a deliberately bloated multi-turn trace) and compact that — right-click → Save As. Or generate fresh traces by running Practice 04’s eval harness (eval_harness.py) and compacting the verbose rows it leaves in results.jsonl.

Stretch. Track compaction loss: how much does behavior degrade after 5 compactions? That’s your context-half-life.

§ 07.02.03 · Unit 07

Structured note-taking.

Instead of keeping everything in context, the agent writes notes to external storage and retrieves them on demand. A NOTES.md file. A wiki. A database.

agent in-context save_note read_note NOTES.md key/value persistent
Working memory ↔ disk · context stays lean

This is exactly the LLM Wiki pattern (Practice 05) at a more general level. The conversation history grows linearly with turns; the notes file grows logarithmically with new knowledge. The model attends to what it needs, when it needs it.

The pattern Give the agent two tools: save_note(key, value) and read_note(key). Watch what it chooses to save. The agent’s save decisions are often more interesting than its outputs — they reveal what it considers load-bearing.

Add structured notes to one agent.

You’ll do
Make the agent emit a structured journal after each turn. The journal is the agent’s memory across calls.
Steps
  1. Add a notes field to your agent’s output schema (JSON or markdown frontmatter).
  2. Have the agent write 3 lines per turn: what I learned, what I tried, what’s next.
  3. Persist the notes to a file or sidecar.
  4. On next invocation, load the most recent notes back into the system prompt.
Verify
After 5 turns, the agent should reference its own earlier decisions when asked.

Stretch. Track which notes the agent actually re-reads. The unread ones are noise; tighten the format.

§ 07.02.04 · Unit 08

Sub-agent architectures.

Specialized agents handle focused tasks and return condensed summaries (typically 1,000-2,000 tokens) to a coordinating agent. The main agent doesn’t carry the full detail; only the conclusions.

The key insight: each sub-agent has its own context window. The exploration cost stays inside the sub-agent. The parent only ever sees the synthesized output. This is why Claude Code uses subagent dispatch for parallel investigation — the main session’s context stays clean.

When to dispatch a sub-agent When the work is genuinely independent and the intermediate results would pollute the parent’s context. Searching 30 files for a pattern: dispatch. Reading 30 unrelated docs: dispatch. Single linear refactor: don’t.

Sketch a 2-level architecture for your use case.

You’ll do
Pick a task that’s currently one monolithic prompt. Decompose it into a lead + 2-3 specialists.
Steps
  1. Identify the ‘always run’ logic → lead agent.
  2. Identify 2-3 deep-but-narrow sub-tasks → specialist agents.
  3. Define the hand-off contract: what the lead passes to each specialist, what each returns.
  4. On paper, run one input through the architecture and check no specialist needs another’s output.
Verify
If two specialists need to coordinate, they’re actually one specialist — merge them.

Stretch. Test: can the lead survive if you swap any specialist for a smaller model? If yes, the boundary is well-drawn.

§ 07.03.01 · Unit 09

Agent affordances, not API parity.

The biggest mistake in tool design: thin wrappers around your existing API. The correction: “Build tools matching how agents think.”

API parity list_users list_events create_event list_rooms book_room agent has to call many · fragile Agent affordance schedule_meeting handles users + room + event one tool · agent thinks in jobs
Agents think in jobs · API thinks in resources · build tools for jobs

An example from the post: implement search_contacts instead of list_contacts, because agents have limited context and shouldn’t waste tokens reading irrelevant data.

Consolidate, don’t fragment

Instead of separate list_users, list_events, create_event tools, build a unified schedule_event tool that handles multiple steps internally. The agent thinks in jobs, not endpoints.

The one-line test If a human engineer can’t definitively say which of two tools to use, the agent won’t either. Merge or differentiate. Don’t ship overlapping tools.

Inventory the affordances of one tool you use.

You’ll do
Pick one tool exposed to a Claude agent. List every action it can take. Find the unused ones.
Steps
  1. Read the tool’s input_schema and description.
  2. List every distinct action the description claims it can do.
  3. Cross-reference against actual tool calls in 30 days of logs.
  4. Anything in the description that has never been called in 30 days — remove from the description.
Verify
Calls per action / total calls = your activation rate. Anything below 5% is description bloat.

No 30 days of logs yet? Generate a trace set in ~5 minutes by running Practice 04’s eval harnesseval_harness.py against the shipped agent writes a results.jsonl with one graded row per case, each recording the tool_calls it made. Run it on 10 cases, then count tool calls per action across those 10 rows instead of 30 days of production logs.

Stretch. Pair: when usage drops to zero, the action is probably done by a smarter tool. Find it.

§ 07.03.02 · Unit 10

Namespacing.

As agents gain access to dozens of MCP servers and hundreds of tools, namespacing is what keeps them selecting the right one.

# Pick a convention and stick with it.

# By service (asana_*, jira_*, slack_*):
asana_search · asana_create_task · asana_get_user
jira_search  · jira_create_issue · jira_get_user

# Or by resource within a service:
asana_projects_search · asana_users_search · asana_tasks_create

Don’t mix conventions across servers. Once you have 50+ tools, namespace consistency is what separates an agent that picks the right tool from one that picks the closest-sounding tool.

Audit your codebase for namespacing.

You’ll do
Tool names should be predictable. Find the ones that aren’t.
Steps
  1. List every tool name your agent exposes. The fastest way: grep your repo for the tool-definition key and pull the names out —
    grep -rhoE '"name"[[:space:]]*:[[:space:]]*"[^"]+"' . \
      | sed -E 's/.*"name"[[:space:]]*:[[:space:]]*"([^"]+)"/\1/' \
      | sort -u > /tmp/toolnames.txt
    cat /tmp/toolnames.txt
  2. Pick a namespace scheme: <domain>_<action> (e.g. github_create_pr, github_read_issue).
  3. Flag the violators — any name without a <domain>_ prefix:
    grep -vE '^[a-z]+_[a-z_]+$' /tmp/toolnames.txt
  4. Rename every line that prints. Re-run the grep. Add the regex to your style guide / CI.
Verify
The second grep (grep -vE '^[a-z]+_[a-z_]+$') prints nothing — zero un-namespaced tools remain. Every name in /tmp/toolnames.txt matches <domain>_<action>.

Stretch. Wire that same grep -vE into CI as a failing check — a new tool whose name breaks the namespace regex fails the build.

§ 07.03.03 · Unit 11

Return meaningful context, not raw data.

The principle: “Agents reason better with human-readable fields and simplified outputs than with raw technical IDs.”

Skip the IDs

UUIDs, mime types, internal flags — the model can’t reason about them. Return name, not uuid. Return file_type: "pdf", not mime_type: "application/pdf". Save the technical fields for the next-call inputs.

Offer two response modes

  • Concise — high-signal fields only (~72 tokens in a well-tuned tool).
  • Detailed — includes IDs for chaining (~206 tokens in their example).

Errors should teach, not just signal failure

We recommend: “Provide constructive feedback rather than opaque codes.” An error message like “Try a more specific search filter” is worth ten of “ERR_TOO_MANY_RESULTS”.

Rewrite a tool to return summaries, not raw data.

You’ll do
Pick the tool with the biggest token output. Make it return a digest instead.
Steps
  1. Find a tool that returns > 1000 tokens regularly (likely a search, fetch, or list).
  2. Add a small summarization step inside the tool itself: top-N, sorted, structured.
  3. Return both: { summary: ..., raw_url: ... } — agent can fetch raw on demand.
  4. Re-run 10 historical agent traces. Compare quality and token usage.
Verify
Tokens should drop ≥60% with no observable quality loss. If quality drops, your summarizer is too aggressive.

No 10 historical traces to re-run? Make them. Practice 04’s eval harness (eval_harness.py) runs the shipped agent over its 10-case set and records the real answer + token usage per case. Run it once with your tool returning raw data, once with the digest version, and diff the two results.jsonl files — that’s your before/after on the same 10 traces.

Stretch. The pattern works for retrieval too — never return all hits; return the top 5 with snippets.

§ 07.03.04 · Unit 12

Prototype · Evaluate · Collaborate.

An iterative three-phase process for tool design, in this order.

1. Prototype build it rough; run locally 2. Evaluate real tasks; multi-call eval 3. Collaborate have Claude analyze + rewrite iterate until evals plateau
Tool-writing process · practice reference

The result on internal benchmarks: “Testing on internal Slack and Asana tools, Claude-optimized implementations significantly outperformed human-written versions on held-out test sets.”

The implication: paste your tool descriptions into Claude, ask it to make them more agent-friendly, run the new versions through your eval suite, keep the wins. Most of your tool-quality improvements live one prompt away.

The closing The four canonical posts together are about 12,000 words. They will give you a year of corrections you would otherwise pay for in production. The price of admission is reading them slowly. This practice is the index; the source is the textbook.

Run the prototype · evaluate · collaborate loop end-to-end.

You’ll do
Take a task you haven’t built yet. Run the full loop in 90 minutes.
Steps
  1. Prototype (30 min): get a single end-to-end run working on 1 input. Skip evals, skip polish.
  2. Evaluate (30 min): write 10 test cases. Run them. Note the failure modes.
  3. Collaborate (30 min): share the prototype + eval results with one teammate. Adjust based on their first question.
  4. Repeat the loop. Each cycle should take half as long as the previous.
Verify
After 3 loops you should have either shipped or killed the idea. Anything in between is debt.

Stretch. The loop is the meta-skill. The artifacts are bonus.

§ Walk-away · The pattern-selection cheatsheet

One page you keep on your desk.

By the end of this practice, you should be able to look at a new agent task and pick the right pattern in under 5 minutes. This cheatsheet is the working artifact — copy it into a doc, pin it, refer to it on every new build.

## AGENT PATTERN SELECTION — pick the right one in 5 minutes

## STEP 1 — Is this a workflow or an agent?
- Predefined steps, deterministic order? → WORKFLOW (cheaper, more debuggable)
- Steps need to be decided per input, model-driven? → AGENT
- When in doubt: start as workflow. Promote to agent only if
  step-determination is itself the hard problem.

## STEP 2 — Pick the workflow pattern (if workflow)
| Pattern               | Pick when…                                                       |
|-----------------------|------------------------------------------------------------------|
| Prompt chaining       | Output of step N is input to step N+1; clear handoff             |
| Routing               | Input falls into one of K categories; each gets a different path |
| Parallelization       | Sub-tasks are PREDEFINED and independent                         |
| Orchestrator-workers  | Sub-tasks must be DECIDED per input by an LLM                    |
| Evaluator-optimizer   | Quality of output matters more than latency; iterate to improve  |

## STEP 3 — Context budget check (every pattern)
For every new agent, run these 4 checks BEFORE writing prompts:
- [ ] OFFLOAD: what can live outside the context window? (files, DB, search)
- [ ] RETRIEVE: what's loaded only when relevant? (RAG, lookup)
- [ ] ISOLATE: which steps run in fresh sub-agent contexts?
- [ ] REDUCE: what gets compacted/summarized when context fills?

## STEP 4 — Tool design audit
For every tool the agent uses:
- [ ] Match agent affordances, not API parity. (Don't expose 50 raw endpoints;
      expose 5 verbs the agent thinks in.)
- [ ] Return MEANINGFUL fields, not raw IDs.
- [ ] Namespace clearly: `gh.pr.merge`, not `merge`.
- [ ] Errors teach the agent what to do next, not just signal failure.
- [ ] Tool description starts with WHEN to use it, not HOW.

## STEP 5 — Eval & ship
- [ ] One test case per intended use, written before coding.
- [ ] One adversarial test case (input that should fail safely).
- [ ] Cost estimate per typical run (tokens × rate × steps).
- [ ] Decide simplicity-first: if a single prompt to a strong model gets
      80% of the value, ship that first.

## RED FLAGS — stop and reconsider
- Your orchestrator is doing the work itself (not delegating)
- Your workers need to talk to each other (couple them or merge them)
- Your context is over 80% full mid-task (compaction time)
- You picked agent because it "felt" right — workflow probably wins
- You have 7+ tools per agent (consolidate)

Use it on your next real build.

Steps
  1. Copy the cheatsheet into a doc you'll have open while coding.
  2. For the next agent you build, run through steps 1-5 BEFORE writing any prompts.
  3. Note which red flags you triggered. Adjust the design before code.
  4. After shipping, mark which steps you actually used vs skipped. The skipped ones are where bugs hide.