When one agent isn't enough.
Multi-agent systems are the next architecture wave — and also the place engineers most often over-engineer. This practice gives you the three patterns that actually work, the one pattern that almost always fails, and the coordination prompts that turn "many agents flailing" into "many agents shipping."
Most teams reach for multi-agent because it sounds modern. Most teams should have stuck with a workflow + tools. This practice teaches you to know which side you're on before you start building — and how to build it properly when you do.
- Decide single-agent vs multi-agent for any task in ≤5 min using the cost/coordination matrix
- Pick between orchestrator-worker, supervisor-subagent, and peer-to-peer with intent, not gut feel
- Write coordination prompts that survive 30+ tool calls without losing the goal
- Diagnose three classes of multi-agent failure (rogue worker, drift, deadlock) in ≤10 min
Validation: the patterns in this practice follow Anthropic’s published agent workflow guidance for orchestrator-workers and evaluator-optimizer loops: anthropic.com/engineering/building-effective-agents.
When NOT to use multi-agent.
Motto: the right number of agents is one, until it provably isn't.
Single-agent always wins on simplicity, cost, and debuggability. Multi-agent is justified ONLY when at least one of these is true:
- The task has independent sub-tasks that can run in parallel. (Five workers researching five companies; the orchestrator merges. Not "one worker that takes 5 sequential steps.")
- Different sub-tasks need genuinely different context, system prompts, or tools. (A code-review subagent has different rules than a security-review subagent.)
- A long-running task needs context isolation — without it, the main context fills up and the agent loses the goal halfway through.
- You need parallel execution for latency, not just cost.
If none of these apply, build a single agent with more tools, a richer system prompt, and a Plan-then-Execute loop. You will ship faster and debug easier. Promote to multi-agent only when measurement, not vibes, says you have to.
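The four conditions above amount to an any-of check. A minimal sketch in Python, assuming a hypothetical `TaskProfile` record (the class and its field names are illustrative, not part of any framework):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a task; field names are illustrative."""
    independent_subtasks: bool      # sub-tasks can run in parallel
    needs_distinct_contexts: bool   # sub-tasks need different prompts or tools
    needs_context_isolation: bool   # long-running; one context would overflow
    latency_bound: bool             # parallelism needed for wall-clock time


def multi_agent_reasons(task: TaskProfile) -> list[str]:
    """Return the justifications that apply; an empty list means stay single-agent."""
    checks = {
        "independent sub-tasks": task.independent_subtasks,
        "different context, prompts, or tools": task.needs_distinct_contexts,
        "context isolation": task.needs_context_isolation,
        "parallelism for latency": task.latency_bound,
    }
    return [reason for reason, applies in checks.items() if applies]


support_bot = TaskProfile(False, False, False, False)
research = TaskProfile(True, False, False, True)
print(multi_agent_reasons(support_bot))  # [] -> build a single agent
print(multi_agent_reasons(research))
```

Returning the list of reasons rather than a bare boolean keeps the justification visible in the design doc, which is what the measurement-not-vibes rule asks for.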
The decision prompt
Before building anything multi-agent, run this prompt on Claude with your design spec:
# Before you build any multi-agent system, run this
I'm considering a multi-agent architecture for this task: [describe in 3 sentences].
Question 1: Could one agent with the right tools and a longer system prompt do this? If yes, name the tools and sketch the system prompt.
Question 2: If multi-agent IS justified, which one of these is the reason?
  A) Genuine parallelism: independent sub-tasks
  B) Different sub-tasks need different system prompts / tools
  C) Long-running, need context isolation per sub-task
  D) Parallel execution for latency reduction
Question 3: For my specific case, estimate:
  - Total tokens per task at single-agent
  - Total tokens per task at multi-agent (sum across all agents)
  - Wall-clock latency for each
Question 4: Be honest: am I reaching for multi-agent because it actually solves my problem, or because it sounds cool? Push back if I'm in the second camp.
Why Q4 matters: most over-engineered multi-agent systems were built because someone wanted to ship a "multi-agent system," not because the task demanded one. Forcing the model to push back makes the over-engineering case explicit.
Run the decision prompt on a real system idea.
- Pick a system you’re considering. No idea on hand? Use the fallback: “a support bot that escalates billing disputes.”
- Copy the decision prompt above into a fresh Claude chat.
- Replace `[describe in 3 sentences]` with your system (for the fallback: “Users message a bot about their bill. The bot answers routine questions and, when a charge is genuinely disputed, opens a ticket and hands off to a human. It must never promise a refund itself.”).
- Send it. Read Question 2 and Question 4 in the reply.
Stretch. Paste the Question 3 token/latency estimates next to each other. If single-agent and multi-agent are within 2× on tokens and latency, multi-agent is buying you nothing — write that conclusion in one sentence.
Orchestrator-worker.
Motto: one agent decides what to do; many agents do it; the first one synthesizes the answer.
The most common and most useful pattern. The orchestrator reads the input, decides the sub-tasks, dispatches workers, then merges results. Workers don't know about each other and can run in parallel.
Use when: the task has sub-jobs that can be done independently, by workers with the same or similar capability. Examples: research 5 companies, audit 10 files, draft 3 alternatives.
You already built the core of this. The orchestrator prompt template in the Agent Engineering Playbook, §07.01.03 taught the split/dispatch/combine skeleton. The version below is the production header around it — it adds an explicit time-out rule, a named merge strategy chosen before outputs land, and a mandatory show-the-plan-first gate. Import that one; extend it with these.
The orchestrator system prompt
You are an orchestrator. You do not solve the user's problem directly.
Your job:
1. DECOMPOSE the request into 2-7 independent sub-tasks. Each must be:
   - Completable by a single worker with no help from siblings
   - Phrased as a self-contained prompt (one paragraph max)
   - Tagged with a clear "done" signal you can recognize
2. DISPATCH all workers in parallel. Workers share NO state. Each worker gets exactly the prompt you wrote, with no extra context.
3. WAIT for all workers to return. Time-out at [N seconds] per worker; if a worker times out, retry once or proceed without.
4. SYNTHESIZE using the MERGE STRATEGY you decided BEFORE you saw the outputs. Pick one:
   - UNION: combine all worker outputs into a single structured doc
   - VOTE: take the majority answer when workers disagree
   - PRIORITIZE: worker N's answer wins on conflict
   - REDUCE: extract one field from each worker, then collapse
5. RETURN the synthesis + a one-line summary of which workers contributed what.
Before dispatching, show me the decomposition and merge strategy. Wait for my OK.
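The dispatch, time-out, and merge steps of that prompt can also be enforced in the harness rather than left to the model. A minimal sketch using `asyncio`, assuming a hypothetical `stub_worker` coroutine in place of a real model call and a UNION merge keyed by prompt (all names here are illustrative):

```python
import asyncio
from typing import Awaitable, Callable, Optional

Worker = Callable[[str], Awaitable[str]]


async def run_with_retry(worker: Worker, prompt: str, timeout_s: float) -> Optional[str]:
    """Run one worker with a per-attempt time-out; retry once, then give up."""
    for _attempt in range(2):
        try:
            return await asyncio.wait_for(worker(prompt), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue
    return None  # proceed without this worker


async def orchestrate(prompts: list[str], worker: Worker, timeout_s: float) -> dict:
    """Dispatch all workers in parallel, then UNION-merge whatever came back."""
    results = await asyncio.gather(
        *(run_with_retry(worker, p, timeout_s) for p in prompts)
    )
    merged = {p: r for p, r in zip(prompts, results) if r is not None}
    missing = [p for p, r in zip(prompts, results) if r is None]
    return {"merge_strategy": "UNION", "outputs": merged, "timed_out": missing}


async def stub_worker(prompt: str) -> str:
    """Stand-in for a model call; prompts containing 'slow' never finish in time."""
    await asyncio.sleep(1.0 if "slow" in prompt else 0.0)
    return f"summary of {prompt}"


report = asyncio.run(
    orchestrate(["company A", "company B", "slow company C"], stub_worker, timeout_s=0.05)
)
print(report["timed_out"])  # ['slow company C']
```

Each worker receives only its own prompt string, so the "workers share NO state" rule holds by construction, and the `timed_out` list gives the operator the same visibility the prompt's step 5 asks for.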
Make the orchestrator decompose a folder you know.
- Pick a folder you know well: a side project, a work repo, or even `~/Documents` with a few sub-folders.
- Copy the orchestrator system prompt above into a fresh Claude chat.
- As your first user message, paste a plain list of its top-level contents (run `ls` or `ls -d */` and paste the output) and say: “Decompose understanding this into worker tasks.”
- Stop at the plan. The prompt ends with “Wait for my OK”; do not approve. Just read the decomposition it shows you.
Stretch. Ask it to also print the merge strategy it picked (UNION / VOTE / PRIORITIZE / REDUCE) and one sentence on why. A good orchestrator commits to the merge before seeing outputs — confirm it did.
Supervisor-subagent.
Motto: one agent watches the work; the other does it; the supervisor steps in when the subagent loses the thread.
Different from orchestrator-worker: there's only one subagent doing the work at a time, and the supervisor is reading the subagent's output as it streams, intervening when the subagent goes off-track. Use for long-running tasks (15+ minutes, 30+ tool calls) where context drift is the main risk.
Use when: a single agent is the right shape but the context fills up before the task finishes. The supervisor's job is to summarize and redirect — it's a context-management agent, not a co-worker.
The supervisor system prompt
You are a supervisor watching a subagent work. The subagent is
doing this task: [describe].
Your responsibilities, in priority order:
1. WATCH the subagent's output. Every 5 messages or 10 tool calls
(whichever comes first), pause and ask yourself:
- Is the subagent still working on the right thing?
- Is the subagent's reasoning becoming circular or stuck?
- Is the subagent's context getting close to full?
2. If everything's fine, do nothing. Let the subagent work.
3. If something's off, INTERRUPT with one of:
- REDIRECT: "Stop. You were doing X. The goal is Y. Restart from [step]."
- COMPACT: "Summarize what you've learned so far in ≤200 words.
Then continue with that summary as your only state."
- HANDOFF: "You've done the analysis. Stop. Hand off to a fresh
subagent with this summary as input: [...]"
4. NEVER do the subagent's work yourself. Your role is meta — you
manage the work, you don't do it.
5. Log every intervention so the operator can review. Format:
[TIMESTAMP] INTERVENTION: [redirect|compact|handoff]
| REASON: [...] | RESULT: [...]
The "log every intervention" rule is the part that makes this debuggable. Multi-agent systems without intervention logs are unfixable; you can't tell what the supervisor changed and why.
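The review cadence and the log format can live in ordinary code around the supervisor. A minimal sketch, assuming a hypothetical `SupervisorLog` helper (the class and method names are illustrative; the thresholds and the log line format are the ones in the prompt above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

INTERVENTIONS = ("redirect", "compact", "handoff")


@dataclass
class SupervisorLog:
    """Review cadence plus intervention log; thresholds mirror the prompt."""
    messages: int = 0
    tool_calls: int = 0
    entries: list = field(default_factory=list)

    def observe(self, messages: int = 0, tool_calls: int = 0) -> bool:
        """Count subagent activity; True means a review is due."""
        self.messages += messages
        self.tool_calls += tool_calls
        return self.messages >= 5 or self.tool_calls >= 10

    def reviewed(self) -> None:
        """Reset the counters after the supervisor has paused to review."""
        self.messages = 0
        self.tool_calls = 0

    def intervene(self, kind: str, reason: str, result: str,
                  now: Optional[datetime] = None) -> str:
        """Append one log line in the prompt's format and return it."""
        if kind not in INTERVENTIONS:
            raise ValueError(f"unknown intervention: {kind}")
        stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
        line = f"[{stamp}] INTERVENTION: {kind} | REASON: {reason} | RESULT: {result}"
        self.entries.append(line)
        return line


log = SupervisorLog()
due = [log.observe(tool_calls=4) for _ in range(3)]  # running total: 4, 8, 12
print(due)  # [False, False, True]
```

Keeping the log outside the model means the record survives even when the supervisor's own context is compacted or replaced.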
Force one logged intervention out of the supervisor.
- Copy the supervisor system prompt above into a fresh Claude chat. Set the subagent’s task to “summarize this 10-page report into 5 bullets.”
- Now feed it a subagent transcript that goes off the rails. Paste this as the subagent’s output: “Bullet 1 done. Actually, before I continue, let me re-read the whole report from the top, then re-read it again to be thorough, then start a glossary of every term…”
- Ask the supervisor: “Given that subagent output, what do you do?”
Expected result: a log line of the form [TIMESTAMP] INTERVENTION: [redirect|compact|handoff] | REASON: … | RESULT: …, with redirect or compact chosen (re-reading forever is drift, not a handoff). If it starts summarizing the report itself, it broke rule 4; that’s the failure to catch.
Stretch. Re-run with a healthy subagent output (“Bullets 1–3 done, on track, 2 to go”). Verify the supervisor does nothing. A supervisor that interrupts a working agent is its own failure mode.
Peer-to-peer (rare, dangerous).
Motto: two agents talking to each other will either converge or spiral; there's no third option.
Sometimes you want two agents that interact directly — a generator and a critic, a buyer and a seller, a player and an opponent. This is the highest-risk pattern. Without strict structure, peer agents will:
- Infinite loop. Each politely passes the question back: "What do you think?" "I think we should consider what YOU think."
- Drift together. Two agents start with different views; after 5 turns they're amplifying each other's wrong assumption.
- Race-condition output. One agent decides while the other is still thinking; the "agreement" is fictional.
If you must use peer-to-peer, use this structure
- Fixed turns. Cap the number of exchanges (e.g., 3 rounds max). Force a decision after.
- Explicit roles, asymmetric goals. "Generator: produce X. Critic: find ONE flaw and stop. Generator: revise, addressing that flaw. Stop."
- Different system prompts, ideally different models. Same model + similar prompt = high agreement bias; both end up at the same wrong answer.
- Final-decision arbiter. A third agent (or human) reads the transcript and picks the answer. Don't let the peers decide their own consensus.
Generator-critic prompt pair
# GENERATOR
You are the generator. You will be given a task. Produce a draft answer.
Constraint: be specific, concrete, take a position.
After my message, draft the answer. Stop after one draft.
# CRITIC
You are the critic. You will be given a draft. Find the SINGLE biggest flaw and explain it in 1-2 sentences.
Do NOT propose a fix. Do NOT find multiple flaws. Pick the one that, if wrong, breaks the whole thing.
Stop after one critique.
# ARBITER (run after both)
You are the arbiter. Read the generator's draft and the critic's critique.
Decide: ACCEPT (the draft holds despite the critique), REVISE (the critique is fatal, run another generator round), or REJECT (the whole approach is wrong; restart).
Output one of those three labels + one sentence of justification. Nothing more.
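The fixed-turn and arbiter rules are easiest to guarantee when the loop itself is code, not a prompt instruction. A minimal sketch, assuming three hypothetical callables standing in for the generator, critic, and arbiter model calls, and a cap of three rounds taken from the "3 rounds max" example above:

```python
from typing import Callable

Agent = Callable[[str], str]
MAX_ROUNDS = 3
LABELS = ("ACCEPT", "REVISE", "REJECT")


def parse_verdict(reply: str) -> str:
    """Take the arbiter's leading label; fail loudly on anything else."""
    label = reply.split()[0].strip(":.,").upper()
    if label not in LABELS:
        raise ValueError(f"arbiter returned no valid label: {reply!r}")
    return label


def run_debate(task: str, generator: Agent, critic: Agent, arbiter: Agent):
    """Return (verdict, final_draft, rounds_used); never exceeds MAX_ROUNDS."""
    draft = generator(task)
    for round_no in range(1, MAX_ROUNDS + 1):
        critique = critic(draft)
        verdict = parse_verdict(arbiter(f"DRAFT:\n{draft}\n\nCRITIQUE:\n{critique}"))
        if verdict != "REVISE":
            return verdict, draft, round_no
        if round_no < MAX_ROUNDS:
            draft = generator(f"{task}\n\nAddress this flaw: {critique}")
    return "REVISE", draft, MAX_ROUNDS  # cap reached: escalate to a human


def stub_generator(prompt: str) -> str:
    return "draft v2" if "Address this flaw" in prompt else "draft v1"


def stub_critic(draft: str) -> str:
    return "no rollback plan"


def stub_arbiter(transcript: str) -> str:
    return "REVISE: the flaw is fatal." if "draft v1" in transcript else "ACCEPT: the draft holds."


print(run_debate("ship feature X?", stub_generator, stub_critic, stub_arbiter))
# ('ACCEPT', 'draft v2', 2)
```

The peers never see each other's raw replies except through the harness, and only the arbiter's label can end the loop early, so neither peer can declare its own consensus.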
Run generator→critic→arbiter and force a decision.
- Pick a small decision you actually face (“should we cache the system prompt here?”, “ship feature X this sprint?”).
- Open chat 1, paste the GENERATOR block, give it the task, take its one draft.
- Open chat 2, paste the CRITIC block, give it that draft, take its one flaw.
- Open chat 3, paste the ARBITER block, give it the draft + the critique.
Expected result: ACCEPT / REVISE / REJECT plus one sentence, nothing more. A decision landed in three turns; no “what do you think?” ping-pong. That bounded exit is the whole point of the pattern.
Stretch. Deliberately break it: skip the arbiter and instead let the generator and critic keep replying to each other. Count how many turns until it converges or stalls. That’s the failure the arbiter exists to prevent.
Context isolation.
Motto: the second agent should know nothing the first agent didn't explicitly hand it.
Most multi-agent failures come from leaky context. Agent A produces 5000 tokens of reasoning + 50 tokens of conclusion. The naive design hands the entire 5050 tokens to Agent B. Agent B is now drowning in A's deliberation and pays too little attention to the conclusion.
The discipline: explicitly construct what each agent receives. Three patterns:
- Summary handoff. Agent A's final step is "summarize your output in ≤200 words." Agent B gets only the summary.
- Field extraction. Agent A returns structured JSON; Agent B receives only the fields it needs.
- Fresh context restart. Agent B starts with no history; only the input prompt + Agent A's distilled output.
Pick one per handoff. Don't mix.
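Field extraction is the easiest of the three to enforce mechanically. A minimal sketch, assuming Agent A returns a JSON object and Agent B needs only three fields (the `extract_handoff` helper is hypothetical and the field names are illustrative):

```python
import json

ALLOWED_FIELDS = ("decision", "reason_1", "reason_2")  # illustrative allowlist


def extract_handoff(agent_a_output: str) -> dict:
    """Parse Agent A's JSON and keep only the allowlisted fields for Agent B."""
    record = json.loads(agent_a_output)
    missing = [key for key in ALLOWED_FIELDS if key not in record]
    if missing:
        raise ValueError(f"Agent A omitted required fields: {missing}")
    return {key: record[key] for key in ALLOWED_FIELDS}


raw = json.dumps({
    "decision": "SQLite",
    "reason_1": "single-writer workload",
    "reason_2": "no server to operate",
    "scratchpad": "5000 tokens of deliberation ...",
})
handoff = extract_handoff(raw)
print(sorted(handoff))  # ['decision', 'reason_1', 'reason_2']
```

Because the allowlist names what passes through instead of what gets stripped, any new field Agent A starts emitting stays out of Agent B's context by default.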
Shrink a leaky handoff with a summary boundary.
- In Claude, ask: “Research whether we should adopt Postgres or SQLite for a 5-person internal tool. Show all your reasoning, step by step, at length.” This is Agent A’s raw output — the full transcript a naive design would forward.
- Note its size: paste it into the token counter (or just count words × ~1.3). Call this N_full.
- Now apply the discipline. Append to A’s output: “Summarize your conclusion in ≤200 words: the decision, the two deciding reasons, and nothing else.” This is what Agent B should receive.
- Count the summary’s tokens. Call this N_summary.
Expected result: N_summary is at least 5× smaller than N_full, and the summary still contains the decision + both reasons (so B loses nothing it needs). That gap is the context bloat you just stopped from leaking downstream.
Stretch. Redo it as field extraction instead: tell A to output strict JSON {decision, reason_1, reason_2}. Compare that token count to your 200-word summary. Structured handoffs are usually the tightest of the three.
Coordination prompts.
Motto: the prompt that ties N agents together is more important than any of their individual prompts.
The coordination layer puts three things into every agent's prompt:
- The goal in one sentence. Repeated to every agent in the system. When workers drift, they drift toward different interpretations of an ambiguous goal.
- The blast-radius rule. What this agent can touch vs not. ("You may only call tools X, Y, Z. You may not access [file pattern]. You may not start a sub-agent.")
- The exit condition. What "done" looks like. ("Stop when you have N items in the output array. Stop when the user's question is answered with confidence ≥4. Stop after 10 tool calls regardless.")
The coordination header (paste at top of every agent's prompt)
## COORDINATION HEADER (this is your contract with the system)
GOAL: [one sentence, the same on every agent's prompt]
YOUR ROLE: [one sentence: what this agent's job is]
BLAST RADIUS: [tools you can call, paths you can touch, agents you can spawn]
STOP WHEN: [explicit exit condition]
HANDOFF FORMAT: [if you produce output for another agent, the exact format]
ESCALATE WHEN: [the condition that means stop and ask a human / supervisor]
## YOUR TASK
[the per-agent instructions go here]
Why the header repeats on every agent: in multi-agent systems, individual agents have no global view. The header gives each one enough context to know when it's about to step outside its role. The "STOP WHEN" and "ESCALATE WHEN" lines are the difference between a system that runs forever and one that knows when it's done.
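Generating the header from one shared record is a simple way to keep the GOAL line identical across agents. A minimal sketch, assuming a hypothetical `CoordinationHeader` dataclass; the example values are borrowed from the codebase-tour case later in this practice and are illustrative only:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CoordinationHeader:
    goal: str
    role: str
    blast_radius: str
    stop_when: str
    handoff_format: str
    escalate_when: str

    def render(self, task: str) -> str:
        """Render the header followed by the per-agent task."""
        return "\n".join([
            "## COORDINATION HEADER (this is your contract with the system)",
            f"GOAL: {self.goal}",
            f"YOUR ROLE: {self.role}",
            f"BLAST RADIUS: {self.blast_radius}",
            f"STOP WHEN: {self.stop_when}",
            f"HANDOFF FORMAT: {self.handoff_format}",
            f"ESCALATE WHEN: {self.escalate_when}",
            "## YOUR TASK",
            task,
        ])


def assert_shared_goal(headers: list) -> None:
    """Raise if any two agents carry a different GOAL line."""
    goals = {header.goal for header in headers}
    if len(goals) != 1:
        raise ValueError(f"GOAL differs across agents: {sorted(goals)}")


auth = CoordinationHeader(
    goal="Explain this folder's purpose so a new engineer can onboard.",
    role="Summarize src/auth/.",
    blast_radius="Read-only access to src/auth/**; no sub-agents.",
    stop_when="One JSON record returned, or after 10 tool calls.",
    handoff_format="JSON: {purpose, entities, exports, depends_on, smells}",
    escalate_when="The folder is empty or unreadable.",
)
# Derive the second worker from the first so the GOAL cannot diverge.
db = replace(auth, role="Summarize migrations/.",
             blast_radius="Read-only access to migrations/**; no sub-agents.")
assert_shared_goal([auth, db])
prompt = auth.render("Summarize the files in src/auth/.")
```

Deriving each worker's header with `replace` changes only the role-specific fields, so the shared goal is copied rather than retyped.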
Fill the coordination header for one real worker.
- Copy the COORDINATION HEADER block above.
- Pick the single worker you understand best (e.g. the “summarize `src/auth/`” worker from your §18.01.02 plan, or the fallback support bot’s “look up the disputed charge” worker).
- Replace every `[bracketed]` slot with a concrete value. `BLAST RADIUS` must name actual tools/paths; `STOP WHEN` must be a condition you could check by looking (a count, a field, a tool-call cap), not “when it’s done.”
- Paste your filled header into Claude and ask: “Is any field still vague or unenforceable? Name it.”
Expected result: your STOP WHEN line states a condition observable from outside the agent (e.g. “3 records in the array” / “after 10 tool calls”), not a feeling.
Stretch. Write the header for a second worker in the same system and confirm the GOAL line is byte-for-byte identical to the first. That shared goal is what keeps independent workers pointed the same way.
Debugging multi-agent failures.
Motto: five failure modes; one cure per mode.
| Failure mode | What you observe | Cure |
|---|---|---|
| Rogue worker | One worker decides to do the orchestrator's job; produces a full answer instead of its sub-task output | Tighter coordination header; explicit "DO NOT" rules; smaller model for workers |
| Drift | By turn 5, the agents have drifted from the original goal; outputs look productive but answer a different question | Repeat the GOAL line at every handoff; supervisor compacts every N steps |
| Deadlock | Agents pass requests to each other indefinitely; never produce final output | Hard turn cap; explicit arbiter agent to break ties; budget-based stop condition |
| Cost explosion | Token usage 10x what you estimated; can't tell which agent caused it | Per-agent token budgets enforced at runtime; observability dashboard per agent role |
| Context leak | Agent B produces output that references something it shouldn't know | Fresh-context restart at handoff; assert allowlist of input fields |
Diagnose three broken transcripts against the table.
- Open `broken_transcripts.txt` (right-click → Save As, or just read it in the tab). It holds Transcripts A, B, C plus an answer key at the bottom.
- Read A, B, and C without scrolling to the key.
- For each, write one word — drift, deadlock, or rogue worker — and copy the matching Cure cell from the table above.
- Now scroll to the ANSWER KEY and compare.
Stretch. Pick the transcript whose cure lives in tool permissions, not the prompt (it’s the rogue worker), and write the one allowlist rule that would have blocked it — e.g. “deny delete, rename, push; single-file write only.”
Cost-control patterns.
Motto: the orchestrator should be the only expensive agent; workers should be small models doing one thing.
Multi-agent costs add up fast. Three patterns that keep them in check:
- Strong orchestrator, cheap workers. The orchestrator (which decides what to do) needs reasoning quality. The workers (which execute) often don't. Use Opus or Sonnet for the orchestrator; Haiku for workers.
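- Cache the orchestrator's system prompt. If the orchestrator's prompt is 4000 tokens and it runs 100 times a day, that is 400K repeated input tokens a day; caching saves roughly $2-5/day on a single endpoint, depending on the model's input price and on how often calls arrive while the cache is still warm.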
- Worker budget enforcement. Pass each worker a max_tokens cap matching its job. A worker that's supposed to return a JSON record doesn't need 8000 output tokens.
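A per-request output cap limits one call; a running tally per agent role catches the case where many small calls add up. A minimal sketch of that runtime accounting, assuming a hypothetical `TokenBudget` class that the harness calls after each model response (the caps shown are invented for illustration):

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent role would exceed its token cap."""


class TokenBudget:
    """Tracks tokens per agent role and refuses spend past each role's cap."""

    def __init__(self, caps: dict) -> None:
        self.caps = dict(caps)
        self.spent = {role: 0 for role in caps}

    def charge(self, role: str, tokens: int) -> int:
        """Record usage and return what is left; raise instead of overspending."""
        if self.spent[role] + tokens > self.caps[role]:
            raise BudgetExceeded(
                f"{role}: {self.spent[role]} + {tokens} > cap {self.caps[role]}"
            )
        self.spent[role] += tokens
        return self.caps[role] - self.spent[role]

    def cost_driver(self) -> str:
        """Name the role that has consumed the most tokens so far."""
        return max(self.spent, key=self.spent.get)


budget = TokenBudget({"orchestrator": 8_000, "worker": 1_000})
budget.charge("orchestrator", 3_000)
remaining = budget.charge("worker", 800)
print(remaining)  # 200
```

Keying the tally by role also answers the question "which agent caused the overrun?" without a separate dashboard.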
The cost monitoring prompt
Given this multi-agent system spec:
  Orchestrator: [model, system prompt size, calls/day]
  Worker A: [model, prompt size, expected output size, parallel count]
  Worker B: [model, prompt size, expected output size, parallel count]
  Worker C: [model, prompt size, expected output size, parallel count]
Estimate:
1. Total tokens per run, broken down by agent
2. Cost per run, with and without orchestrator-prompt caching
3. Daily cost at the volume above
4. Which agent is the cost driver
5. Three specific changes that would cut cost >30% without changing what the system does
Estimate single-agent vs multi-agent cost for one task.
- Take the “audit 10 files” task. Fill the cost-monitoring prompt above with a concrete spec, or use this one: orchestrator Opus, 3K-token system prompt, 1 call; 10 workers Haiku, 1.5K-token prompt each, ~800 output tokens each.
- Compute the multi-agent total by hand: orchestrator tokens + (10 × per-worker tokens). Call it T_multi.
- Now the single-agent version: one Opus agent reading all 10 files in one context — roughly the sum of the same inputs in one long prompt, ~15K input + 3K output. Call it T_single.
- Paste both specs into Claude and ask it to confirm your two totals and name the cost driver.
Expected result: T_single, T_multi, and the ratio T_multi / T_single. Multi-agent uses more tokens (the ratio is > 1): you’re buying parallel speed, not cheapness. The dollar cost can still flip in multi-agent’s favor because the workers are Haiku, not Opus; note whether it does.
Stretch. Apply pattern 2 from this unit (cache the orchestrator’s 3K-token system prompt) and recompute the daily cost at 100 runs/day. Write the before/after dollar figure.
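The hand computation for the fallback spec can be checked in a few lines. A sketch under two stated assumptions: the orchestrator's own output is not given in the spec, so it is counted as zero, and the strong-to-cheap price ratio of 10 is a made-up placeholder, not a real rate (look up current pricing before drawing a dollar conclusion):

```python
def total_tokens(input_tokens: int, output_tokens: int, calls: int = 1) -> int:
    """Tokens consumed by `calls` identical requests."""
    return calls * (input_tokens + output_tokens)


# Multi-agent spec from the exercise.
orchestrator = total_tokens(3_000, 0)         # assumption: output size not specified
workers = total_tokens(1_500, 800, calls=10)  # 10 workers, 1.5K in + ~800 out each
t_multi = orchestrator + workers

# Single-agent spec from the exercise.
t_single = total_tokens(15_000, 3_000)
ratio = t_multi / t_single

# Hypothetical relative pricing: one strong-model token costs as much as
# STRONG_PER_CHEAP cheap-model tokens. Placeholder value, not a quoted price.
STRONG_PER_CHEAP = 10
cost_multi = orchestrator * STRONG_PER_CHEAP + workers   # in cheap-token units
cost_single = t_single * STRONG_PER_CHEAP

print(t_multi, t_single, round(ratio, 2))  # 26000 18000 1.44
print(cost_multi, cost_single)             # 53000 180000
```

With these placeholder numbers the token ratio is above 1 while the cost comparison flips in multi-agent's favor, which is the pattern the exercise asks you to look for with real prices.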
What it looks like in practice.
Motto: the patterns become real when you watch them solve a problem you actually have.
One of the cleanest real-world fits for an orchestrator-worker system: understand a codebase you’ve never seen. 200,000 lines, 1,500 files, three frameworks, no one to ask. A new team member opens it Monday morning and has no idea where to start.
The naïve single-agent approach: hand the agent the repo and ask "explain this." The agent loses the goal at file 30, the context fills at file 80, the answer it produces is a generic restatement of the README. Useless.
The multi-agent design that actually works
Three roles. Strict context isolation. The orchestrator never reads file contents itself.
- Orchestrator (strong model, runs once). Reads the repo’s top-level structure: folder names, package manifests, top-of-tree READMEs. Decomposes into 30-100 worker tasks ("summarise files in `src/auth/`", "summarise the database schema in `migrations/`"). Dispatches in parallel.
- Workers (cheap model, run 30-100x in parallel). Each gets ONE folder or related group of files. Each returns a structured record: {purpose, key entities, dependencies, outward-facing API, smells}. No worker reads more than its slice. Context budget per worker: tight cap.
- Synthesizer (strong model, runs once at the end). Reads all worker records. Builds the dependency graph. Identifies the 5-10 highest-leverage files. Produces a 3-page narrative "how this codebase fits together" plus a "guided tour" ordered by dependency — what to read first, what builds on what.
Cost / latency: single-agent would have used ~500K tokens sequentially over 30 minutes and produced garbage. Multi-agent uses ~1.5M tokens but in parallel over 4-5 minutes — and produces something that actually onboards a human.
The orchestrator prompt for this exact case
# Orchestrator for codebase understanding
You are the orchestrator for a codebase-understanding pipeline.
INPUT: a fresh checkout at [path]. You have read access to all files
but you will NOT read individual file contents yourself.
YOUR JOB:
1. Read only: top-level folder structure, package manifests
(package.json / pyproject / Cargo.toml / etc), top-of-tree
README, and CONTRIBUTING.md if present.
2. Decompose the codebase into 30-100 worker tasks. Each task targets
ONE folder or ONE coherent group of files. Each task gets:
- Worker ID
- Folder/file glob
- Question to answer: "What is this directory's purpose?
Main entities? Outward-facing API? Dependencies on other
parts of the repo?"
- Expected output: structured JSON with keys {purpose, entities,
exports, depends_on, smells}
3. Dispatch all workers in parallel. Wait for all to return. If >20%
time out, retry once; otherwise proceed without them.
4. Hand the merged worker records to the synthesizer agent (NOT to me).
Synthesizer's job is to build the dependency graph, identify
high-leverage files, and produce the 3-page narrative + guided tour.
5. Final output to me: the synthesizer's narrative + the list of
worker tasks that failed (so I know what's unmapped).
DO NOT produce code summaries yourself. You orchestrate; the workers
read; the synthesizer narrates.
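The synthesizer's "guided tour ordered by dependency" is a topological sort over the workers' `depends_on` fields. A minimal sketch using the standard-library `graphlib` module (Python 3.9+), assuming worker records are keyed by folder and use the JSON keys named in the prompt above; the `guided_tour` helper and the sample records are hypothetical:

```python
from graphlib import TopologicalSorter

REQUIRED_KEYS = {"purpose", "entities", "exports", "depends_on", "smells"}


def guided_tour(records: dict) -> list:
    """Order folders so each one appears after the folders it depends on."""
    for folder, record in records.items():
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"{folder}: worker record missing {sorted(missing)}")
    # Ignore dependencies on folders no worker mapped (e.g. timed-out tasks).
    graph = {
        folder: [dep for dep in record["depends_on"] if dep in records]
        for folder, record in records.items()
    }
    # static_order() yields dependencies first; it raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


def make_record(purpose: str, depends_on: list) -> dict:
    """Build a sample worker record with the other fields left empty."""
    return {"purpose": purpose, "entities": [], "exports": [],
            "depends_on": depends_on, "smells": []}


records = {
    "src/api": make_record("HTTP handlers", ["src/auth"]),
    "src/auth": make_record("login and sessions", ["src/db"]),
    "src/db": make_record("schema and queries", []),
}
print(guided_tour(records))  # ['src/db', 'src/auth', 'src/api']
```

Validating the record keys before sorting surfaces rogue or malformed worker output at the handoff, before it reaches the synthesizer's narrative.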
Why this maps cleanly to the decision matrix
- Step 1 (single vs multi): ✓ truly independent sub-tasks (each folder summarised in isolation), ✓ context isolation needed (no one agent can hold all 200K LOC).
- Step 2 (pattern): orchestrator-worker, with a one-off synthesizer at the end.
- Step 3 (coordination header): every worker gets the same GOAL line ("explain this folder’s purpose so a new engineer can onboard"), the same BLAST RADIUS ("only the files in your glob"), and the same HANDOFF FORMAT (JSON).
- Step 4 (isolation): workers return JSON records, not transcripts. Synthesizer gets the records, not the raw file contents.
- Step 5 (cost): orchestrator + synthesizer use the strong model (each runs once). Workers use the cheap model (run dozens of times). Total cost is dominated by parallel worker calls, which is exactly where the cheap model lives.
The general lesson: when you find yourself reaching for multi-agent, look for the “thirty independent things, all the same shape” pattern. Codebase understanding has it. So does: auditing a doc library, researching N companies, drafting variants of a piece of content, classifying a backlog. If your problem looks like “thirty of the same little task,” orchestrator-worker is the right tool. If your problem looks like “one long task with many steps,” reach for supervisor-subagent or stay single-agent.
Run the codebase-tour orchestrator on a repo you know.
- Before you run anything, write down the 3 files you would personally open first for a new hire on a repo you know well. Seal that guess.
- In Claude Code (or Claude with the repo’s file tree pasted in), paste the orchestrator prompt above and set `[path]` to that repo.
- No repo handy? Use a public one: open `github.com/anthropics/anthropic-sdk-python` and paste its top-level file/folder list.
- Let it run (or, if single-session, have it produce the decomposition + a synthesized guided tour). Read the tour’s “read these first” list.
Stretch. If it missed, add one line to the orchestrator’s GOAL (“prioritize the files a new engineer must read on day one”) and re-run. Did the tour’s top-3 move toward your list?
One page on the wall.
The decision tree every multi-agent design should pass through, in one printable page. Pin this above your laptop; reference it before the first commit on any new agent system.
# MULTI-AGENT DESIGN: DECISION MATRIX
## STEP 1: Single or multi?
Build single-agent unless AT LEAST ONE of:
[ ] Task has 2+ truly independent sub-tasks
[ ] Sub-tasks need different system prompts / tools / models
[ ] Long-running, need context isolation
[ ] Need parallel execution for latency
If single-agent works, stop here. Go build that. It will ship faster.
## STEP 2: If multi, which pattern?
- Independent sub-tasks, parallel? → ORCHESTRATOR-WORKER
- One long-running agent that drifts? → SUPERVISOR-SUBAGENT
- Two agents that need to talk to each other? → PEER-TO-PEER (and you'll regret it)
## STEP 3: Coordination header (copy into every agent's prompt)
GOAL: [one sentence, same on every agent]
YOUR ROLE: [one sentence per agent]
BLAST RADIUS: [tools, paths, sub-agents allowed]
STOP WHEN: [explicit exit condition]
HANDOFF FORMAT: [structured output spec]
ESCALATE WHEN: [stop-and-ask condition]
## STEP 4: Context isolation (every handoff)
Pick ONE per agent boundary:
- SUMMARY HANDOFF: A summarizes; B gets only the summary
- FIELD EXTRACTION: A returns JSON; B gets specific fields
- FRESH RESTART: B starts blank; gets only distilled output
NEVER mix. NEVER pass A's full transcript to B.
## STEP 5: Cost guardrails
- Orchestrator = strong model (Opus/Sonnet)
- Workers = cheap model (Haiku) unless quality demands more
- Cache orchestrator system prompt (always)
- Per-worker max_tokens cap matched to expected output
## RED FLAGS
- Orchestrator is producing the substantive answer itself → wrong pattern
- Workers reference each other → fix isolation
- Turn count climbing past your cap → add hard limit or break
- Cost > 2x what you estimated → cost-monitoring prompt + tighten
- Output quality drops as turns climb → context bloat; compact
## FAILURE MODES → CURES
- Rogue worker → tighter coordination header, smaller worker model
- Drift → repeat GOAL line every handoff; supervisor compacts
- Deadlock → hard turn cap + arbiter agent
- Cost explosion → per-agent budgets + dashboard
- Context leak → fresh-context restart at handoff
## EVALUATION QUESTIONS: answer YES or kill the design
1. Could I explain this system to a new engineer in 5 minutes?
2. Do I know which agent caused a bad output, by reading logs alone?
3. Can I test each agent's prompt in isolation?
4. Have I run this end-to-end at the highest expected input?
5. Do I have a fallback if any single agent times out?
A NO on any question = ship-blocker.
Run the matrix end-to-end on your real design.
- Copy the matrix above into a doc (or print it). Use the system from your §18.01.01 decision lab — or the fallback support bot.
- Fill STEP 1–5 for that system: single or multi, which pattern, the coordination header, the isolation choice per handoff, the model per role.
- Answer the five EVALUATION QUESTIONS at the bottom with a literal YES or NO each — no maybes.
Stretch. Pin the filled matrix where you’ll see it, and re-run questions 2 and 4 after your first real end-to-end run — “can I tell which agent caused a bad output from logs alone?” is the one that fails most often in production.