If you can’t trace it, you can’t ship it.
Observability is the difference between “the agent broke and I have no idea why” and “here’s the trace, here’s the fix.” This practice teaches the trace shape every production agent must capture, the four metrics worth tracking, and the Friday-dashboard practice that catches regressions before users do.
- Capture a complete agent trace with the 4 required fields
- Compute the 4 metrics (accuracy / cost / latency / refusal) for any agent
- Set up LLM-as-judge with calibration against human grading
- Run the Friday dashboard review and catch a regression before users do
Why agents need observability.
Traditional software is deterministic; agents are not. The same input can produce a different trace tomorrow. Without observability, debugging is guessing.
Add a trace ID to one agent.
- Generate a UUID at the start of each agent invocation.
- Pass it as metadata.user_id on every Anthropic call (or your own field).
- Log the trace_id alongside every print/log statement in your code.
- Run one task. Grep your logs for the trace_id. Confirm you see every step.
Stretch. Inject trace_id into tool inputs so downstream services log it too. End-to-end trace beats single-system trace.
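The steps above can be sketched with the standard library alone. This is a minimal sketch, not the course's own code: the helper names `new_trace_id` and `make_trace_logger` are hypothetical, and the three logged steps stand in for a real agent run.

```python
import io
import logging
import uuid


def new_trace_id() -> str:
    """Generate a fresh trace ID at the start of one agent invocation."""
    return uuid.uuid4().hex


def make_trace_logger(trace_id: str, stream) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with the given trace_id."""
    logger = logging.getLogger(f"agent.{trace_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
    logger.addHandler(handler)
    # The adapter injects trace_id into every record, so no call site can forget it.
    return logging.LoggerAdapter(logger, {"trace_id": trace_id})


# Stand-in run: in a real agent these calls sit next to each model and tool call.
log_stream = io.StringIO()
trace_id = new_trace_id()
log = make_trace_logger(trace_id, log_stream)
log.info("llm_call started")
log.info("tool_use: web_search")
log.info("end")

# The in-memory equivalent of grepping the logs for the trace_id.
matching = [line for line in log_stream.getvalue().splitlines() if trace_id in line]
print(len(matching), "log lines carry", trace_id)
```

Routing every log call through one adapter is the design choice that matters here: the ID is attached once, at the start of the run, rather than remembered at each call site.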
The trace shape.
A trace is an ordered list of events. Each event is one model call or one tool call. Together they form a tree (when sub-agents dispatch) or a chain.
Validation: this trace shape follows OpenTelemetry’s trace/span model and GenAI span conventions for model calls, tool execution, token usage, latency, and errors: opentelemetry.io/docs/concepts/signals/traces and opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans.
Draw the trace of one real run.
- Pick an agent and run one task end to end.
- On paper or in a text file, write one line per event in the order they happened: llm_call, tool_use: <name>, … ending in end.
- Beside each line, note what it consumed or produced (tokens, tool args, result size).
- Count the events. That count is the length of this run’s trace.
The last line is the final answer / end event, and the number of lines equals (LLM calls + tool calls + 1). A stranger could read it top to bottom and narrate the run.
Stretch. Re-run the SAME task and diff the two traces. Where they differ is exactly the non-determinism this practice exists to tame.
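The length check and the stretch diff can both be done in a few lines. The sketch below assumes each trace is a plain list of event labels in the format used in this exercise; the two runs are invented for illustration.

```python
import difflib


def expected_length(events: list[str]) -> int:
    """LLM calls + tool calls + 1 for the closing end event."""
    llm_calls = sum(e == "llm_call" for e in events)
    tool_calls = sum(e.startswith("tool_use") for e in events)
    return llm_calls + tool_calls + 1


def diff_traces(first: list[str], second: list[str]) -> list[str]:
    """Return only the added ('+ ') and removed ('- ') events between two runs."""
    return [
        line
        for line in difflib.ndiff(first, second)
        if line.startswith(("+ ", "- "))
    ]


# Two invented runs of the same task; the second searched twice.
run_a = ["llm_call", "tool_use: web_search", "llm_call", "end"]
run_b = ["llm_call", "tool_use: web_search", "tool_use: web_search", "llm_call", "end"]

print(expected_length(run_a), len(run_a))
print(diff_traces(run_a, run_b))
```

An empty diff means the two runs took the same path; any `+` or `-` line marks a step where the agent behaved differently on identical input.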
The four required fields.
Every event in the trace must carry these four. Drop one and you can’t debug in production.
| Field | Why |
|---|---|
| run_id | Groups every event of one agent run. Without it, you can’t reconstruct. |
| user_input | Lets you replay the run against a new prompt or model. |
| model_version | So you can correlate regressions with model upgrades. |
| timestamp | Latency math; ordering; debugging timing-sensitive issues. |
Audit one trace for the four fields.
- Open one run’s log. (No agent logging yet? Use the four-line starter block in Unit 04 — same shape.)
- For each event, check it has run_id, user_input, model_version, and timestamp.
- If a field is missing, add it at the logging call site so the next run captures it.
- Re-run one task and re-check.
grep -c '"run_id"' run.jsonl equals the event count (4 for the Unit 04 starter block), and the same holds for the other three keys (the starter block keys the timestamp as t). (Match the quoted key, not bare run_id; otherwise the # runs/<run_id>.jsonl comment line inflates the count by one.)
Stretch. Pick the one field you’d miss most in a 3am outage. For most teams it’s run_id: without it, nothing else is reconstructable.
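When a field is missing, it helps to know which event dropped it. The sketch below is one way to audit all four fields at once; it assumes the timestamp may be keyed either `timestamp` or `t` (as in the starter block), and the sample lines and the model name `example-model` are invented.

```python
import json

REQUIRED = ("run_id", "user_input", "model_version", "timestamp")
# Assumption: the starter block keys its timestamp as "t", so accept either name.
ALIASES = {"timestamp": ("timestamp", "t")}


def missing_fields(event: dict) -> list[str]:
    """List the required fields that this event does not carry."""
    missing = []
    for field in REQUIRED:
        names = ALIASES.get(field, (field,))
        if not any(name in event for name in names):
            missing.append(field)
    return missing


def audit(lines) -> dict[int, list[str]]:
    """Map 1-based event number to its missing fields, skipping blanks and # comments."""
    report = {}
    event_number = 0
    for raw in lines:
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        event_number += 1
        gaps = missing_fields(json.loads(raw))
        if gaps:
            report[event_number] = gaps
    return report


# Invented sample: the second event forgot model_version.
sample = [
    "# runs/<run_id>.jsonl",
    '{"run_id": "r_1", "user_input": "hi", "model_version": "example-model", "t": "2026-05-19T09:14:02", "kind": "llm_call"}',
    '{"run_id": "r_1", "user_input": "hi", "t": "2026-05-19T09:14:04", "kind": "end"}',
]
print(audit(sample))
```

An empty report means every event passed; otherwise the keys point at the logging call sites to fix.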
Local JSONL — the starter.
Before installing any hosted observability stack, log to JSONL. Append-only. One line per event. You can read it with jq, query with duckdb, render with a 30-line Streamlit app.
Every line carries all four required fields from Unit 03 — run_id, user_input, model_version, and the timestamp (keyed t here). Repeating the three identity fields on every line is deliberate: each line is self-describing, so a single grep run_id reconstructs a whole run.
    # runs/<run_id>.jsonl: every event repeats run_id, user_input, model_version
    {"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:02","kind":"llm_call","input_tokens":1248,"output_tokens":87,"latency_ms":620}
    {"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:04","kind":"tool_use","name":"web_search","args":{"q":"urban heat island drivers"},"result_size":3210,"latency_ms":820}
    {"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:09","kind":"llm_call","input_tokens":4521,"output_tokens":350,"latency_ms":1840}
    {"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:11","kind":"end","stop_reason":"end_turn","total_tokens":6206,"total_cost_usd":0.0312}
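A logger that produces lines of this shape can be very small. The following is a minimal sketch rather than a prescribed implementation: the function name `log_event` and the run ID `r_demo` are hypothetical, and the demo writes to a temporary directory instead of a real `runs/` folder.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_event(path: Path, run_id: str, user_input: str,
              model_version: str, kind: str, **fields) -> dict:
    """Append one self-describing event to an append-only JSONL file."""
    event = {
        "run_id": run_id,
        "user_input": user_input,
        "model_version": model_version,
        "t": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "kind": kind,
        **fields,  # per-event extras: token counts, tool args, latency, ...
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event


# Demo run written to a temp directory (hypothetical run_id).
run_file = Path(tempfile.mkdtemp()) / "r_demo.jsonl"
identity = ("r_demo", "brief me on urban heat islands", "claude-sonnet-4-6")
log_event(run_file, *identity, "llm_call", input_tokens=1248, output_tokens=87, latency_ms=620)
log_event(run_file, *identity, "end", stop_reason="end_turn")

rows = [json.loads(line) for line in run_file.read_text(encoding="utf-8").splitlines()]
print(len(rows), "events written")
```

Passing the three identity fields as required positional arguments means a call site cannot emit a line without them.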
Write a JSONL where every line carries the three identity fields.
The three are run_id, user_input, and model_version; prove every line has them with one command. (These are three of the four required fields from Unit 03; the fourth, the timestamp, is keyed t and rides on every line too.)
- Copy the block above into a file named run.jsonl. (No agent yet? This block is your starter: every line already carries all four required fields, so it passes the check below.)
- In your own agent, add the three identity fields to the logger at the call site, so every future line emits them too.
- Run the identity-field check (paste as one line):
    python3 -c "import json; rows=[json.loads(l) for l in open('run.jsonl') if l.strip() and not l.startswith('#')]; req={'run_id','user_input','model_version'}; ok=sum(req<=r.keys() for r in rows); print(ok,'/',len(rows),'events pass')"
- If any line fails, fix the logger (or the JSONL) and re-run until the two numbers match.
The check prints N / N events pass when every non-comment line carries the three identity fields run_id, user_input, and model_version. For the block above it prints 4 / 4 events pass.
Stretch. Add the trace to OpenTelemetry. The same JSON becomes a span attribute, so you reuse it for free.
The four metrics.
From Practice 04 Unit 03, restated: the four numbers worth tracking on every agent.
Define the four metrics for one agent.
- Accuracy: which eval set counts, and what % passing is “good”? (e.g. ≥90% of your 10 cases.)
- Cost: dollars per task. The field to read is total_cost_usd (or tokens × price) from each run’s end event.
- Latency: p50 and p90 wall-clock seconds per task. Note your acceptable ceiling.
- Refusal: % of runs the agent rejected. Distinguish legitimate refusals from over-refusals.
- Write each as a sentence ending in a number you could compute from JSONL tonight.
Stretch. Run eval_harness.py from Practice 04 against your 10 cases — its summary already prints all four (accuracy, total cost, latency p50/p90, refusals). Compare its numbers to your hand-written targets.
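To make "a number you could compute tonight" concrete, here is a minimal sketch that turns per-run summaries into the four metrics. The `passed`, `refused`, and `latency_s` keys are hypothetical (only `total_cost_usd` appears in the starter block), the ten runs are invented, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def four_metrics(runs: list[dict]) -> dict:
    """Accuracy, cost per task, latency p50/p90, and refusal rate for a set of runs."""
    n = len(runs)
    latencies = [r["latency_s"] for r in runs]
    return {
        "accuracy_pct": 100 * sum(r["passed"] for r in runs) / n,
        "cost_per_task_usd": sum(r["total_cost_usd"] for r in runs) / n,
        "latency_p50_s": percentile(latencies, 50),
        "latency_p90_s": percentile(latencies, 90),
        "refusal_pct": 100 * sum(r["refused"] for r in runs) / n,
    }


# Ten invented runs: latencies 1..10 s, one refusal that also counts as a failure.
runs = [
    {"passed": i != 3, "refused": i == 3, "total_cost_usd": 0.03, "latency_s": float(i + 1)}
    for i in range(10)
]
metrics = four_metrics(runs)
print(metrics)
```

Whether a refusal counts against accuracy is a policy decision; this sketch counts it as a failure, so adjust `passed` if your eval treats legitimate refusals as correct.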
LLM-as-judge.
For open-ended outputs — drafts, analyses, decisions — there’s no exact-match grader. Use a stronger model to score the candidate output. Pass / fail with rationale.
    JUDGE_PROMPT = """\
    You are grading an AI agent's response.
    CRITERIA: {criteria}
    USER INPUT: {user_input}
    CANDIDATE OUTPUT: {output}
    Respond with:
    <verdict>PASS or FAIL</verdict>
    <reason>one specific reason in <= 25 words</reason>
    """

    # Calibrate on 30 human-graded cases before trusting it.
    # Aim for >85% agreement with humans on a held-out set.
Calibrate the judge against your own grading.
- Collect 10 agent outputs (your own, or generate 10 with eval_harness.py from Practice 04).
- YOU grade each PASS or FAIL first, before looking at the model. Write the 10 verdicts in a column.
- Run the JUDGE_PROMPT above over the same 10 (fill in {criteria}, {user_input}, {output}). Record its 10 verdicts.
- Compute agreement: agreement = (# rows where yours == judge’s) / 10 × 100%.
Stretch. Apply your one fix and re-run the 10. Agreement should climb. Repeat until you clear 85% — that is the bar for letting the judge grade unattended.
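The agreement arithmetic is easy to automate once the judge's replies are parsed. This sketch assumes the judge answers in the `<verdict>` format from JUDGE_PROMPT; the function names and both verdict columns are invented for illustration.

```python
import re

VERDICT_RE = re.compile(r"<verdict>\s*(PASS|FAIL)\s*</verdict>", re.IGNORECASE)


def parse_verdict(judge_reply: str) -> str:
    """Extract PASS or FAIL from a judge reply, failing loudly if the tag is absent."""
    match = VERDICT_RE.search(judge_reply)
    if match is None:
        raise ValueError("no <verdict> tag found in judge reply")
    return match.group(1).upper()


def agreement_pct(human: list[str], judge: list[str]) -> float:
    """Percentage of rows where the human verdict equals the judge verdict."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return 100 * matches / len(human)


# Invented calibration set: the judge disagrees with the human on rows 3 and 8.
human = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS", "PASS", "FAIL", "PASS", "PASS"]
judge = ["PASS", "PASS", "PASS", "PASS", "FAIL", "PASS", "PASS", "PASS", "PASS", "PASS"]

print(parse_verdict("<verdict>FAIL</verdict>\n<reason>missed the second question</reason>"))
print(agreement_pct(human, judge))
```

With only 10 cases, agreement moves in steps of 10 points, so clearing 85% means matching on at least 9 of the 10.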
Drift & regression detection.
A model upgrade or a prompt change can silently regress quality. A weekly eval run catches it before users do.
Build the regression test for one bug.
- Find a recent failure (Slack screenshot, postmortem, your memory).
- Reproduce: what input triggered it? What output was wrong?
- Encode as a test case in your eval suite. Run — should fail.
- Apply the fix. Run again — should pass.
Stretch. Tag every bug with its regression test. After 6 months, the tag tells you which bugs are protected vs forgotten.
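One way to encode the steps above is a small case record that carries its bug ID with it. Everything in this sketch is hypothetical: the `RegressionCase` type, the bug ID, the two stand-in agents, and the bug itself (a dropped source line) are invented to show the fail-then-pass cycle.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RegressionCase:
    bug_id: str                   # ties the test back to the bug report
    user_input: str               # the input that triggered the failure
    check: Callable[[str], bool]  # True when the output is acceptable


def run_suite(agent: Callable[[str], str], cases: list[RegressionCase]) -> dict[str, bool]:
    """Run every case against the agent and report pass/fail per bug ID."""
    return {case.bug_id: case.check(agent(case.user_input)) for case in cases}


# Invented bug: the agent answered without naming where the answer came from.
case = RegressionCase(
    bug_id="BUG-001",
    user_input="brief me on urban heat islands",
    check=lambda output: "Source:" in output,
)


def buggy_agent(user_input: str) -> str:
    """Stand-in for the agent before the fix."""
    return "Cities run warmer than their surroundings."


def fixed_agent(user_input: str) -> str:
    """Stand-in for the agent after the fix."""
    return "Cities run warmer than their surroundings. Source: web_search result 1"


before = run_suite(buggy_agent, [case])
after = run_suite(fixed_agent, [case])
print(before, after)
```

Keying results by `bug_id` is what makes the stretch goal cheap: the suite's output already lists which bugs are protected.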
The hosted tools.
When local JSONL stops scaling, these four are popular picks. Each integrates with the Anthropic SDK in ~10 lines.
| Tool | Pick when |
|---|---|
| LangSmith | You’re already on LangChain. |
| Langfuse | You want open-source · self-hostable. |
| Helicone | You want a proxy / drop-in with no SDK change. |
| Datadog / Honeycomb | You already have observability infra · use OTel. |
Ingest one run and pull it up by run_id.
- Save the four-line block from Unit 04 as run.jsonl (every event already carries run_id). This stands in for a hosted store’s ingest.
- Pick one hosted tool from the table and note its “filter by attribute” feature: that is the production version of this query. (No account needed for the verify below.)
- Retrieve the run by id (paste as one line):
    python3 -c "import json; rid='r_8f3a'; ev=[json.loads(l) for l in open('run.jsonl') if l.strip() and not l.startswith('#') and json.loads(l).get('run_id')==rid]; print(len(ev),'events for',rid); [print(e['kind']) for e in ev]"
- Confirm the events come back in order, ending in end.
The command prints 4 events for r_8f3a followed by llm_call / tool_use / llm_call / end. You reconstructed a whole run from its id alone, which is exactly what a hosted tool’s trace-by-id lookup gives you.
Stretch. Sign up for one of the four (Langfuse self-hosts free), POST these four events, and pull the same run up in its UI by filtering on run_id.
OpenTelemetry for agents.
If your company already runs OTel, agents become spans. One span.set_attribute per model call, one per tool call. You inherit the dashboards, alerting, retention.
    from opentelemetry import trace

    tracer = trace.get_tracer("agent")

    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.name", "research-agent")
        span.set_attribute("agent.user_input", user_q)
        for turn in range(MAX_LOOPS):
            with tracer.start_as_current_span("llm.call") as llm_span:
                resp = client.messages.create(...)
                llm_span.set_attribute("llm.model", MODEL)
                llm_span.set_attribute("llm.input_tokens", resp.usage.input_tokens)
                llm_span.set_attribute("llm.output_tokens", resp.usage.output_tokens)
            # ... tool dispatch with its own span ...
Spin up OTel on a local agent.
- Install opentelemetry-api and opentelemetry-sdk in your agent’s env, plus an OTLP exporter (for example opentelemetry-exporter-otlp) so spans can leave the process.
- Wrap agent runs in tracer.start_as_current_span("agent.run"); the code above is the skeleton.
- Wrap each LLM call in a nested span; set attributes for model + token counts.
- Run Jaeger locally (docker run -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one; recent images accept OTLP on port 4317), point the exporter at localhost:4317, and open the trace at localhost:16686.
You should see an agent.run span with nested llm.call spans under it. The longest bar is your slowest step, and it is visually obvious.
Stretch. Add tool calls as nested spans too. The trace then tells the agent’s whole story, with model and tools on one timeline.
The debugging workflow.
When a user reports a bad output, you should be able to answer three questions in under five minutes.
- What did the user ask? The first event in the trace.
- What did the model do? Every LLM call + tool call in order.
- Where did it go wrong? The first divergence from expected behavior.
If you can’t answer all three in five minutes, your observability isn’t there yet. Iterate on the trace schema until you can.
Run the 5-minute debug drill.
- Open the complaint. Grab its run_id (you have one because of Unit 01).
- What did the user ask? Find the user’s input in the trace. (< 1 min.)
- What did the model do? Read every LLM + tool call in order. (< 2 min.)
- Where did it go wrong? Mark the first divergence from expected behavior. (< 2 min.)
Stretch. Run this drill weekly on a fresh complaint. The 5-minute limit is the habit that separates teams who debug from teams who guess.
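The three questions map onto three small lookups over a run's events. This is a sketch under stated assumptions: events are dicts shaped like the JSONL starter block, "expected behavior" is written down as a list of step labels (a hypothetical convention), and the `fetch_page` tool is invented.

```python
def steps(events: list[dict]) -> list[str]:
    """What did the model do? One label per event, in order."""
    return [
        f"tool_use: {e['name']}" if e["kind"] == "tool_use" else e["kind"]
        for e in events
    ]


def first_divergence(expected: list[str], actual: list[str]):
    """Where did it go wrong? Index of the first differing step, or None if identical."""
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Events for one run, already filtered by run_id (shape follows the starter block).
events = [
    {"run_id": "r_8f3a", "user_input": "brief me on urban heat islands", "kind": "llm_call"},
    {"run_id": "r_8f3a", "user_input": "brief me on urban heat islands", "kind": "tool_use", "name": "web_search"},
    {"run_id": "r_8f3a", "user_input": "brief me on urban heat islands", "kind": "llm_call"},
    {"run_id": "r_8f3a", "user_input": "brief me on urban heat islands", "kind": "end"},
]
# Hypothetical expectation: the agent should have fetched a page after searching.
expected = ["llm_call", "tool_use: web_search", "tool_use: fetch_page", "llm_call", "end"]

asked = events[0]["user_input"]          # What did the user ask?
did = steps(events)                      # What did the model do?
broke_at = first_divergence(expected, did)  # Where did it go wrong?
print(asked, did, broke_at, sep="\n")
```

Here the divergence index points at the step where the agent skipped the page fetch and went straight back to the model.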
Alerting — what to page on.
Pages should fire only on real signals. Three alerts every production agent should have.
| Alert | Threshold |
|---|---|
| Cost spike | Daily spend > 2x rolling 7-day average. |
| Refusal spike | Refusal rate > 2x baseline for 1 hour. |
| Eval regression | Weekly eval pass-rate drops more than 5 points. |
Avoid the temptation to alert on every metric. False pages train people to ignore real pages.
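Each threshold in the table reduces to a one-line comparison. The sketch below is one reading of them, not a definitive rule set: the function names are hypothetical, and it assumes the rolling 7-day average covers the seven days before today rather than including today.

```python
def cost_spike(daily_spend: list[float]) -> bool:
    """True when the latest day's spend exceeds 2x the average of the 7 days before it."""
    if len(daily_spend) < 8:
        return False  # not enough history to form a 7-day baseline
    *history, today = daily_spend[-8:]
    return today > 2 * (sum(history) / len(history))


def refusal_spike(rate_last_hour: float, baseline_rate: float) -> bool:
    """True when the past hour's refusal rate exceeds 2x the baseline rate."""
    return rate_last_hour > 2 * baseline_rate


def eval_regression(last_week_pass_pct: float, this_week_pass_pct: float) -> bool:
    """True when the weekly eval pass rate dropped by more than 5 points."""
    return last_week_pass_pct - this_week_pass_pct > 5


# Invented numbers: a flat $10/day week followed by a $25 day.
print(cost_spike([10.0] * 7 + [25.0]))
print(refusal_spike(0.09, 0.04))
print(eval_regression(92.0, 86.0))
```

All three use strict comparisons, so a value sitting exactly on the threshold does not page anyone.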
Wire one alert and trip it.
- Pick one: cost spike, refusal spike, or eval regression (start with the one that would hurt most).
- Set the threshold from the table — e.g. daily spend > 2× the rolling 7-day average.
- Wire it to a real channel (PagerDuty / Slack webhook / email).
- Force a breach: temporarily drop the threshold below the current value so the condition is true now.
Stretch. Put a runbook link in the alert payload. Future-you at 3am will not remember the dashboard URL — the page should carry it.
The Friday dashboard.
Every Friday afternoon, 15 minutes. Open the dashboard. Walk the four metrics across the week. Note what changed and why. Sketch one experiment for next week.
This is the practice that separates teams that ship agents from teams that maintain them. The teams that don’t look at the dashboard ship one good week and then a year of unmeasured regressions.
Stand up the Friday dashboard.
- Pick a dashboard tool you already use (Grafana, Datadog, your observability product).
- Add one panel per metric from Unit 05: accuracy, cost/task, p50·p90 latency, refusal rate — each trended over the week.
- Add one big graph of your north-star metric (success rate or eval score).
- Link the dashboard from your alert payload and your runbook, then do one real Friday walk: note what changed and one experiment for next week.
Stretch. Add a ‘recent deploys’ annotation line. Most regressions correlate with a deploy — the annotation turns “when did this break?” into a glance.
One eval that catches most regressions.
Most agent regressions aren’t “wrong answer”; they’re “right answer for the wrong reason” or “right answer in 3x the tokens.” The LLM-as-judge pattern catches both. This prompt is the template you adapt per agent.
# LLM-as-judge eval — adapt per agent task type
You are evaluating an AI agent's response. You DID NOT write the response — you're grading it.
THE TASK the agent was given:
[paste the original task / prompt]
THE AGENT'S RESPONSE:
[paste the response]
Grade on these 5 dimensions (1-5 each, with one sentence justification):
1. CORRECTNESS — did the response do the actual task?
2. EFFICIENCY — could it have done the task in fewer steps / tokens?
3. SAFETY — did it surface caveats, refuse appropriately, avoid hallucinating?
4. CLARITY — would a non-expert understand the answer?
5. REUSABILITY — could this response be used as a template for similar tasks?
Format your output as JSON:
{
"correctness": {"score": N, "why": "..."},
"efficiency": {"score": N, "why": "..."},
"safety": {"score": N, "why": "..."},
"clarity": {"score": N, "why": "..."},
"reusability": {"score": N, "why": "..."},
"overall_pass": true/false,
"single_biggest_issue": "..."
}
If overall_pass is false, single_biggest_issue must be specific enough that the next iteration can fix it without guessing.
The "single_biggest_issue" field is the unlock. Without it, judges produce essays nobody reads. Forcing one issue per failed eval forces the judge to prioritize, which is the signal you need for the next iteration.
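Before acting on a judge's reply, it is worth checking that it actually follows the contract in the template. The validator below is a minimal sketch with a hypothetical function name; it assumes the judge returns bare JSON with no surrounding prose, and the two sample replies are invented.

```python
import json

DIMENSIONS = ("correctness", "efficiency", "safety", "clarity", "reusability")


def validate_judgement(raw: str) -> dict:
    """Parse a judge reply and enforce the template's rules; raise ValueError on violations."""
    data = json.loads(raw)
    for dim in DIMENSIONS:
        entry = data.get(dim)
        score = entry.get("score") if isinstance(entry, dict) else None
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dim}: score must be an integer from 1 to 5")
    if not isinstance(data.get("overall_pass"), bool):
        raise ValueError("overall_pass must be true or false")
    issue = str(data.get("single_biggest_issue", "")).strip()
    if not data["overall_pass"] and not issue:
        raise ValueError("a failed eval must name its single biggest issue")
    return data


def sample_reply(overall_pass: bool, issue: str) -> str:
    """Build an invented judge reply with every dimension scored 4."""
    body = {dim: {"score": 4, "why": "illustrative"} for dim in DIMENSIONS}
    body["overall_pass"] = overall_pass
    body["single_biggest_issue"] = issue
    return json.dumps(body)


good = validate_judgement(sample_reply(False, "Used 3 searches where 1 would do."))
print(good["single_biggest_issue"])
```

Rejecting a failed eval that names no issue enforces the rule stated in the prompt itself, so a vague judgement never reaches the next iteration.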