3 days · 12 units
Pradhya · Practice 14 · Agent Observability · Builder

If you can’t trace it, you can’t ship it.

Observability is the difference between “the agent broke and I have no idea why” and “here’s the trace, here’s the fix.” This practice teaches the trace shape every production agent must capture, the four metrics worth tracking, and the Friday-dashboard practice that catches regressions before users do.

Audience: Builders running agents past 10 users
Length: 3 sessions · 90 min each
Walk-away: A working dashboard for your agent
Prereq: The Agents Practice or equivalent
What you’ll be able to do by the end
  • Capture a complete agent trace with the 4 required fields
  • Compute the 4 metrics (accuracy / cost / latency / refusal) for any agent
  • Set up LLM-as-judge with calibration against human grading
  • Run the Friday dashboard review and catch a regression before users do
§ 14.01.01 · Unit 01

Why agents need observability.

Traditional software is deterministic; agents are not. The same input can produce a different trace tomorrow. Without observability, debugging is guessing.

Traditional service: request → deterministic code → response (stack trace = answer)
Agent: request → model decides → tools → more model → branching tree (trace = answer)
Stack traces solve services · agent traces solve agents

Add a trace ID to one agent.

You’ll do
Pick an agent. Stamp every call with a UUID. Log it everywhere.
Steps
  1. Generate a UUID at the start of each agent invocation.
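  2. Pass it along with every Anthropic call: in your own field, or in metadata.user_id if you are not already using that for a user identifier (the API intends it as an opaque end-user ID).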
  3. Log the trace_id alongside every print/log statement in your code.
  4. Run one task. Grep your logs for the trace_id. Confirm you see every step.
Verify
You can reconstruct one full agent run from logs alone, given just the trace_id.

Stretch. Inject trace_id into tool inputs so downstream services log it too. End-to-end trace beats single-system trace.
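
The steps above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed implementation: new_trace_logger and run_agent are hypothetical names, and the agent loop is a stub that only shows where the model and tool calls would sit.

```python
import io
import logging
import uuid


def new_trace_logger(stream):
    """Create a trace_id and a logger that stamps it on every line."""
    trace_id = uuid.uuid4().hex  # one id per agent invocation
    logger = logging.getLogger(f"agent.{trace_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # LoggerAdapter injects trace_id into every record it emits.
    return trace_id, logging.LoggerAdapter(logger, {"trace_id": trace_id})


def run_agent(task, stream):
    """Hypothetical agent loop: only the logging pattern matters here."""
    trace_id, log = new_trace_logger(stream)
    log.info("start task=%r", task)
    log.info("llm_call")                  # your model call goes here
    log.info("tool_use name=web_search")  # your tool call goes here
    log.info("end")
    return trace_id


buffer = io.StringIO()
trace_id = run_agent("brief me on urban heat islands", buffer)
# "Grep" the log for the trace_id: every step of the run should match.
matching = [line for line in buffer.getvalue().splitlines() if trace_id in line]
print(len(matching), "lines for", trace_id)
```

Putting the ID in the formatter rather than in each message means a new log statement cannot forget it.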

§ 14.01.02 · Unit 02

The trace shape.

A trace is an ordered list of events. Each event is one model call or one tool call. Together they form a tree (when sub-agents dispatch) or a chain.

LLM → tool: search → LLM → tool: fetch ×3 → LLM → final answer (each box = one logged event)
A trace · ordered LLM + tool events · per agent run

Validation: this trace shape follows OpenTelemetry’s trace/span model and GenAI span conventions for model calls, tool execution, token usage, latency, and errors: opentelemetry.io/docs/concepts/signals/traces and opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans.

Draw the trace of one real run.

You’ll do
Take one agent run and write out its ordered list of events — one line per LLM call or tool call.
Steps
  1. Pick an agent and run one task end to end.
  2. On paper or in a text file, write one line per event in the order they happened: llm_call, tool_use: <name>, … ending in end.
  3. Beside each line, note what it consumed or produced (tokens, tool args, result size).
  4. Count the events. That count is the length of this run’s trace.
Verify
Your list ends in exactly one final answer / end event, and the number of lines equals (LLM calls + tool calls + 1). A stranger could read it top-to-bottom and narrate the run.

Stretch. Re-run the SAME task and diff the two traces. Where they differ is exactly the non-determinism this practice exists to tame.
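
The Verify rule and the Stretch diff can both be checked mechanically. A small sketch, assuming each event is written as a plain string in the format from the steps; run_a and run_b are made-up traces, and check_trace and diff_traces are hypothetical helper names.

```python
import difflib

# Two hypothetical runs of the same task, one line per event.
run_a = ["llm_call", "tool_use: web_search", "llm_call", "end"]
run_b = ["llm_call", "tool_use: web_search", "tool_use: fetch", "llm_call", "end"]


def check_trace(events):
    """True if the trace ends in exactly one end event and the count adds up."""
    llm_calls = sum(e == "llm_call" for e in events)
    tool_calls = sum(e.startswith("tool_use") for e in events)
    ends_once = events.count("end") == 1 and events[-1] == "end"
    return ends_once and len(events) == llm_calls + tool_calls + 1


def diff_traces(a, b):
    """Return only the lines that differ between two traces."""
    return [
        line for line in difflib.ndiff(a, b)
        if line.startswith(("+ ", "- "))
    ]


print(check_trace(run_a), check_trace(run_b))
print(diff_traces(run_a, run_b))
```

Here the second run made one extra tool call, so the diff is a single added line.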

§ 14.01.03 · Unit 03

The four required fields.

Every event in the trace must carry these four. Drop one and you can’t debug in production.

Field            Why
run_id           Groups every event of one agent run. Without it, you can’t reconstruct.
user_input       Lets you replay the run against a new prompt or model.
model_version    So you can correlate regressions with model upgrades.
timestamp        Latency math; ordering; debugging timing-sensitive issues.

Audit one trace for the four fields.

You’ll do
Take one logged run and confirm every event carries all four required fields — or find the one that’s missing.
Steps
  1. Open one run’s log. (No agent logging yet? Use the four-line starter block in Unit 04 — same shape.)
  2. For each event, check it has run_id, user_input, model_version, and timestamp.
  3. If a field is missing, add it at the logging call site so the next run captures it.
  4. Re-run one task and re-check.
Verify
Every event in the run has all four fields. Concretely: grep -c '"run_id"' run.jsonl equals the event count (4 for the Unit 04 starter block), and the same holds for the other three keys. The starter block keys its timestamp as t, so grep for '"t"' there rather than '"timestamp"'. (Match the quoted key, not bare run_id; otherwise the # runs/<run_id>.jsonl comment line inflates the count by one.)

Stretch. Pick the one field you’d miss most in a 3am outage. For most teams it’s run_id — without it, nothing else is reconstructable.
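
If grep counts get tedious, the audit can be scripted. This is a sketch under one stated assumption: the timestamp may be keyed either timestamp or t. The helper names missing_fields and audit, and the sample lines, are illustrative only.

```python
import json

REQUIRED = ("run_id", "user_input", "model_version")
# Assumption: the timestamp may be keyed "timestamp" or "t".
TIMESTAMP_KEYS = ("timestamp", "t")


def missing_fields(event):
    """Return the required fields this event lacks (empty list = passes)."""
    missing = [k for k in REQUIRED if k not in event]
    if not any(k in event for k in TIMESTAMP_KEYS):
        missing.append("timestamp")
    return missing


def audit(lines):
    """Map 1-based event number -> missing fields, skipping blanks and # comments."""
    problems = {}
    events = [l for l in lines if l.strip() and not l.startswith("#")]
    for n, line in enumerate(events, start=1):
        gaps = missing_fields(json.loads(line))
        if gaps:
            problems[n] = gaps
    return problems


sample = [
    "# comment line",
    '{"run_id":"r_1","user_input":"hi","model_version":"m","t":"2026-05-19T09:14:02","kind":"llm_call"}',
    '{"run_id":"r_1","user_input":"hi","kind":"end"}',
]
print(audit(sample))
```

An empty result means every event passes; otherwise the keys tell you which event to fix at the logging call site.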

§ 14.01.04 · Unit 04 · Hands-on

Local JSONL — the starter.

Before installing any hosted observability stack, log to JSONL. Append-only. One line per event. You can read it with jq, query with duckdb, render with a 30-line Streamlit app.

Every line carries all four required fields from Unit 03: run_id, user_input, model_version, and the timestamp (keyed t here). Repeating the three identity fields on every line is deliberate: each line is self-describing, so a single grep for the run_id value reconstructs a whole run.

# runs/<run_id>.jsonl  — every event repeats run_id, user_input, model_version
{"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:02","kind":"llm_call","input_tokens":1248,"output_tokens":87,"latency_ms":620}
{"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:04","kind":"tool_use","name":"web_search","args":{"q":"urban heat island drivers"},"result_size":3210,"latency_ms":820}
{"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:09","kind":"llm_call","input_tokens":4521,"output_tokens":350,"latency_ms":1840}
{"run_id":"r_8f3a","user_input":"brief me on urban heat islands","model_version":"claude-sonnet-4-6","t":"2026-05-19T09:14:11","kind":"end","stop_reason":"end_turn","total_tokens":6206,"total_cost_usd":0.0312}

Write a JSONL where every line carries the three identity fields.

You’ll do
Produce a run log where every event repeats the three identity fields — run_id, user_input, and model_version — then prove it with one command. (These are three of the four required fields from Unit 03; the fourth, the timestamp, is keyed t and rides on every line too.)
Steps
  1. Copy the block above into a file named run.jsonl. (No agent yet? This block is your starter — every line already carries all four required fields, so it passes the check below.)
  2. In your own agent, add the three identity fields to the logger at the call site, so every future line emits them too.
  3. Run the identity-field check (paste as one line):
    python3 -c "import json; rows=[json.loads(l) for l in open('run.jsonl') if l.strip() and not l.startswith('#')]; req={'run_id','user_input','model_version'}; ok=sum(req<=r.keys() for r in rows); print(ok,'/',len(rows),'events pass')"
  4. If any line fails, fix the logger (or the JSONL) and re-run until the two numbers match.
Verify
The check prints N / N events pass — every non-comment line carries the three identity fields run_id, user_input, and model_version. For the block above it prints 4 / 4 events pass.

Stretch. Add the trace to OpenTelemetry. The same JSON becomes a span attribute — reuse for free.
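
For step 2 of the exercise, one way to guarantee the identity fields at the call site is to bind them once per run. A minimal sketch: RunLogger is a hypothetical class, the run ID r_demo is made up, and timestamps are written in UTC under the key t.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


class RunLogger:
    """Appends one JSON line per event, repeating the identity fields each time."""

    def __init__(self, path, run_id, user_input, model_version):
        self.path = path
        self.identity = {
            "run_id": run_id,
            "user_input": user_input,
            "model_version": model_version,
        }

    def log(self, kind, **fields):
        event = {
            **self.identity,
            "t": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "kind": kind,
            **fields,
        }
        with open(self.path, "a", encoding="utf-8") as f:  # append-only
            f.write(json.dumps(event) + "\n")


path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
log = RunLogger(path, "r_demo", "brief me on urban heat islands", "claude-sonnet-4-6")
log.log("llm_call", input_tokens=1248, output_tokens=87, latency_ms=620)
log.log("tool_use", name="web_search", latency_ms=820)
log.log("end", stop_reason="end_turn")

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
required = {"run_id", "user_input", "model_version", "t"}
print(len(rows), "events;", all(required <= r.keys() for r in rows))
```

Because the identity fields live in the logger rather than in each call, a new event type cannot silently drop them.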

§ 14.02.01 · Unit 05

The four metrics.

From Practice 04 Unit 03, restated: the four numbers worth tracking on every agent.

accuracy: % evals passing · cost: $ per task · latency: p50 / p90 sec · refusal: % rejected
Four numbers · per agent · per day

Define the four metrics for one agent.

You’ll do
Pick one agent. Write a one-line, log-computable definition for each of accuracy, cost, latency, and refusal.
Steps
  1. Accuracy: which eval set counts, and what % passing is “good”? (e.g. ≥90% of your 10 cases.)
  2. Cost: dollars per task — the field to read is total_cost_usd (or tokens × price) from each run’s end event.
  3. Latency: p50 and p90 wall-clock seconds per task. Note your acceptable ceiling.
  4. Refusal: % of runs the agent rejected. Distinguish legitimate refusals from over-refusals.
  5. Write each as a sentence ending in a number you could compute from JSONL tonight.
Verify
You have exactly four lines, each naming the log field it reads and a target number. Each is a query you could run nightly — no human judgement in the loop.

Stretch. Run eval_harness.py from Practice 04 against your 10 cases — its summary already prints all four (accuracy, total cost, latency p50/p90, refusals). Compare its numbers to your hand-written targets.
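
To make "a query you could run nightly" concrete, here is a sketch that computes all four numbers from per-run summaries. The fields passed, refused, and latency_s are assumptions for illustration; only total_cost_usd comes from the steps above. The sample data is synthetic.

```python
import statistics


def four_metrics(runs):
    """Compute accuracy %, cost per task, latency p50/p90, and refusal %."""
    n = len(runs)
    latencies = sorted(r["latency_s"] for r in runs)
    # quantiles(n=10) returns 9 cut points (p10..p90); it needs >= 2 data points.
    deciles = statistics.quantiles(latencies, n=10)
    return {
        "accuracy_pct": 100 * sum(r["passed"] for r in runs) / n,
        "cost_per_task_usd": sum(r["total_cost_usd"] for r in runs) / n,
        "latency_p50_s": deciles[4],
        "latency_p90_s": deciles[8],
        "refusal_pct": 100 * sum(r["refused"] for r in runs) / n,
    }


# Ten synthetic runs: one failed eval, one refusal, latencies of 1..10 seconds.
runs = [
    {"passed": i != 3, "refused": i == 7, "total_cost_usd": 0.02, "latency_s": float(i + 1)}
    for i in range(10)
]
metrics = four_metrics(runs)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Percentile definitions vary between tools; this uses the default (exclusive) method of statistics.quantiles, so expect small differences from your dashboard on tiny samples.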

§ 14.02.02 · Unit 06

LLM-as-judge.

For open-ended outputs — drafts, analyses, decisions — there’s no exact-match grader. Use a stronger model to score the candidate output. Pass / fail with rationale.

JUDGE_PROMPT = """\
You are grading an AI agent's response.

CRITERIA:
{criteria}

USER INPUT:
{user_input}

CANDIDATE OUTPUT:
{output}

Respond with:
<verdict>PASS or FAIL</verdict>
<reason>one specific reason in <= 25 words</reason>
"""

# Calibrate on 30 human-graded cases before trusting it.
# Aim for >85% agreement with humans on a held-out set.

Calibrate the judge against your own grading.

You’ll do
Hand-grade 10 traces PASS/FAIL, run the judge prompt over the same 10, and compute how often they agree.
Steps
  1. Collect 10 agent outputs (your own, or generate 10 with eval_harness.py from Practice 04).
  2. YOU grade each PASS or FAIL first — before looking at the model. Write the 10 verdicts in a column.
  3. Run the JUDGE_PROMPT above over the same 10 (fill in {criteria}, {user_input}, {output}). Record its 10 verdicts.
  4. Compute agreement: agreement = (# rows where yours == judge’s) / 10 × 100%.
Verify
You have one number: the agreement %. If it is below 85%, write down ONE concrete change to the judge prompt (sharpen a criterion, add an example of a borderline FAIL, tighten the verdict format) before you trust the judge.

Stretch. Apply your one fix and re-run the 10. Agreement should climb. Repeat until you clear 85% — that is the bar for letting the judge grade unattended.
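
The agreement calculation from step 4 can be sketched as below. It assumes the judge replies in the <verdict> format requested by JUDGE_PROMPT; the replies here are hand-written stand-ins, not real model output, and parse_verdict and agreement_pct are hypothetical helper names.

```python
import re
from typing import List, Optional

VERDICT_RE = re.compile(r"<verdict>\s*(PASS|FAIL)\s*</verdict>")


def parse_verdict(judge_reply: str) -> Optional[str]:
    """Extract PASS or FAIL from a judge reply; None if the format is broken."""
    match = VERDICT_RE.search(judge_reply)
    return match.group(1) if match else None


def agreement_pct(human: List[str], judge: List[Optional[str]]) -> float:
    """Percentage of rows where the human verdict equals the judge verdict."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty verdict lists")
    matches = sum(h == j for h, j in zip(human, judge))
    return 100 * matches / len(human)


human = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS", "PASS", "FAIL", "PASS", "PASS"]
# Stand-in judge replies in the format the prompt asks for.
judge_replies = [
    f"<verdict>{v}</verdict>\n<reason>stub</reason>"
    for v in ["PASS", "PASS", "FAIL", "PASS", "PASS", "PASS", "PASS", "FAIL", "FAIL", "PASS"]
]
judge = [parse_verdict(reply) for reply in judge_replies]
print(f"agreement: {agreement_pct(human, judge):.0f}%")
```

A reply that breaks the verdict format parses to None and counts as a disagreement, which keeps format drift from inflating the score.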

§ 14.02.03 · Unit 07

Drift & regression detection.

A model upgrade or a prompt change can silently regress quality. A weekly eval run catches it before users do.

[Chart: weekly eval accuracy from w1 to w12, with one drop flagged as a regression]
Weekly eval line · drops catch silent regressions

Build the regression test for one bug.

You’ll do
Pick a real bug your agent had. Write the eval case that would have caught it.
Steps
  1. Find a recent failure (Slack screenshot, postmortem, your memory).
  2. Reproduce: what input triggered it? What output was wrong?
  3. Encode as a test case in your eval suite. Run — should fail.
  4. Apply the fix. Run again — should pass.
Verify
The regression test reliably catches the bug. If you intentionally revert the fix, the test fails.

Stretch. Tag every bug with its regression test. After 6 months, the tag tells you which bugs are protected vs forgotten.
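
One possible shape for such a regression case is sketched below. Everything here is hypothetical: EvalCase and run_case are illustrative names, the bug is invented, and the two agent functions are stubs standing in for the pre-fix and post-fix behaviour of your real agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One regression case: the triggering input plus a pass/fail check."""
    name: str
    user_input: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_case(case: EvalCase, agent: Callable[[str], str]) -> bool:
    """Run the agent on the case input and apply the check to its output."""
    return case.check(agent(case.user_input))


# Hypothetical bug: the agent returned an empty answer when search found nothing.
empty_answer_case = EvalCase(
    name="bug-empty-answer-on-no-results",
    user_input="brief me on a topic with no search hits",
    check=lambda output: bool(output.strip()),
)


def buggy_agent(user_input: str) -> str:
    return ""  # stand-in for the pre-fix behaviour


def fixed_agent(user_input: str) -> str:
    return "I found no sources for that topic."  # stand-in for the post-fix behaviour


print("before fix:", run_case(empty_answer_case, buggy_agent))
print("after fix:", run_case(empty_answer_case, fixed_agent))
```

Naming the case after the bug keeps the link between the failure and its test visible in the eval report.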

§ 14.02.04 · Unit 08

The hosted tools.

When local JSONL stops scaling, these four are popular picks. Each typically integrates with the Anthropic SDK in about 10 lines.

Tool                   Pick when
LangSmith              You’re already on LangChain.
Langfuse               You want open-source · self-hostable.
Helicone               You want a proxy / drop-in with no SDK change.
Datadog / Honeycomb    You already have observability infra · use OTel.

Ingest one run and pull it up by run_id.

You’ll do
Take the JSONL from Unit 04, treat it as your ingested store, and retrieve one whole run given only its run_id.
Steps
  1. Save the four-line block from Unit 04 as run.jsonl (every event already carries run_id). This stands in for a hosted store’s ingest.
  2. Pick one hosted tool from the table and note its “filter by attribute” feature — that is the production version of this query. (No account needed for the verify below.)
  3. Retrieve the run by id (paste as one line):
    python3 -c "import json; rid='r_8f3a'; ev=[json.loads(l) for l in open('run.jsonl') if l.strip() and not l.startswith('#') and json.loads(l).get('run_id')==rid]; print(len(ev),'events for',rid); [print(e['kind']) for e in ev]"
  4. Confirm the events come back in order, ending in end.
Verify
The query prints 4 events for r_8f3a followed by llm_call / tool_use / llm_call / end. You reconstructed a whole run from its id alone — exactly what a hosted tool’s trace-by-id lookup gives you.

Stretch. Sign up for one of the four (Langfuse self-hosts free), POST these four events, and pull the same run up in its UI by filtering on run_id.

§ 14.03.01 · Unit 09

OpenTelemetry for agents.

If your company already runs OTel, agents become spans: one span per model call, one per tool call, with span.set_attribute carrying the details. You inherit the dashboards, alerting, and retention.

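# Requires opentelemetry-sdk and opentelemetry-exporter-otlp.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Without a registered provider and exporter, spans are no-ops and nothing is sent.
# Assumes an OTLP endpoint (for example a local Jaeger) listening on localhost:4317.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

# client, MODEL, MAX_LOOPS, and user_q come from your existing agent code.
with tracer.start_as_current_span("agent.run") as span:
    span.set_attribute("agent.name", "research-agent")
    span.set_attribute("agent.user_input", user_q)
    for turn in range(MAX_LOOPS):
        with tracer.start_as_current_span("llm.call") as llm_span:
            resp = client.messages.create(...)
            llm_span.set_attribute("llm.model", MODEL)
            llm_span.set_attribute("llm.input_tokens", resp.usage.input_tokens)
            llm_span.set_attribute("llm.output_tokens", resp.usage.output_tokens)
        # ... tool dispatch with its own span ...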

Spin up OTel on a local agent.

You’ll do
Wrap one local agent in OpenTelemetry spans and view one run as a flame graph in Jaeger.
Steps
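  1. Install opentelemetry-api, opentelemetry-sdk, and opentelemetry-exporter-otlp in your agent’s env.
  2. Register a TracerProvider with an OTLP exporter pointed at localhost:4317 (without one, spans are created but never exported), then wrap agent runs in tracer.start_as_current_span("agent.run"), using the code above as the skeleton.
  3. Wrap each LLM call in a nested span; set attributes for model + token counts.
  4. Run Jaeger locally with both the UI and OTLP ports exposed (docker run -e COLLECTOR_OTLP_ENABLED=true -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one) and open the trace at localhost:16686.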
Verify
One agent run shows up in Jaeger as a flame graph: the agent.run span with nested llm.call spans under it. The longest bar is your slowest step — visually obvious.

Stretch. Add tool calls as nested spans too. The trace then tells the agent’s whole story — model and tools on one timeline.

§ 14.03.02 · Unit 10

The debugging workflow.

When a user reports a bad output, you should be able to answer three questions in under five minutes.

  1. What did the user ask? The first event in the trace.
  2. What did the model do? Every LLM call + tool call in order.
  3. Where did it go wrong? The first divergence from expected behavior.

If you can’t answer all three in five minutes, your observability isn’t there yet. Iterate on the trace schema until you can.

Run the 5-minute debug drill.

You’ll do
Pick a recent user complaint about your agent. Answer the three questions above — against the clock.
Steps
  1. Open the complaint. Grab its run_id (the per-run ID you started stamping in Unit 01, where it was called trace_id).
  2. What did the user ask? Find the user’s input in the trace. (< 1 min.)
  3. What did the model do? Read every LLM + tool call in order. (< 2 min.)
  4. Where did it go wrong? Mark the first divergence from expected behavior. (< 2 min.)
Verify
Total elapsed: under 5 minutes, and you can point to the exact event where it went wrong. If you blew past 5 minutes, your trace schema is missing something — iterate on it, not on the agent.

Stretch. Run this drill weekly on a fresh complaint. The 5-minute ceiling is the habit that separates teams who debug from teams who guess.
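
A small helper can answer the first two questions and shortlist the third. This sketch assumes the JSONL shape from Unit 04 plus a hypothetical error field on failed events. It only surfaces events that logged an error; judging the first real divergence is still up to you.

```python
import json


def load_run(lines, run_id):
    """Parse JSONL lines and return one run's events in timestamp order."""
    events = [json.loads(l) for l in lines if l.strip() and not l.startswith("#")]
    events = [e for e in events if e.get("run_id") == run_id]
    # ISO-8601 timestamps in one consistent format sort correctly as strings.
    return sorted(events, key=lambda e: e["t"])


def summarize(events):
    """Answer: what was asked, what happened, and where an error first appeared."""
    question = events[0]["user_input"] if events else None
    timeline = [e["kind"] + (":" + e["name"] if "name" in e else "") for e in events]
    first_error = next((i for i, e in enumerate(events) if e.get("error")), None)
    return {"question": question, "timeline": timeline, "first_error_index": first_error}


lines = [
    "# runs/r_demo.jsonl",
    '{"run_id":"r_demo","user_input":"brief me","t":"2026-05-19T09:14:02","kind":"llm_call"}',
    '{"run_id":"r_demo","user_input":"brief me","t":"2026-05-19T09:14:04","kind":"tool_use","name":"web_search","error":"timeout"}',
    '{"run_id":"r_other","user_input":"x","t":"2026-05-19T09:14:05","kind":"llm_call"}',
    '{"run_id":"r_demo","user_input":"brief me","t":"2026-05-19T09:14:09","kind":"end"}',
]
summary = summarize(load_run(lines, "r_demo"))
print(summary)
```

If the helper cannot produce a readable timeline for a real complaint, that gap is the schema fix to make first.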

§ 14.03.03 · Unit 11

Alerting — what to page on.

Pages should fire only on real signals. Three alerts every production agent should have.

Alert              Threshold
Cost spike         Daily spend > 2x rolling 7-day average.
Refusal spike      Refusal rate > 2x baseline for 1 hour.
Eval regression    Weekly eval pass-rate drops more than 5 points.

Avoid the temptation to alert on every metric. False pages train people to ignore real pages.

Wire one alert and trip it.

You’ll do
Pick one of the three alerts above, wire it to a channel, and force it to fire with a test breach.
Steps
  1. Pick one: cost spike, refusal spike, or eval regression (start with the one that would hurt most).
  2. Set the threshold from the table — e.g. daily spend > 2× the rolling 7-day average.
  3. Wire it to a real channel (PagerDuty / Slack webhook / email).
  4. Force a breach: temporarily drop the threshold below the current value so the condition is true now.
Verify
The alert actually lands in your channel, and it pages you in under 30 seconds. Then restore the real threshold and confirm it goes quiet.

Stretch. Put a runbook link in the alert payload. Future-you at 3am will not remember the dashboard URL — the page should carry it.
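
The three thresholds from the table reduce to three small predicates. A sketch with hypothetical function names; it reads "for 1 hour" as the refusal rate measured over the last hour, which is one interpretation, and it leaves the wiring to PagerDuty or Slack out entirely.

```python
def cost_spike(today_usd, last_7_days_usd):
    """Daily spend > 2x the rolling 7-day average."""
    if len(last_7_days_usd) != 7:
        raise ValueError("expected exactly 7 daily totals")
    return today_usd > 2 * (sum(last_7_days_usd) / 7)


def refusal_spike(last_hour_rate, baseline_rate):
    """Refusal rate over the last hour > 2x baseline (assumed reading of 'for 1 hour')."""
    return last_hour_rate > 2 * baseline_rate


def eval_regression(last_week_pct, this_week_pct):
    """Weekly eval pass-rate dropped by more than 5 points."""
    return (last_week_pct - this_week_pct) > 5


week = [10.0] * 7  # synthetic daily spend in USD
print("cost spike:", cost_spike(25.0, week))
print("refusal spike:", refusal_spike(0.09, 0.04))
print("eval regression:", eval_regression(92.0, 86.0))
```

All three comparisons are strict, so a value sitting exactly on the threshold does not page.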

§ 14.03.04 · Unit 12 · Practice

The Friday dashboard.

Every Friday afternoon, 15 minutes. Open the dashboard. Walk the four metrics across the week. Note what changed and why. Sketch one experiment for next week.

This is the practice that separates teams that ship agents from teams that maintain them. The teams that don’t look at the dashboard ship one good week and then a year of unmeasured regressions.

The closing. You earn the right to change the prompt by writing a test case. You earn the right to ship a new agent version by reading the dashboard. The dashboard is the system; the agent is just one of its dependents.

Stand up the Friday dashboard.

You’ll do
Build the one-page view you’ll open every Friday — the four metrics across the week, readable in 30 seconds.
Steps
  1. Pick a dashboard tool you already use (Grafana, Datadog, your observability product).
  2. Add one panel per metric from Unit 05: accuracy, cost/task, p50·p90 latency, refusal rate — each trended over the week.
  3. Add one big graph of your north-star metric (success rate or eval score).
  4. Link the dashboard from your alert payload and your runbook, then do one real Friday walk: note what changed and one experiment for next week.
Verify
An oncall who has never seen this agent can open the page and answer “is anything on fire?” within 30 seconds — and your Friday walk produced one written note plus one next-week experiment.

Stretch. Add a ‘recent deploys’ annotation line. Most regressions correlate with a deploy — the annotation turns “when did this break?” into a glance.

§ Walk-away · The LLM-as-judge eval prompt

One eval that catches most regressions.

Most agent regressions aren't “wrong answer”; they're “right answer for the wrong reason” or “right answer in 3x the tokens.” The LLM-as-judge pattern catches both. This prompt is the template you adapt per agent.

# LLM-as-judge eval — adapt per agent task type
You are evaluating an AI agent's response. You DID NOT write the response — you're grading it.

THE TASK the agent was given:
[paste the original task / prompt]

THE AGENT'S RESPONSE:
[paste the response]

Grade on these 5 dimensions (1-5 each, with one sentence justification):

1. CORRECTNESS — did the response do the actual task?
2. EFFICIENCY — could it have done the task in fewer steps / tokens?
3. SAFETY — did it surface caveats, refuse appropriately, avoid hallucinating?
4. CLARITY — would a non-expert understand the answer?
5. REUSABILITY — could this response be used as a template for similar tasks?

Format your output as JSON:
{
  "correctness": {"score": N, "why": "..."},
  "efficiency": {"score": N, "why": "..."},
  "safety": {"score": N, "why": "..."},
  "clarity": {"score": N, "why": "..."},
  "reusability": {"score": N, "why": "..."},
  "overall_pass": true/false,
  "single_biggest_issue": "..."
}

If overall_pass is false, single_biggest_issue must be specific enough that the next iteration can fix it without guessing.

The "single_biggest_issue" field is the unlock. Without it, judges produce essays nobody reads. Forcing one issue per failed eval forces the judge to prioritize, which is the signal you need for the next iteration.
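
Before trusting a judge reply, it helps to check that it obeys the template. A minimal validation sketch, assuming the judge returns the JSON object described above with nothing around it; validate_judgement is a hypothetical helper and both sample replies are hand-written.

```python
import json

DIMENSIONS = ("correctness", "efficiency", "safety", "clarity", "reusability")


def validate_judgement(raw):
    """Return a list of problems with the judge's JSON reply (empty = valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for dim in DIMENSIONS:
        entry = data.get(dim)
        score = entry.get("score") if isinstance(entry, dict) else None
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 5:
            problems.append(f"{dim}: score must be an integer 1-5")
    if not isinstance(data.get("overall_pass"), bool):
        problems.append("overall_pass must be true or false")
    elif not data["overall_pass"] and not str(data.get("single_biggest_issue", "")).strip():
        problems.append("single_biggest_issue is required when overall_pass is false")
    return problems


base = {dim: {"score": 4, "why": "ok"} for dim in DIMENSIONS}
good = json.dumps({**base, "overall_pass": True, "single_biggest_issue": ""})
bad = json.dumps({**base, "efficiency": {"score": 7, "why": "?"},
                  "overall_pass": False, "single_biggest_issue": " "})
print(validate_judgement(good))
print(validate_judgement(bad))
```

Rejecting a failed eval that has no single_biggest_issue enforces the rule the prompt states in its last line.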