
Production AI agent observability stack

An AI agent observability stack should explain what the agent did, why it mattered, and whether the outcome was acceptable.

That requires more than API latency, token usage, and error rates.

A useful stack usually connects:

  • traces,
  • structured logs,
  • workflow metrics,
  • approval and escalation records,
  • evaluation labels,
  • cost attribution,
  • and incident notes.

If those signals live in unrelated tools with no shared run ID, production review becomes guesswork.

The weak stack is:

“We have logs from the app and traces from the LLM provider.”

That is not enough when the incident questions are:

  • Did the agent choose the wrong tool?
  • Did it act without enough evidence?
  • Did a human approval gate fail?
  • Did retries hide a bad workflow path?
  • Did the result cost more than the task was worth?

General observability tools can show that something happened. Agent observability has to show whether the behavior was acceptable.

Every run needs a stable ID that ties together:

  • user or tenant scope,
  • workflow type,
  • release version,
  • model lane,
  • tool configuration,
  • approval policy,
  • and final status.

Without a stable run identity, traces, logs, evals, and support tickets cannot be joined later.
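As a minimal sketch of what such a run identity could look like, the record below carries the fields listed above. The class name `RunContext`, the field names, and the status strings are illustrative assumptions, not a prescribed schema.

```python
import dataclasses
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    """Hypothetical run identity record; field names are illustrative."""
    run_id: str
    tenant_id: str
    workflow_type: str
    release_version: str
    model_lane: str
    tool_config: str
    approval_policy: str
    final_status: str = "in_progress"


def new_run(tenant_id, workflow_type, release_version,
            model_lane, tool_config, approval_policy) -> RunContext:
    # The ID is generated once, at run start, and never changes afterwards.
    return RunContext(
        run_id=uuid.uuid4().hex,
        tenant_id=tenant_id,
        workflow_type=workflow_type,
        release_version=release_version,
        model_lane=model_lane,
        tool_config=tool_config,
        approval_policy=approval_policy,
    )


def finish(run: RunContext, status: str) -> RunContext:
    # Returns a copy with the final status set; the run_id is preserved.
    return dataclasses.replace(run, final_status=status)
```

Every trace span, log line, eval label, and support ticket would then carry `run_id`, so the records can be joined later.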

The trace layer explains the path.

It should show:

  • model calls,
  • tool calls,
  • retrieval or search steps,
  • intermediate decisions,
  • retries,
  • fallback paths,
  • and approval requests.

Traces are best for debugging a specific run. They are weaker for long-term reporting, especially when they are the only evidence layer.
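To make the trace layer concrete, here is a small in-memory sketch that records the step types listed above as ordered spans under one run ID. The names `Span`, `RunTrace`, and `SPAN_KINDS` are hypothetical; a real deployment would more likely export spans to a tracing backend than keep them in a list.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional

# Step types taken from the list above; the identifiers are illustrative.
SPAN_KINDS = {
    "model_call", "tool_call", "retrieval", "decision",
    "retry", "fallback", "approval_request",
}


@dataclass
class Span:
    run_id: str
    kind: str
    name: str
    parent_index: Optional[int] = None
    started_at: float = field(default_factory=time.time)
    attributes: dict = field(default_factory=dict)


class RunTrace:
    """Collects the ordered steps of a single agent run."""

    def __init__(self, run_id: str):
        self.run_id = run_id
        self.spans: List[Span] = []

    def record(self, kind: str, name: str,
               parent_index: Optional[int] = None, **attributes) -> int:
        if kind not in SPAN_KINDS:
            raise ValueError(f"unknown span kind: {kind}")
        self.spans.append(Span(self.run_id, kind, name,
                               parent_index, attributes=attributes))
        return len(self.spans) - 1  # index usable as a parent reference

    def path(self) -> List[str]:
        # Compact view of the path the run took, in recorded order.
        return [f"{s.kind}:{s.name}" for s in self.spans]
```

Recording retries and fallbacks as their own spans, rather than overwriting the original step, keeps the failed attempts visible during review.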

Logs preserve durable facts.

The strongest fields are usually:

  • run ID,
  • workflow class,
  • tool name and outcome,
  • approval decision,
  • final status,
  • failure class,
  • latency,
  • cost,
  • version,
  • and reviewer label.

The log layer should be compact enough to retain, query, and sample over time.
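One way to keep the log layer compact is to emit a single JSON line per event with a fixed set of fields. The sketch below assumes the field names in `LOG_FIELDS` and the helper `make_log_record`; both are illustrative and would need to match your own schema.

```python
import json

# Illustrative field set based on the list above.
LOG_FIELDS = (
    "run_id", "workflow_class", "tool_name", "tool_outcome",
    "approval_decision", "final_status", "failure_class",
    "latency_ms", "cost_usd", "version", "reviewer_label",
)


def make_log_record(**fields) -> str:
    """Build one JSON log line with a fixed, queryable set of keys."""
    unknown = set(fields) - set(LOG_FIELDS)
    if unknown:
        raise ValueError(f"unexpected log fields: {sorted(unknown)}")
    if not fields.get("run_id"):
        raise ValueError("run_id is required on every log record")
    # Missing fields are written as null so every record has the same shape.
    record = {name: fields.get(name) for name in LOG_FIELDS}
    return json.dumps(record, sort_keys=True)
```

Rejecting unknown fields is a deliberate choice in this sketch: it stops raw prompts or payloads from drifting into the long-lived log by accident.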

Metrics translate events into operating signals.

Useful production metrics include:

  • successful outcome rate,
  • high-severity failure rate,
  • escalation rate,
  • approval rate,
  • manual rescue rate,
  • retry rate,
  • time to trusted completion,
  • and cost per successful outcome.

These metrics should be segmented by workflow type, risk class, model lane, and release version.
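A rough sketch of how such segmented metrics could be derived from per-run records follows. It assumes each run is a dict with `final_status`, `cost_usd`, and an optional `manual_rescue` flag, and it defines cost per successful outcome as the total cost of all runs in the segment divided by the number of successes; that definition is an assumption, and other definitions are reasonable.

```python
from collections import defaultdict


def segment_metrics(runs, segment_key):
    """Aggregate outcome metrics per segment (e.g. per workflow type)."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[segment_key]].append(run)

    metrics = {}
    for segment, items in groups.items():
        total = len(items)
        successes = sum(1 for r in items if r["final_status"] == "succeeded")
        rescued = sum(1 for r in items if r.get("manual_rescue", False))
        total_cost = sum(r["cost_usd"] for r in items)
        metrics[segment] = {
            "runs": total,
            "success_rate": successes / total,
            "manual_rescue_rate": rescued / total,
            # Failed runs still cost money, so their cost is included here.
            "cost_per_success": total_cost / successes if successes else None,
        }
    return metrics
```

Calling `segment_metrics(runs, "workflow_type")` and then `segment_metrics(runs, "release_version")` on the same records gives two views of the same outcomes without re-instrumenting anything.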

The eval layer turns observed behavior into judgment.

It should capture:

  • pass or fail labels,
  • severity,
  • failure taxonomy,
  • reviewer notes,
  • ground truth when available,
  • and whether the example should enter a regression set.

Observability without review becomes dashboards. Review without observability becomes anecdote.
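The eval fields above can be sketched as a small label record attached to a run ID. The `EvalLabel` class, its field names, and the severity values are hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass
from typing import List, Optional

SEVERITIES = ("low", "medium", "high")  # illustrative scale


@dataclass
class EvalLabel:
    run_id: str
    passed: bool
    severity: str
    failure_class: Optional[str] = None   # entry from the failure taxonomy
    reviewer_notes: str = ""
    ground_truth: Optional[str] = None    # only when available
    add_to_regression_set: bool = False

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


def regression_candidates(labels: List[EvalLabel]) -> List[str]:
    """Run IDs that reviewers flagged for the regression set."""
    return [label.run_id for label in labels if label.add_to_regression_set]
```

Because the label points at a `run_id`, a reviewer's judgment can be joined back to the trace and logs that explain the behavior.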

Do not use observability as an excuse to retain everything.

Be deliberate with:

  • raw prompts,
  • customer payloads,
  • tool outputs,
  • files,
  • credentials,
  • private messages,
  • and regulated data.

The practical pattern is structured operational evidence plus selective secure retention, not infinite transcript hoarding.
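A minimal sketch of selective retention, assuming events arrive as flat dicts: sensitive fields are redacted by default, a caller can opt in to keeping specific ones for secure storage, and credentials are always dropped. The field names in `SENSITIVE_FIELDS` and the function `to_retained_record` are assumptions for illustration, and this is not a substitute for a reviewed data-handling policy.

```python
# Illustrative field names; adapt to your own event schema.
SENSITIVE_FIELDS = frozenset({
    "raw_prompt", "customer_payload", "tool_output",
    "file_contents", "private_message",
})
NEVER_RETAIN = frozenset({"credentials"})


def to_retained_record(event: dict, keep_sensitive=frozenset()) -> dict:
    """Return the version of an event that is safe to retain long term."""
    retained = {}
    for key, value in event.items():
        if key in NEVER_RETAIN:
            continue  # dropped entirely, even if a caller asks to keep it
        if key in SENSITIVE_FIELDS and key not in keep_sensitive:
            retained[key] = "[redacted]"
        else:
            retained[key] = value
    return retained
```

Keeping the redaction marker, rather than silently deleting the key, lets a reviewer see that the field existed without exposing its contents.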

Alerts should be built from business-sensitive behavior, not only infrastructure symptoms.

Strong alert candidates include:

  • high-severity failure spikes,
  • approval bypass patterns,
  • manual rescue jumps,
  • retry storms,
  • cost spikes without success-rate improvement,
  • sudden tool failure concentration,
  • and regressions tied to a release version.

The alert should point to the run IDs, traces, and recent examples that explain the change.
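As an illustration of one alert from the list above, the sketch below compares a baseline window with a current window and fires when cost rises sharply without a matching success-rate gain. The window shape, the function name `cost_spike_alert`, and the threshold defaults are assumptions chosen for the example, not recommended values.

```python
def cost_spike_alert(baseline, current,
                     cost_ratio_threshold=1.5, min_success_gain=0.02,
                     max_examples=5):
    """Return an alert dict, or None if the change looks acceptable.

    Each window is a dict with `cost_per_run`, `success_rate`, and `run_ids`.
    """
    if baseline["cost_per_run"] <= 0:
        return None  # no meaningful baseline to compare against

    cost_ratio = current["cost_per_run"] / baseline["cost_per_run"]
    success_gain = current["success_rate"] - baseline["success_rate"]

    if cost_ratio >= cost_ratio_threshold and success_gain < min_success_gain:
        return {
            "alert": "cost_spike_without_success_gain",
            "cost_ratio": cost_ratio,
            "success_rate_change": success_gain,
            # Point the responder at concrete runs, not just a number.
            "example_run_ids": current["run_ids"][:max_examples],
        }
    return None
```

Including example run IDs in the alert payload is what lets the on-call owner jump from the signal to the traces and logs behind it.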

When evaluating observability tooling, ask:

  1. Can it connect model calls, tool calls, approvals, costs, and outcomes under one run ID?
  2. Can non-engineering reviewers label examples safely?
  3. Can it produce regression datasets from real incidents?
  4. Can it support retention rules instead of storing everything forever?
  5. Can it trigger operating decisions such as rollback, canary pause, or approval tightening?

If the answer to any of these is no, the tool may still be useful for debugging but is weak for operating production agents.

Your stack is probably healthy when:

  • every run has one durable identity;
  • traces explain path-level behavior;
  • structured logs preserve long-term evidence;
  • metrics reflect outcome, risk, cost, and review burden;
  • eval labels can be attached to real runs;
  • and alerts route directly to owners, examples, and response actions.