How to review AI agent production incidents

An AI agent incident review should produce operating changes, not just a story.

The review should answer:

  • what failed,
  • why existing controls did not stop it,
  • what evidence was missing,
  • which eval should have caught it,
  • which alert should have surfaced it,
  • and what release or approval rule changes before the next rollout.

If the review ends with “the model made a bad choice” and nothing changes, the team has not learned enough.

The weak mental model of an incident is:

“The agent hallucinated. We improved the prompt.”

Sometimes that is true. Often it is incomplete.

Incidents may come from:

  • wrong tool selection,
  • weak retrieval,
  • missing approval boundary,
  • unclear workflow ownership,
  • prompt drift,
  • model-route change,
  • untested edge case,
  • or downstream system behavior.

The review should classify the system failure, not only the output failure.

Capture:

  • incident ID,
  • date and detection source,
  • affected workflow,
  • severity,
  • run IDs,
  • agent version,
  • model lane,
  • tool configuration,
  • approval policy,
  • release or configuration changes,
  • customer or operator impact,
  • containment action,
  • and final corrective actions.

This record is the evidence base for improving how the agent is operated.
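The capture list can be held in a single structured record so that gaps are easy to spot. The following is a minimal sketch, not a prescribed schema: `IncidentRecord`, its field names, and the `missing_evidence` helper are all hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class IncidentRecord:
    """Hypothetical record; each field mirrors one item in the capture list."""
    incident_id: str
    detected_on: date
    detection_source: str      # e.g. "user", "operator", "alert"
    affected_workflow: str
    severity: str
    run_ids: list[str]
    agent_version: str
    model_lane: str
    tool_configuration: str
    approval_policy: str
    release_changes: list[str] = field(default_factory=list)
    impact: str = ""
    containment_action: str = ""
    corrective_actions: list[str] = field(default_factory=list)


def missing_evidence(record: IncidentRecord) -> list[str]:
    """Return the names of fields that are still empty."""
    return [
        name
        for name, value in vars(record).items()
        if value in ("", [], None)
    ]
```

Running `missing_evidence` at the start of the review gives a concrete list of what still has to be collected before the discussion can rely on the record.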

Every incident should receive one primary failure class and optional secondary classes.

Useful classes include:

  • instruction failure,
  • tool selection failure,
  • tool argument failure,
  • retrieval or evidence failure,
  • approval boundary failure,
  • workflow routing failure,
  • escalation failure,
  • cost-control failure,
  • latency or timeout failure,
  • and release-process failure.

The taxonomy matters because each class has a different fix.
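One way to enforce "one primary class, optional secondary classes" is to encode the taxonomy as an enumeration. This is a sketch under assumptions: the names `FailureClass` and `Classification` are illustrative, and the rule that a secondary class must not repeat the primary is an added convention, not something the taxonomy itself requires.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    """The failure classes listed above, as enum members."""
    INSTRUCTION = "instruction failure"
    TOOL_SELECTION = "tool selection failure"
    TOOL_ARGUMENT = "tool argument failure"
    RETRIEVAL_OR_EVIDENCE = "retrieval or evidence failure"
    APPROVAL_BOUNDARY = "approval boundary failure"
    WORKFLOW_ROUTING = "workflow routing failure"
    ESCALATION = "escalation failure"
    COST_CONTROL = "cost-control failure"
    LATENCY_OR_TIMEOUT = "latency or timeout failure"
    RELEASE_PROCESS = "release-process failure"


@dataclass(frozen=True)
class Classification:
    """Exactly one primary class plus zero or more secondary classes."""
    primary: FailureClass
    secondary: tuple[FailureClass, ...] = ()

    def __post_init__(self) -> None:
        # Assumed convention: a secondary class should add information.
        if self.primary in self.secondary:
            raise ValueError("primary class must not repeat as a secondary class")
```

For example, `Classification(FailureClass.TOOL_ARGUMENT, (FailureClass.RELEASE_PROCESS,))` records a bad tool call that also slipped through the release process.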

The trigger is what started the incident.

Examples:

  • new prompt version,
  • changed tool schema,
  • model lane switch,
  • updated retrieval corpus,
  • larger customer workload,
  • or unusual user input.

The failed control is what should have contained it.

Examples:

  • eval gap,
  • missing canary,
  • weak approval policy,
  • absent alert,
  • incomplete logging,
  • no fallback lane,
  • or unclear incident owner.

Good reviews separate these two. Otherwise the team fixes the trigger and misses the control gap.
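The separation can be made explicit by recording the trigger and the failed control as two distinct fields and checking that both are filled in. A minimal sketch, assuming hypothetical names `CauseAnalysis` and `separation_gaps`:

```python
from dataclasses import dataclass


@dataclass
class CauseAnalysis:
    trigger: str          # what started the incident
    failed_control: str   # what should have contained it


def separation_gaps(analysis: CauseAnalysis) -> list[str]:
    """List the ways a review has failed to separate trigger from control."""
    trigger = analysis.trigger.strip().lower()
    control = analysis.failed_control.strip().lower()
    gaps = []
    if not trigger:
        gaps.append("trigger not identified")
    if not control:
        gaps.append("failed control not identified")
    if trigger and trigger == control:
        gaps.append("trigger and failed control are identical; record them separately")
    return gaps
```

A review that names "new prompt version" as the trigger but leaves the failed control empty would come back with one gap, which is the case where a team fixes the trigger and misses the control.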

For every serious incident, decide whether to add:

  • the exact run to a regression set,
  • a simplified synthetic version to a release gate,
  • a tool-selection test,
  • an approval-boundary test,
  • a retrieval evidence case,
  • or a reviewer-training example.

The point is not to overfit to one incident. The point is to protect the class of failure.
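Tagging each new eval case with the failure class it protects makes it possible to check coverage per class rather than per incident. The sketch below is illustrative only: `EvalCase`, its fields, and `coverage_by_class` are hypothetical names, and failure classes are plain strings here for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    source_incident: str   # incident ID the case was derived from
    failure_class: str     # class of failure the case protects against
    kind: str              # e.g. "exact_run", "synthetic", "tool_selection"


def coverage_by_class(cases: list[EvalCase]) -> Counter:
    """Count eval cases per failure class, regardless of source incident."""
    return Counter(case.failure_class for case in cases)
```

A class with many incidents but few cases in this count is a candidate for a synthetic release-gate case rather than another exact-run replay.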

Ask:

  • Was the incident detected by a user, operator, dashboard, alert, or review queue?
  • Should it have been flagged as urgent?
  • Which metric or event changed first?
  • Did the alert include enough examples to act?
  • Was the owner obvious?

If detection came late, the alert lacked enough detail to act on, or the owner was unclear, update alert design or review sampling.

Incidents often reveal weak rollout discipline.

Possible release changes:

  • new canary threshold,
  • stricter approval for one action class,
  • mandatory eval pass before release,
  • release notes for model-route changes,
  • stronger rollback metadata,
  • or a freeze rule after repeated failures.

The review should specify which release gate changes and who owns it.
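A release change without an owner tends not to happen, so it can help to record each change with its gate and owner and flag the ones that are unassigned. This is a small sketch with hypothetical names (`GateChange`, `unowned_changes`), not a description of any particular release tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateChange:
    gate: str     # which release gate changes, e.g. "canary threshold"
    change: str   # what changes about it
    owner: str    # who is accountable for making the change


def unowned_changes(changes: list[GateChange]) -> list[str]:
    """Return the gates whose change has no named owner."""
    return [c.gate for c in changes if not c.owner.strip()]
```

The review is not finished while `unowned_changes` returns anything.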

Keep the meeting tight:

  1. Facts and timeline.
  2. Impact and severity.
  3. Failure taxonomy.
  4. Missed controls.
  5. Evidence gaps.
  6. Eval, alert, release, and ownership changes.

Avoid long speculation about model personality. Focus on observable system behavior.

Your review process is probably healthy when:

  • every serious incident gets a failure class;
  • trigger and failed control are separated;
  • missing evidence becomes a logging or tracing task;
  • representative examples enter eval or reviewer workflows;
  • alert thresholds or review sampling improve;
  • and release gates change when rollout discipline failed.