What should happen when an AI agent fails in production?
Quick answer
When an AI agent fails in production, the system should:
- stop unsafe or unclear actions,
- classify the failure,
- preserve the evidence,
- route the case to the right human or fallback path,
- and decide whether the issue is local, systemic, or rollout-related.
The worst production pattern is silent failure followed by hidden manual rescue.
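The steps above can be sketched as a minimal triage function. The `FailureClass` taxonomy and the string labels here are illustrative, not from any particular framework:

```python
from enum import Enum, auto

class FailureClass(Enum):
    """Illustrative failure taxonomy; a real system will have more classes."""
    TRANSIENT = auto()   # e.g. upstream timeout
    POLICY = auto()      # permission or policy violation
    EVIDENCE = auto()    # required evidence missing
    AMBIGUOUS = auto()   # high-consequence ambiguity

def triage(failure: FailureClass) -> str:
    """Decide the next step for a failed run: contain, retry, or hand off."""
    if failure in (FailureClass.POLICY, FailureClass.EVIDENCE, FailureClass.AMBIGUOUS):
        return "stop_and_contain"   # fail closed; route to a human owner
    if failure is FailureClass.TRANSIENT:
        return "bounded_retry"      # retry only if the step is idempotent
    return "handoff"                # unknown classes go to a human by default
```

Note the default: anything unclassified goes to a human rather than being retried silently.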
The wrong failure plan
The weak plan is:
“Retry until it works.”
That only helps when the failure was:
- transient,
- low-risk,
- and tied to an idempotent action.
If the run failed because of missing evidence, wrong authority, weak approval logic, or a dangerous side effect, blind retries make the incident worse.
The first decision: is this a safe-stop failure?
Some failures should stop immediately:
- policy or permission violations,
- missing required evidence,
- high-consequence ambiguity,
- tool actions that are not safely repeatable,
- or any run that may have crossed the wrong authority boundary.
These are not retry cases. They are containment cases.
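Containment can be as simple as freezing the run with its evidence intact and flagging it for review. This is a sketch; field names are illustrative, and a real system would persist the record durably:

```python
def contain(run_id: str, reason: str, state: dict) -> dict:
    """Fail closed: freeze the run, keep its evidence, flag it for review."""
    return {
        "run_id": run_id,
        "status": "contained",    # no retries are allowed from this state
        "reason": reason,
        "evidence": dict(state),  # snapshot, not a live reference
        "needs_human_review": True,
    }
```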
The second decision: is this retryable?
Retries are justified only when:
- the failure is transient,
- the step is idempotent,
- the system knows what changed,
- and retrying does not widen the blast radius.
Examples include flaky upstream services or temporary tool timeouts. Even then, retries should be bounded and logged.
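A bounded, logged retry under those constraints might look like this sketch, which refuses outright to retry a step not declared idempotent and only retries transient errors (here represented as `TimeoutError`):

```python
import logging
import time

log = logging.getLogger("agent.retry")

def bounded_retry(step, *, attempts: int = 3, base_delay: float = 0.1,
                  idempotent: bool = False):
    """Retry a transient, idempotent step a fixed number of times."""
    if not idempotent:
        raise ValueError("refusing to retry a non-idempotent step")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except TimeoutError as exc:  # only transient errors are retried
            last_error = exc
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            time.sleep(base_delay * attempt)  # simple linear backoff
    raise last_error
```

Any non-transient exception escapes immediately, which keeps the blast radius from widening.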
The third decision: who owns the handoff?
A failed run should not disappear into generic “manual review.”
The system should hand off:
- what the task was,
- what evidence it used,
- which tools were called,
- what failed,
- and what the likely next action is.
This prevents the human rescue path from turning into full rediscovery.
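The handoff items above can be packaged as a single structured object. The field names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffPacket:
    """Everything a human (or fallback path) needs without rediscovery."""
    task: str                                          # what the task was
    evidence: list[str] = field(default_factory=list)  # what evidence it used
    tools_called: list[str] = field(default_factory=list)
    failure: str = "unknown"                           # what failed
    suggested_next_action: str = "triage"              # likely next action
```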
What the system should record every time
Every meaningful failure should capture:
- a stable run ID,
- workflow class,
- failure class,
- retry count,
- approval state,
- tool and model context,
- final handoff target,
- and whether the issue implies rollback or narrower permissions.
Without this, teams remember dramatic failures and forget the expensive repeated ones.
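One way to keep those fields from eroding is to validate every failure record against a required-field set before it is written. This is a sketch; the field names mirror the list above but are otherwise assumptions:

```python
REQUIRED_FIELDS = {
    "run_id", "workflow_class", "failure_class", "retry_count",
    "approval_state", "tool_context", "model_context",
    "handoff_target", "implies_rollback",
}

def validate_failure_record(record: dict) -> dict:
    """Reject failure logs missing any required field, so repeated
    cheap-looking failures stay countable later."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"failure record missing fields: {sorted(missing)}")
    return record
```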
When failure should trigger rollback
Rollback should be considered when:
- a new version created a clear regression,
- high-severity failures increased,
- approval boundaries drifted,
- or operator rescue work spiked after a release.
Not every bad run is a rollback event. But every rollback event starts as a set of badly understood failures.
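A regression check of this kind can be a simple comparison of per-class failure counts before and after a release. The 1.5x threshold below is an illustrative default, not a standard:

```python
def should_consider_rollback(before: dict, after: dict,
                             regression_ratio: float = 1.5) -> bool:
    """Flag a release when any failure class grew past the threshold.

    `before` and `after` map failure-class names to counts over
    comparable windows on either side of the release.
    """
    for failure_class, new_count in after.items():
        old_count = before.get(failure_class, 0)
        # max(1, ...) keeps previously-clean classes from dividing by zero
        if new_count >= max(1, old_count) * regression_ratio:
            return True
    return False
```

This only flags candidates; the decision to roll back stays with an owner.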
The practical production rule
For each failure class, decide in advance:
- stop or retry,
- who gets the handoff,
- what evidence must be preserved,
- and what metric would trigger rollback or tighter scope.
That turns failure handling from improvisation into operations.
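Those advance decisions fit naturally in a policy table keyed by failure class. The classes, owners, and metric names here are illustrative:

```python
# One row per failure class, decided before the incident, not during it.
FAILURE_POLICY = {
    "transient_timeout": {"action": "retry", "owner": "on-call",
                          "preserve": ["run_log"],
                          "rollback_metric": None},
    "policy_violation": {"action": "stop", "owner": "security",
                         "preserve": ["run_log", "tool_calls"],
                         "rollback_metric": "violations_per_day"},
    "missing_evidence": {"action": "handoff", "owner": "workflow-team",
                         "preserve": ["inputs", "retrieved_docs"],
                         "rollback_metric": "handoffs_per_release"},
}

def policy_for(failure_class: str) -> dict:
    """Unknown failure classes fail closed rather than improvising."""
    return FAILURE_POLICY.get(failure_class, {
        "action": "stop", "owner": "on-call",
        "preserve": ["run_log"], "rollback_metric": None,
    })
```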
Implementation checklist
Your production failure plan is probably healthy when:
- unsafe cases fail closed instead of retrying blindly;
- retryable cases are narrow and idempotent;
- handoff packets preserve context instead of discarding it;
- logs can distinguish one-off failure from rollout regression;
- and owners know which failure patterns trigger rollback, approval tightening, or evaluation updates.