Tool timeouts, retries, and idempotency for AI agents

Agents should retry only when the failure is recoverable, the action is safe to repeat, and the team can explain why another attempt has a realistic chance of succeeding.

If the tool can write, purchase, submit, delete, or trigger side effects, idempotency matters before retry logic does.

Without explicit timeout and retry rules, agents tend to do the wrong thing in production:

  • wait too long on slow tools,
  • retry the wrong calls,
  • duplicate side effects,
  • or keep trying until the workflow looks busy enough to appear intelligent.

This is not resilience. It is uncontrolled repetition.

Every tool class needs a deadline.

If the task is interactive, timeout budgets should be tighter. If the task is backgrounded, longer budgets may be fine, but they still need a stop condition.
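One way to make the deadline concrete is a single time budget that every attempt of a tool call draws from. The sketch below is illustrative only: the `Deadline` class, the `TIMEOUT_BUDGETS` table, and the second values are assumptions, not figures from any particular agent framework.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical budgets in seconds; tune these per tool class and product.
TIMEOUT_BUDGETS = {"interactive": 5.0, "background": 120.0}


@dataclass
class Deadline:
    """One overall time budget shared by all attempts of a tool call."""
    expires_at: float

    @classmethod
    def for_mode(cls, mode: str, now: Optional[float] = None) -> "Deadline":
        # A monotonic clock avoids surprises when the wall clock is adjusted.
        start = time.monotonic() if now is None else now
        return cls(expires_at=start + TIMEOUT_BUDGETS[mode])

    def remaining(self, now: Optional[float] = None) -> float:
        current = time.monotonic() if now is None else now
        return max(0.0, self.expires_at - current)

    def expired(self, now: Optional[float] = None) -> bool:
        # The stop condition: once the budget is spent, no further attempts.
        return self.remaining(now) == 0.0
```

Passing `deadline.remaining()` as the timeout of each attempt keeps the total wait bounded no matter how many attempts are made.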

Retries should happen only for failures likely to recover:

  • transient network issues,
  • temporary rate limits,
  • or short-lived upstream instability.

They should not be the default response to ambiguous tool errors.
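The distinction above can be enforced by classifying every failure before any retry logic runs. The following is a minimal sketch, assuming the tool surfaces either a Python exception or an HTTP status code. The `FailureClass` names and the status-code mapping are assumptions for illustration; anything unrecognized falls through to `AMBIGUOUS`, which is deliberately not retryable.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    TRANSIENT_NETWORK = "transient_network"
    RATE_LIMITED = "rate_limited"
    UPSTREAM_UNSTABLE = "upstream_unstable"
    AMBIGUOUS = "ambiguous"


# Only known, transient classes are eligible for a retry.
RETRYABLE = {
    FailureClass.TRANSIENT_NETWORK,
    FailureClass.RATE_LIMITED,
    FailureClass.UPSTREAM_UNSTABLE,
}


def classify_failure(
    exc: Optional[Exception] = None, status: Optional[int] = None
) -> FailureClass:
    """Map a raw failure to a class; unknown failures stay ambiguous."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return FailureClass.TRANSIENT_NETWORK
    if status == 429:  # Too Many Requests
        return FailureClass.RATE_LIMITED
    if status in (502, 503, 504):  # gateway or availability errors
        return FailureClass.UPSTREAM_UNSTABLE
    return FailureClass.AMBIGUOUS
```

Note that a retryable failure class is necessary but not sufficient: a timeout on a write tool may mean the write already landed, so the safety of repeating the action still has to be checked separately.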

Any tool with side effects needs a model for safe repetition. If the same request can create duplicate writes or repeated submissions, retries can make the system actively dangerous.
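A common model for safe repetition is an idempotency key: the same logical request always carries the same key, and the receiving side performs the action at most once per key. The sketch below is an in-memory illustration under that assumption. `idempotency_key` and `DedupingExecutor` are hypothetical names, and a real target system would need to persist keys and expire them.

```python
import hashlib
import json


def idempotency_key(tool_name: str, args: dict, task_id: str) -> str:
    """Derive a stable key so a retried request maps to the same key."""
    payload = json.dumps(
        {"tool": tool_name, "args": args, "task": task_id}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DedupingExecutor:
    """In-memory stand-in for a system that honors idempotency keys."""

    def __init__(self, action):
        self._action = action
        self._results = {}
        self.executions = 0  # how many times the side effect really ran

    def execute(self, key: str, **args):
        # A repeated key returns the stored result instead of re-running.
        if key in self._results:
            return self._results[key]
        self.executions += 1
        result = self._action(**args)
        self._results[key] = result
        return result
```

With this in place, a retry after a lost response replays the stored result rather than creating a second write.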

Retry only when all three are true:

  1. the failure class is known and transient,
  2. the tool action is idempotent or safely replayable,
  3. another attempt has a realistic chance of success.

If any of those is false, escalate or stop.
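The three conditions can be expressed as a gate that lives in the system, not in the model's judgment. This is a sketch under stated assumptions: `AttemptContext` and `decide` are hypothetical names, "realistic chance of success" is approximated as having attempts and time budget left, and the choice between escalating and stopping (escalate when side effects are involved) is one possible convention.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    STOP = "stop"


@dataclass(frozen=True)
class AttemptContext:
    failure_is_known_transient: bool  # condition 1
    action_is_idempotent: bool        # condition 2
    attempts_made: int
    max_attempts: int
    seconds_remaining: float
    has_side_effects: bool


def decide(ctx: AttemptContext) -> Decision:
    # Condition 3, approximated: attempts and time budget both remain.
    realistic_chance = (
        ctx.attempts_made < ctx.max_attempts and ctx.seconds_remaining > 0
    )
    if (
        ctx.failure_is_known_transient
        and ctx.action_is_idempotent
        and realistic_chance
    ):
        return Decision.RETRY
    # Assumed convention: side effects go to a human, everything else stops.
    return Decision.ESCALATE if ctx.has_side_effects else Decision.STOP
```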

The most common failures are:

  • using the same retry policy for read and write tools,
  • letting the model decide to “try again” without system constraints,
  • and ignoring whether the target system supports idempotency keys or deduplication logic.

These gaps lead to invisible duplication and confusing audit trails.

Tool class                  | Healthier retry model
Read-only search or fetch   | Limited retries on known transient errors
File or retrieval lookups   | Retries with bounded attempt count and timeout budget
Write actions               | Retry only with explicit idempotency support or human confirmation
High-risk side effects      | Prefer escalation over autonomous retries

This split is more useful than a global retry count.
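A per-class split like this can live in configuration instead of in prompts. The sketch below assumes four hypothetical class names and illustrative attempt counts and timeouts; none of the numbers come from the article. Human confirmation for write actions is not modeled here and would need its own path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int               # total attempts, including the first
    timeout_seconds: float          # budget shared across all attempts
    requires_idempotency_key: bool
    escalate_on_failure: bool


# Illustrative values only; each team should set its own.
POLICIES = {
    "read_only": RetryPolicy(3, 10.0, False, False),
    "lookup": RetryPolicy(2, 15.0, False, False),
    "write": RetryPolicy(2, 20.0, True, True),
    "high_risk": RetryPolicy(1, 20.0, True, True),  # never auto-retried
}


def may_retry(
    tool_class: str, attempts_made: int, has_idempotency_key: bool
) -> bool:
    """Check the policy for this tool class before another attempt."""
    policy = POLICIES[tool_class]
    if attempts_made >= policy.max_attempts:
        return False
    if policy.requires_idempotency_key and not has_idempotency_key:
        return False
    return True
```

Setting `max_attempts` to 1 for the high-risk class means the first failure goes straight to escalation.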

At minimum, log:

  • original tool request,
  • failure class,
  • retry count,
  • timeout reason,
  • idempotency token or dedupe key if used,
  • and whether the final action was completed, abandoned, or escalated.

Without those logs, the team cannot tell whether the agent is resilient or just noisy.
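The fields listed above fit naturally into one structured record per tool call. The following is a minimal sketch; `ToolCallRecord` and its field names are hypothetical, and the JSON output is just one convenient format for a log pipeline.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

VALID_OUTCOMES = {"completed", "abandoned", "escalated"}


@dataclass
class ToolCallRecord:
    tool_name: str
    request: dict                    # original tool request
    failure_class: Optional[str]     # None if the call never failed
    retry_count: int
    timeout_reason: Optional[str]    # None if no timeout occurred
    idempotency_key: Optional[str]   # None if no key was used
    outcome: str                     # completed, abandoned, or escalated

    def __post_init__(self) -> None:
        # Reject records whose final state is not one of the three outcomes.
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Requiring an explicit outcome on every record makes it hard for a call to end in an unlogged state.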