Tool timeouts, retries, and idempotency for AI agents
Quick answer
Agents should retry only when the failure is recoverable, the action is safe to repeat, and the team can explain why another attempt is more likely to succeed.
If the tool can write, purchase, submit, delete, or trigger side effects, idempotency matters before retry logic does.
Why this is an agent control problem
Without explicit timeout and retry rules, agents tend to do the wrong thing in production:
- wait too long on slow tools,
- retry the wrong calls,
- duplicate side effects,
- or keep trying until the workflow looks busy enough to appear intelligent.
This is not resilience. It is uncontrolled repetition.
The three controls that matter
1. Timeouts
Every tool class needs a deadline.
If the task is interactive, timeout budgets should be tighter. If the task is backgrounded, longer budgets may be fine, but they still need a stop condition.
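A minimal sketch of a per-mode deadline, assuming the tools are Python coroutines run under `asyncio`. The names `TIMEOUT_BUDGETS`, `call_with_deadline`, and `slow_tool` are hypothetical, and the budget values are placeholders rather than recommendations.

```python
import asyncio

# Hypothetical budgets in seconds; tune per tool class and task mode.
TIMEOUT_BUDGETS = {"interactive": 5.0, "background": 60.0}


async def call_with_deadline(tool, mode, budgets=TIMEOUT_BUDGETS):
    """Await the coroutine function `tool` under the deadline for `mode`."""
    budget_s = budgets[mode]
    try:
        result = await asyncio.wait_for(tool(), timeout=budget_s)
    except asyncio.TimeoutError:
        # Stop condition: surface the timeout instead of waiting longer.
        return {"status": "timeout", "budget_s": budget_s}
    return {"status": "ok", "result": result}


async def slow_tool():
    """Stand-in for a slow upstream call."""
    await asyncio.sleep(0.05)
    return "done"
```

Returning a structured timeout result, rather than letting the call hang, gives the surrounding system something explicit to act on.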
2. Retries
Retries should happen only for failures likely to recover:
- transient network issues,
- temporary rate limits,
- or short-lived upstream instability.
They should not be the default response to ambiguous tool errors.
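One way to express this is a bounded retry loop that only catches a known transient failure type. This is a sketch under assumptions: `TransientToolError` and `AmbiguousToolError` are hypothetical exception classes that a tool wrapper would have to raise, and the backoff is plain exponential without jitter.

```python
import time


class TransientToolError(Exception):
    """Hypothetical: network blip, rate limit, or short-lived upstream issue."""


class AmbiguousToolError(Exception):
    """Hypothetical: the tool failed and the outcome is unknown."""


def call_with_retries(tool, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call `tool`, retrying only on TransientToolError, at most max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except TransientToolError:
            if attempt == max_attempts:
                raise  # attempt budget exhausted; let the caller stop or escalate
            # Exponential backoff: 0.5s, 1.0s, 2.0s, ... with the defaults.
            sleep(base_delay_s * 2 ** (attempt - 1))
        # Any other exception, including AmbiguousToolError, propagates
        # immediately because it is not caught here.
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays.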
3. Idempotency
Any tool with side effects needs a model for safe repetition. If the same request can create duplicate writes or repeated submissions, retries can make the system actively dangerous.
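As an illustration, the sketch below deduplicates repeated requests with an idempotency key. It assumes an in-memory store and a key derived from the request content; `idempotency_key` and `DedupingExecutor` are hypothetical names. Deriving the key from content also collapses intentional duplicates, so a caller-generated key per logical operation may be the better fit where the target system accepts one.

```python
import hashlib
import json


def idempotency_key(tool_name, args):
    """Derive a stable key from the tool name and its JSON-serializable args."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DedupingExecutor:
    """Runs a side-effecting action at most once per idempotency key."""

    def __init__(self):
        self._results = {}  # key -> stored result (in-memory, for illustration)

    def execute(self, tool_name, args, action):
        key = idempotency_key(tool_name, args)
        if key in self._results:
            # Replay: return the stored result without repeating the side effect.
            return self._results[key]
        result = action(**args)
        self._results[key] = result
        return result
```

With this in place, a retried write returns the first attempt's result instead of creating a second record.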
The practical retry rule
Retry only when all three are true:
- the failure class is known and transient,
- the tool action is idempotent or safely replayable,
- another attempt has a realistic chance of success.
If any of those is false, escalate or stop.
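The rule above can be sketched as a single gate that the system, not the model, evaluates before any further attempt. The failure class labels and the `FailedCall` fields are hypothetical, and "realistic chance of success" is approximated here by remaining attempts and remaining time, which is an assumption rather than a complete definition.

```python
from dataclasses import dataclass

# Hypothetical labels for failure classes treated as known and transient.
TRANSIENT_FAILURES = {"network_error", "rate_limited", "upstream_unavailable"}


@dataclass(frozen=True)
class FailedCall:
    failure_class: str
    idempotent: bool
    attempts_made: int
    max_attempts: int
    deadline_remaining_s: float


def should_retry(call):
    """Return True only when all three retry conditions hold."""
    known_transient = call.failure_class in TRANSIENT_FAILURES
    safe_to_repeat = call.idempotent
    # Assumption: attempts and time left stand in for "realistic chance".
    realistic_chance = (
        call.attempts_made < call.max_attempts and call.deadline_remaining_s > 0
    )
    return known_transient and safe_to_repeat and realistic_chance
```

When `should_retry` returns `False`, the caller chooses between escalating and stopping; this sketch deliberately leaves that decision outside the function.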
Where teams get this wrong
The most common failures are:
- using the same retry policy for read and write tools,
- letting the model decide to “try again” without system constraints,
- and ignoring whether the target system supports idempotency keys or deduplication logic.
That leads to invisible duplication and confusing audit trails.
A practical policy split
| Tool class | Healthier retry model |
|---|---|
| Read-only search or fetch | Limited retries on known transient errors |
| File or retrieval lookups | Retries with bounded attempt count and timeout budget |
| Write actions | Retry only with explicit idempotency support or human confirmation |
| High-risk side effects | Prefer escalation over autonomous retries |
This split is more useful than a single global retry count, because the cost of repeating a call differs by tool class.
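The table could be encoded as per-class configuration along these lines. The class keys, attempt counts, and timeouts are illustrative assumptions, and human confirmation is modeled only as the `"escalate"` outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int               # total attempts, including the first
    timeout_s: float                # per-attempt timeout budget
    requires_idempotency_key: bool  # retries allowed only with a key
    on_exhausted: str               # "stop" or "escalate"


# Hypothetical values; each row mirrors one row of the table above.
POLICIES = {
    "read_only": RetryPolicy(3, 10.0, False, "stop"),
    "lookup": RetryPolicy(2, 5.0, False, "stop"),
    "write": RetryPolicy(2, 15.0, True, "escalate"),
    "high_risk": RetryPolicy(1, 15.0, True, "escalate"),
}


def attempts_allowed(tool_class, has_idempotency_key):
    """Return how many attempts the policy permits for this call."""
    policy = POLICIES[tool_class]
    if policy.requires_idempotency_key and not has_idempotency_key:
        return 1  # no key, so no autonomous retry
    return policy.max_attempts
```

For example, `attempts_allowed("write", False)` returns `1`, so a write without a key gets a single attempt before the policy's `on_exhausted` action applies.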
What to log
At minimum, log:
- original tool request,
- failure class,
- retry count,
- timeout reason,
- idempotency token or dedupe key if used,
- and whether the final action was completed, abandoned, or escalated.
Without those logs, the team cannot tell whether the agent is resilient or just noisy.
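The fields listed above could be captured as one structured record per tool call. This is a sketch: `ToolCallLog` and its field names are hypothetical, and it assumes the request is JSON-serializable.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

OUTCOMES = {"completed", "abandoned", "escalated"}


@dataclass
class ToolCallLog:
    request: dict                    # original tool request
    failure_class: Optional[str]     # None if the call never failed
    retry_count: int
    timeout_reason: Optional[str]    # None if no timeout occurred
    idempotency_key: Optional[str]   # token or dedupe key, if used
    outcome: str                     # one of OUTCOMES

    def __post_init__(self):
        # Reject free-form outcomes so logs stay queryable.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    def to_json(self):
        """Serialize the record as one JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```

Restricting `outcome` to three values makes it straightforward to count how often calls end in escalation versus completion.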