Agent evals for tool-using AI systems
Quick answer
Tool-using agents should not be evaluated like ordinary chatbots. A useful eval must score the chain of operational decisions:
- whether the agent understood the task boundary;
- whether it planned a safe path;
- whether it chose the right tool;
- whether it called the tool with valid arguments;
- whether it respected approval boundaries;
- whether it handled tool failure correctly;
- whether the final state was correct and auditable.
If you only score the final answer, you will miss the failures that matter most in production.
Why tool-use evals are different
A normal prompt eval can often ask, “Was the answer correct?” A tool-using agent eval has to ask, “Was the path correct?”
The final answer may look good even when the agent:
- used the wrong data source;
- skipped a required approval;
- called a write tool with risky arguments;
- retried a write after an ambiguous failure;
- completed the task but left the system in an unsafe state;
- fabricated a summary after a failed tool call.
Those are not writing-quality problems. They are control-system problems.
The evaluation layers
A practical tool-use eval should separate at least seven layers.
| Layer | What it measures | Failure example |
|---|---|---|
| Task understanding | Did the agent identify the real goal and constraints? | Treats a read-only request as permission to update records. |
| Plan quality | Did the agent choose a safe sequence? | Writes before checking state. |
| Tool selection | Did the agent choose the right tool or no tool? | Uses web search when internal policy requires a CRM lookup. |
| Tool arguments | Were parameters correct and scoped? | Searches the wrong customer ID or date range. |
| Approval behavior | Did the agent pause when required? | Sends email, deploys code, or edits billing without confirmation. |
| Recovery behavior | Did the agent handle failure safely? | Retries a write after an unknown timeout. |
| Final state | Was the real-world outcome correct? | Response says success, but the ticket was not updated. |
Teams often combine these into one score too early. Keep them separate until the system is mature enough to know where it fails.
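As a minimal sketch of keeping the layers separate, the record below stores one pass/fail result per layer and counts failures per layer across a run. The class and function names (`LayerScores`, `failing_layers`, `failure_counts`) are hypothetical, and a real system might use graded scores rather than booleans.

```python
from dataclasses import dataclass, fields


@dataclass
class LayerScores:
    """One pass/fail result per evaluation layer, kept separate on purpose."""
    task_understanding: bool
    plan_quality: bool
    tool_selection: bool
    tool_arguments: bool
    approval_behavior: bool
    recovery_behavior: bool
    final_state: bool


def failing_layers(scores: LayerScores) -> list[str]:
    """Return the names of the layers that failed, in table order."""
    return [f.name for f in fields(scores) if not getattr(scores, f.name)]


def failure_counts(results: list[LayerScores]) -> dict[str, int]:
    """Count failures per layer across many eval cases."""
    counts = {f.name: 0 for f in fields(LayerScores)}
    for result in results:
        for name in failing_layers(result):
            counts[name] += 1
    return counts
```

Reporting `failure_counts` per layer, instead of one blended accuracy number, shows whether a regression comes from arguments, approvals, or recovery.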
The dataset should include operational cases
Many eval sets are too clean. They test happy paths and obvious instructions, then fail to catch production problems.
A stronger dataset includes:
- normal successful tasks;
- tasks that require no tool call;
- tasks where one tool is clearly wrong;
- ambiguous user requests;
- missing permissions;
- stale or conflicting records;
- tool timeouts;
- rate limits;
- validation errors;
- write actions that require approval;
- requests that should be refused or escalated;
- duplicate-action traps;
- partial success cases where final state must be verified.
The point is not to make the agent fail. The point is to learn whether the system fails safely.
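One way to keep the dataset honest is to tag every case with a category and report which categories have no cases yet. The sketch below assumes a hypothetical `EvalCase` record and category labels that mirror the list above; the label strings are illustrative, not a standard.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical category labels, one per item in the list above.
REQUIRED_CATEGORIES = {
    "happy_path", "no_tool", "wrong_tool", "ambiguous_request",
    "missing_permission", "stale_or_conflicting_record", "tool_timeout",
    "rate_limit", "validation_error", "approval_required_write",
    "refuse_or_escalate", "duplicate_action_trap", "partial_success",
}


@dataclass
class EvalCase:
    case_id: str
    category: str
    prompt: str


def coverage_report(cases: list[EvalCase]) -> dict:
    """Count cases per category and list required categories with no cases."""
    counts = Counter(case.category for case in cases)
    unknown = set(counts) - REQUIRED_CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    missing = sorted(REQUIRED_CATEGORIES - set(counts))
    return {"counts": dict(counts), "missing": missing}
```

A non-empty `missing` list is a signal that the eval set still leans toward clean paths.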
Golden traces matter more than golden answers
For tool-using agents, a golden answer is not enough. You need golden traces.
A golden trace should define:
- expected tool path;
- allowed alternative paths;
- forbidden tools;
- required approval checkpoints;
- expected arguments or argument constraints;
- acceptable retry behavior;
- required final state;
- expected user-facing explanation.
This lets evaluators distinguish a correct outcome reached safely from a correct-looking outcome reached by luck.
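A golden trace can be expressed as data and checked mechanically. The sketch below is one possible shape, not a standard format: `ToolCall`, `GoldenTrace`, and `check_trace` are hypothetical names, the tool names in the usage example are invented, and the check covers only tool path, forbidden tools, approval checkpoints, and exact-match argument constraints. Retry behavior, final state, and the user-facing explanation would need their own checks.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict
    approved: bool = False  # True if a human approved this call before it ran


@dataclass
class GoldenTrace:
    allowed_paths: list[list[str]]  # expected path plus allowed alternatives
    forbidden_tools: set[str] = field(default_factory=set)
    approval_required: set[str] = field(default_factory=set)
    required_args: dict[str, dict] = field(default_factory=dict)  # tool -> exact values


def check_trace(calls: list[ToolCall], golden: GoldenTrace) -> list[str]:
    """Return a list of violations; an empty list means the trace conforms."""
    violations = []
    path = [call.tool for call in calls]
    if path not in golden.allowed_paths:
        violations.append(f"unexpected tool path: {path}")
    for call in calls:
        if call.tool in golden.forbidden_tools:
            violations.append(f"forbidden tool: {call.tool}")
        if call.tool in golden.approval_required and not call.approved:
            violations.append(f"missing approval: {call.tool}")
        for key, expected in golden.required_args.get(call.tool, {}).items():
            if call.args.get(key) != expected:
                violations.append(f"bad argument {key!r} for {call.tool}")
    return violations
```

For example, with a golden trace that expects a CRM lookup followed by an approved email:

```python
golden = GoldenTrace(
    allowed_paths=[["crm_lookup", "send_email"]],
    forbidden_tools={"web_search"},
    approval_required={"send_email"},
    required_args={"crm_lookup": {"customer_id": "C-42"}},
)
trace = [ToolCall("web_search", {"q": "acme"}), ToolCall("send_email", {"to": "a@example.com"})]
print(check_trace(trace, golden))
```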
Scoring rubric
Use a rubric that can diagnose failures.
| Score area | Example scoring question |
|---|---|
| Plan | Did the agent choose a sequence that protects data and user intent? |
| Tool choice | Did it use the right tool, avoid unnecessary tools, and avoid forbidden tools? |
| Arguments | Were IDs, filters, dates, permissions, and payload fields correct? |
| Approval | Did it ask before high-risk side effects? |
| Failure handling | Did it stop, retry, or escalate according to policy? |
| Final outcome | Was the task completed correctly in the external system? |
| Explanation | Did the user get a truthful summary of what happened? |
For high-risk workflows, a single approval failure should be a hard fail even if every other score is high.
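The hard-fail rule can be made explicit in the grading code so that an average never hides an approval failure. This is a sketch under assumptions: each rubric area is scored from 0.0 to 1.0, the overall score is an unweighted mean, and the 0.8 pass threshold is a placeholder. `grade_case` and `RUBRIC_AREAS` are hypothetical names.

```python
RUBRIC_AREAS = (
    "plan", "tool_choice", "arguments", "approval",
    "failure_handling", "final_outcome", "explanation",
)


def grade_case(scores: dict[str, float], high_risk: bool,
               pass_threshold: float = 0.8) -> dict:
    """Combine per-area scores, treating an approval failure as a hard fail
    for high-risk workflows regardless of the other scores."""
    missing = [area for area in RUBRIC_AREAS if area not in scores]
    if missing:
        raise ValueError(f"missing rubric areas: {missing}")
    mean = sum(scores[area] for area in RUBRIC_AREAS) / len(RUBRIC_AREAS)
    hard_fail = high_risk and scores["approval"] < 1.0
    return {
        "mean": mean,
        "hard_fail": hard_fail,
        "passed": (not hard_fail) and mean >= pass_threshold,
    }
```

With every area at 1.0 except `approval` at 0.0, the mean is 6/7 and clears the placeholder threshold, yet a high-risk case still fails because `hard_fail` is set.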
What to evaluate by workflow type
| Workflow | Highest-value eval focus |
|---|---|
| Customer support agent | Policy compliance, CRM lookup correctness, escalation, safe note creation. |
| Coding agent | Patch scope, test behavior, approval boundaries, no unrelated reversions. |
| Research agent | Source choice, citation accuracy, query strategy, uncertainty handling. |
| Sales or RevOps agent | Account matching, CRM writes, duplicate prevention, communication approval. |
| Data analyst agent | Query correctness, schema awareness, privacy boundaries, calculation accuracy. |
| Infrastructure agent | Permission boundaries, deployment gates, rollback planning, audit logs. |
The tool-use eval should reflect what a mistake would actually cost.
Human review is still part of the loop
Automated graders are useful, but tool-use evals often need human review for edge cases:
- Was the plan reasonable under uncertainty?
- Was the approval request clear enough?
- Did the agent expose the right risk?
- Did the trace show enough evidence?
- Did the final answer overstate success?
Human review should not be unstructured opinion. Reviewers need a rubric, labeled examples, and a process for resolving disagreements so the eval set improves over time.
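Disagreement review needs a way to find the cases reviewers disagree on. The sketch below assumes review labels arrive as `(case_id, reviewer, verdict)` tuples, which is a hypothetical format; it surfaces split cases and reports a simple agreement rate rather than a chance-corrected statistic.

```python
from collections import defaultdict


def _votes_by_case(labels):
    """Group (case_id, reviewer, verdict) tuples into {case_id: {reviewer: verdict}}."""
    by_case = defaultdict(dict)
    for case_id, reviewer, verdict in labels:
        by_case[case_id][reviewer] = verdict
    return by_case


def disagreements(labels) -> dict:
    """Return the cases where reviewers gave different verdicts."""
    return {
        case_id: votes
        for case_id, votes in _votes_by_case(labels).items()
        if len(set(votes.values())) > 1
    }


def agreement_rate(labels):
    """Fraction of multiply-reviewed cases where all reviewers agree.
    Returns None when no case has at least two reviewers."""
    multi = [v for v in _votes_by_case(labels).values() if len(v) >= 2]
    if not multi:
        return None
    agreed = sum(1 for votes in multi if len(set(votes.values())) == 1)
    return agreed / len(multi)
```

Cases returned by `disagreements` are good candidates for rubric clarification or for new labeled examples.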
Release gates
Agent evals become valuable when they control releases.
Useful gates include:
- no regression on high-risk approval cases;
- no increase in forbidden tool calls;
- no increase in duplicate-write attempts;
- minimum pass rate on core happy paths;
- minimum pass rate on failure recovery cases;
- trace completeness threshold;
- cost and latency budget threshold.
If evals do not affect release decisions, they become dashboards instead of controls.
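The gates above can be written as a function that compares a candidate run against a baseline and returns the reasons to block a release. This is an illustrative sketch: `RunMetrics`, `gate_failures`, and every threshold value are assumptions to be tuned per workflow and risk class.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    approval_pass_rate: float       # on high-risk approval cases
    forbidden_tool_calls: int
    duplicate_write_attempts: int
    happy_path_pass_rate: float
    recovery_pass_rate: float
    trace_completeness: float       # fraction of cases with a complete trace
    mean_cost_usd: float
    p95_latency_s: float


# Hypothetical absolute thresholds.
MIN_HAPPY_PATH = 0.95
MIN_RECOVERY = 0.90
MIN_TRACE_COMPLETENESS = 0.99
MAX_MEAN_COST_USD = 0.50
MAX_P95_LATENCY_S = 30.0


def gate_failures(baseline: RunMetrics, candidate: RunMetrics) -> list[str]:
    """Return the gates the candidate fails; an empty list means release may proceed."""
    failures = []
    # Relative gates: the candidate must not be worse than the baseline.
    if candidate.approval_pass_rate < baseline.approval_pass_rate:
        failures.append("regression on high-risk approval cases")
    if candidate.forbidden_tool_calls > baseline.forbidden_tool_calls:
        failures.append("more forbidden tool calls")
    if candidate.duplicate_write_attempts > baseline.duplicate_write_attempts:
        failures.append("more duplicate-write attempts")
    # Absolute gates: the candidate must clear fixed minimums and budgets.
    if candidate.happy_path_pass_rate < MIN_HAPPY_PATH:
        failures.append("happy-path pass rate below minimum")
    if candidate.recovery_pass_rate < MIN_RECOVERY:
        failures.append("failure recovery pass rate below minimum")
    if candidate.trace_completeness < MIN_TRACE_COMPLETENESS:
        failures.append("trace completeness below threshold")
    if candidate.mean_cost_usd > MAX_MEAN_COST_USD:
        failures.append("cost budget exceeded")
    if candidate.p95_latency_s > MAX_P95_LATENCY_S:
        failures.append("latency budget exceeded")
    return failures
```

Wiring the result into CI, so that a non-empty list fails the build, is what turns the eval from a dashboard into a control.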
Common mistakes
Avoid these patterns:
- scoring only the final response;
- mixing read-only and write workflows in one accuracy number;
- using only synthetic happy paths;
- ignoring tool arguments;
- ignoring no-tool cases;
- allowing the model to self-grade risky actions without trace evidence;
- treating retries as success when they create duplicate side effects;
- failing to preserve production failures as regression tests.
The best eval set grows from real incidents, near misses, and support escalations.
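To avoid counting a retry as a success when it repeats a side effect, a grader can scan the trace for repeated write calls. The sketch below assumes a trace is a list of `(tool_name, args)` pairs and that the set of write tools is known; the tool names are invented. Identical arguments are only a heuristic for a duplicate side effect, since a system with idempotency keys may tolerate a repeated call safely.

```python
import json

# Hypothetical set of tools that cause side effects.
WRITE_TOOLS = {"create_ticket", "send_email", "issue_refund"}


def duplicate_writes(calls):
    """Return (index, tool) for each write call that repeats an earlier
    write with the same tool and the same arguments."""
    seen = set()
    dupes = []
    for index, (tool, args) in enumerate(calls):
        if tool not in WRITE_TOOLS:
            continue  # repeated reads are not side effects
        # Serialize with sorted keys so argument order does not matter.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            dupes.append((index, tool))
        else:
            seen.add(key)
    return dupes
```

A case that ends in the correct final answer but returns a non-empty list here should still be marked as a failure and kept as a regression test.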
Implementation checklist
Your tool-use eval system is probably healthy when:
- every workflow has a risk class;
- each eval case defines allowed and forbidden tools;
- tool arguments are scored, not only tool names;
- approval behavior has hard-fail cases;
- failure recovery cases are included;
- traces are stored and reviewable;
- production failures become regression cases;
- release gates block changes that weaken safety or reliability.