Agent evals for tool-using AI systems

Tool-using agents should not be evaluated like ordinary chatbots. A useful eval must score the chain of operational decisions:

  • whether the agent understood the task boundary;
  • whether it planned a safe path;
  • whether it chose the right tool;
  • whether it called the tool with valid arguments;
  • whether it respected approval boundaries;
  • whether it handled tool failure correctly;
  • whether the final state was correct and auditable.

If you only score the final answer, you will miss the failures that matter most in production.

A normal prompt eval can often ask, “Was the answer correct?” A tool-using agent eval has to ask, “Was the path correct?”

The final answer may look good even when the agent:

  • used the wrong data source;
  • skipped a required approval;
  • called a write tool with risky arguments;
  • retried an ambiguous failure;
  • completed the task but left the system in an unsafe state;
  • fabricated a summary after a failed tool call.

Those are not writing-quality problems. They are control-system problems.

A practical tool-use eval should separate at least the seven layers below.

| Layer | What it measures | Failure example |
| --- | --- | --- |
| Task understanding | Did the agent identify the real goal and constraints? | Treats a read-only request as permission to update records. |
| Plan quality | Did the agent choose a safe sequence? | Writes before checking state. |
| Tool selection | Did the agent choose the right tool or no tool? | Uses web search when internal policy requires a CRM lookup. |
| Tool arguments | Were parameters correct and scoped? | Searches the wrong customer ID or date range. |
| Approval behavior | Did the agent pause when required? | Sends email, deploys code, or edits billing without confirmation. |
| Recovery behavior | Did the agent handle failure safely? | Retries a write after an unknown timeout. |
| Final state | Was the real-world outcome correct? | Response says success, but the ticket was not updated. |

Teams often combine these into one score too early. Keep them separate until you have enough data to know which layer the system fails at.
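One way to keep the layers separate is to record a pass/fail result per layer for each case and aggregate failures by layer rather than into one number. The sketch below is a minimal illustration; the `LayerScores` class, the layer identifiers, and `failure_counts` are hypothetical names, not part of any particular eval framework.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Layer identifiers mirror the table above; the exact names are an assumption.
LAYERS = (
    "task_understanding",
    "plan_quality",
    "tool_selection",
    "tool_arguments",
    "approval_behavior",
    "recovery_behavior",
    "final_state",
)


@dataclass
class LayerScores:
    """Per-layer pass/fail results for one eval case (hypothetical schema)."""

    case_id: str
    passed: dict[str, bool] = field(default_factory=dict)

    def failed_layers(self) -> list[str]:
        # A layer with no recorded result counts as failed, so gaps stay visible.
        return [layer for layer in LAYERS if not self.passed.get(layer, False)]


def failure_counts(results: list[LayerScores]) -> dict[str, int]:
    """Count failures per layer across a run instead of blending them."""
    counts = {layer: 0 for layer in LAYERS}
    for result in results:
        for layer in result.failed_layers():
            counts[layer] += 1
    return counts
```

Reading the per-layer counts side by side shows whether a regression comes from, say, argument errors or approval behavior, which a single blended score would hide.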

The dataset should include operational cases

Many eval sets are too clean. They test happy paths and obvious instructions, then fail to catch production problems.

A stronger dataset includes:

  • normal successful tasks;
  • tasks that require no tool call;
  • tasks where one tool is clearly wrong;
  • ambiguous user requests;
  • missing permissions;
  • stale or conflicting records;
  • tool timeouts;
  • rate limits;
  • validation errors;
  • write actions that require approval;
  • requests that should be refused or escalated;
  • duplicate-action traps;
  • partial success cases where final state must be verified.

The point is not to make the agent fail. The point is to learn whether the system fails safely.
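A simple coverage check can flag when a dataset has drifted back toward happy paths. The sketch below assumes each case carries a single category tag; the category identifiers, the `EvalCase` fields, and `coverage_gaps` are illustrative names chosen to mirror the list above.

```python
from collections import Counter
from dataclasses import dataclass

# One tag per item in the list above; the identifiers are an assumption.
REQUIRED_CATEGORIES = {
    "happy_path",
    "no_tool",
    "wrong_tool_trap",
    "ambiguous_request",
    "missing_permission",
    "stale_or_conflicting_record",
    "tool_timeout",
    "rate_limit",
    "validation_error",
    "approval_required_write",
    "refuse_or_escalate",
    "duplicate_action_trap",
    "partial_success",
}


@dataclass(frozen=True)
class EvalCase:
    """Minimal eval case record (hypothetical schema)."""

    case_id: str
    category: str
    prompt: str


def coverage_gaps(cases, minimum_per_category=1):
    """Return the categories that have fewer cases than the minimum."""
    counts = Counter(case.category for case in cases)
    return sorted(c for c in REQUIRED_CATEGORIES if counts[c] < minimum_per_category)
```

Running `coverage_gaps` whenever the dataset changes gives a quick list of operational scenarios that still have no cases.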

Golden traces matter more than golden answers

For tool-using agents, a golden answer is not enough. You need golden traces.

A golden trace should define:

  • expected tool path;
  • allowed alternative paths;
  • forbidden tools;
  • required approval checkpoints;
  • expected arguments or argument constraints;
  • acceptable retry behavior;
  • required final state;
  • expected user-facing explanation.

This lets evaluators distinguish a correct outcome reached safely from a correct-looking outcome reached by luck.
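A golden trace can be expressed as data and checked mechanically against the recorded tool calls. The sketch below covers only the tool path, forbidden tools, approval checkpoints, and argument constraints; final state and the user-facing explanation would need separate checks. The `ToolCall` and `GoldenTrace` classes and the tool names used with them are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded tool invocation from an agent run (hypothetical shape)."""

    tool: str
    args: dict
    approved: bool = False  # True if a human approved the call before it ran


@dataclass
class GoldenTrace:
    """Expectations for one eval case (hypothetical schema)."""

    allowed_paths: list  # expected tool path plus allowed alternatives
    forbidden_tools: set = field(default_factory=set)
    approval_required: set = field(default_factory=set)
    required_args: dict = field(default_factory=dict)  # tool -> {arg: value}


def check_trace(golden, calls):
    """Return a list of violations; an empty list means the trace conforms."""
    violations = []
    path = [call.tool for call in calls]
    if path not in golden.allowed_paths:
        violations.append(f"unexpected tool path: {path}")
    for call in calls:
        if call.tool in golden.forbidden_tools:
            violations.append(f"forbidden tool: {call.tool}")
        if call.tool in golden.approval_required and not call.approved:
            violations.append(f"missing approval: {call.tool}")
        for key, expected in golden.required_args.get(call.tool, {}).items():
            if call.args.get(key) != expected:
                violations.append(f"bad argument {key!r} for {call.tool}")
    return violations
```

Because retries change the tool path, an unexpected retry surfaces as a path violation unless the retried sequence is listed in `allowed_paths`.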

Use a rubric that can diagnose failures.

| Score area | Example scoring question |
| --- | --- |
| Plan | Did the agent choose a sequence that protects data and user intent? |
| Tool choice | Did it use the right tool, avoid unnecessary tools, and avoid forbidden tools? |
| Arguments | Were IDs, filters, dates, permissions, and payload fields correct? |
| Approval | Did it ask before high-risk side effects? |
| Failure handling | Did it stop, retry, or escalate according to policy? |
| Final outcome | Was the task completed correctly in the external system? |
| Explanation | Did the user get a truthful summary of what happened? |

For high-risk workflows, a single approval failure should be a hard fail even if every other score is high.
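The hard-fail rule can be applied before any averaging, so a strong mean cannot mask an approval failure. The sketch below assumes each rubric area is scored from 0.0 to 1.0; the `PASS_THRESHOLD` value and the `score_case` function are illustrative assumptions, not recommended settings.

```python
RUBRIC_AREAS = (
    "plan",
    "tool_choice",
    "arguments",
    "approval",
    "failure_handling",
    "final_outcome",
    "explanation",
)

PASS_THRESHOLD = 0.8  # assumed cutoff for illustration only


def score_case(scores, high_risk):
    """Combine rubric scores (0.0 to 1.0 each) with an approval hard fail."""
    missing = [area for area in RUBRIC_AREAS if area not in scores]
    if missing:
        raise ValueError(f"missing rubric areas: {missing}")
    mean = sum(scores[area] for area in RUBRIC_AREAS) / len(RUBRIC_AREAS)
    # In high-risk workflows, anything short of a perfect approval score fails the case.
    hard_fail = high_risk and scores["approval"] < 1.0
    return {
        "mean": mean,
        "hard_fail": hard_fail,
        "passed": (not hard_fail) and mean >= PASS_THRESHOLD,
    }
```

With every area at 1.0 except approval at 0.0, the mean is 6/7 (about 0.86), which clears the assumed threshold, yet the case still fails when `high_risk` is true.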

| Workflow | Highest-value eval focus |
| --- | --- |
| Customer support agent | Policy compliance, CRM lookup correctness, escalation, safe note creation. |
| Coding agent | Patch scope, test behavior, approval boundaries, no unrelated reversions. |
| Research agent | Source choice, citation accuracy, query strategy, uncertainty handling. |
| Sales or RevOps agent | Account matching, CRM writes, duplicate prevention, communication approval. |
| Data analyst agent | Query correctness, schema awareness, privacy boundaries, calculation accuracy. |
| Infrastructure agent | Permission boundaries, deployment gates, rollback planning, audit logs. |

The tool-use eval should reflect what a mistake would actually cost.

Automated graders are useful, but tool-use evals often need human review for edge cases:

  • Was the plan reasonable under uncertainty?
  • Was the approval request clear enough?
  • Did the agent expose the right risk?
  • Did the trace show enough evidence?
  • Did the final answer overstate success?

Human review should not be unstructured opinion. Reviewers need a rubric, labeled examples, and a disagreement review process so the eval set improves over time.
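Disagreement review needs a way to find the cases where reviewers diverged. The sketch below assumes review labels arrive as `(case_id, reviewer, verdict)` tuples; that shape and the `disagreement_queue` function are hypothetical.

```python
from collections import defaultdict


def disagreement_queue(labels):
    """Return the IDs of cases whose reviewers gave different verdicts.

    `labels` is an iterable of (case_id, reviewer, verdict) tuples.
    """
    by_case = defaultdict(dict)
    for case_id, reviewer, verdict in labels:
        by_case[case_id][reviewer] = verdict  # a reviewer's latest verdict wins
    return sorted(
        case_id
        for case_id, verdicts in by_case.items()
        if len(set(verdicts.values())) > 1
    )
```

Cases in the queue are candidates for a rubric clarification or a new labeled example.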

Agent evals become valuable when they control releases.

Useful gates include:

  • no regression on high-risk approval cases;
  • no increase in forbidden tool calls;
  • no increase in duplicate-write attempts;
  • minimum pass rate on core happy paths;
  • minimum pass rate on failure recovery cases;
  • trace completeness threshold;
  • cost and latency budget threshold.

If evals do not affect release decisions, they become dashboards instead of controls.
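The gates above can be written as a function that compares a candidate run with a baseline and returns the reasons a release should be blocked. This is a sketch under stated assumptions: the `RunMetrics` and `Limits` fields are illustrative, and the default thresholds are placeholders rather than recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate results of one eval run (field names are illustrative)."""

    approval_failures_high_risk: int
    forbidden_tool_calls: int
    duplicate_write_attempts: int
    happy_path_pass_rate: float
    recovery_pass_rate: float
    trace_completeness: float
    mean_cost_usd: float
    p95_latency_s: float


@dataclass(frozen=True)
class Limits:
    """Absolute thresholds; the defaults are placeholders, not recommendations."""

    min_happy_path_pass_rate: float = 0.95
    min_recovery_pass_rate: float = 0.90
    min_trace_completeness: float = 0.99
    max_mean_cost_usd: float = 0.50
    max_p95_latency_s: float = 60.0


def release_blockers(baseline, candidate, limits):
    """Return the reasons to block a release; an empty list means it may ship."""
    blockers = []
    # Regression gates: the candidate must not be worse than the baseline.
    if candidate.approval_failures_high_risk > baseline.approval_failures_high_risk:
        blockers.append("regression on high-risk approval cases")
    if candidate.forbidden_tool_calls > baseline.forbidden_tool_calls:
        blockers.append("increase in forbidden tool calls")
    if candidate.duplicate_write_attempts > baseline.duplicate_write_attempts:
        blockers.append("increase in duplicate-write attempts")
    # Absolute gates: the candidate must clear fixed thresholds.
    if candidate.happy_path_pass_rate < limits.min_happy_path_pass_rate:
        blockers.append("happy path pass rate below minimum")
    if candidate.recovery_pass_rate < limits.min_recovery_pass_rate:
        blockers.append("failure recovery pass rate below minimum")
    if candidate.trace_completeness < limits.min_trace_completeness:
        blockers.append("trace completeness below threshold")
    if candidate.mean_cost_usd > limits.max_mean_cost_usd:
        blockers.append("cost budget exceeded")
    if candidate.p95_latency_s > limits.max_p95_latency_s:
        blockers.append("latency budget exceeded")
    return blockers
```

A CI job could fail the build whenever the returned list is non-empty, which is what turns the eval into a control rather than a dashboard.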

Avoid these patterns:

  • scoring only the final response;
  • mixing read-only and write workflows in one accuracy number;
  • using only synthetic happy paths;
  • ignoring tool arguments;
  • ignoring no-tool cases;
  • allowing the model to self-grade risky actions without trace evidence;
  • treating retries as success when they create duplicate side effects;
  • failing to preserve production failures as regression tests.

The best eval set grows from real incidents, near misses, and support escalations.

Your tool-use eval system is probably healthy when:

  • every workflow has a risk class;
  • each eval case defines allowed and forbidden tools;
  • tool arguments are scored, not only tool names;
  • approval behavior has hard-fail cases;
  • failure recovery cases are included;
  • traces are stored and reviewable;
  • production failures become regression cases;
  • release gates block changes that weaken safety or reliability.
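Several items on this checklist can be enforced with a lint over the eval case definitions themselves. The sketch below assumes cases are stored as dictionaries; the field names, the risk class values, and `lint_case` are hypothetical.

```python
# Field names and risk classes are assumptions made for this illustration.
REQUIRED_FIELDS = ("risk_class", "allowed_tools", "forbidden_tools", "argument_checks")
RISK_CLASSES = {"low", "medium", "high"}


def lint_case(case):
    """Return the checklist problems found in one eval case definition."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in case]
    risk = case.get("risk_class")
    if risk is not None and risk not in RISK_CLASSES:
        problems.append(f"unknown risk class: {risk}")
    # High-risk cases must declare that an approval failure is a hard fail.
    if risk == "high" and not case.get("approval_hard_fail", False):
        problems.append("high-risk case lacks an approval hard fail")
    return problems
```

Running the lint over every case before an eval run catches definitions that would let tool names pass without argument checks or let a high-risk case go without an approval hard fail.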