
Tool-call success rates and ground truth for agent evals


Teams often say an agent “worked” because the final answer looked plausible. That is not enough once the system starts using search, retrieval, file access, browser control, or internal APIs. A tool-using workflow can fail long before the last answer. If evals only grade the answer, the team never learns whether the failure came from the wrong tool, the wrong arguments, the wrong sequence, or a weak approval decision.

Tool-using agent evals should measure at least three layers:

  1. tool-call success: did the tool run correctly and return the needed data?
  2. workflow success: did the agent choose and sequence tools correctly?
  3. task success: did the overall outcome satisfy the user need or business rule?

If those layers are collapsed into one pass/fail score, the eval system will hide the exact engineering work that needs to change.
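The three layers above can be kept separate in the scoring code itself. A minimal sketch, with hypothetical field and function names, that reports a pass rate per layer instead of one collapsed score:

```python
from dataclasses import dataclass

# Hypothetical sketch: one eval record scored at all three layers,
# rather than a single collapsed pass/fail.
@dataclass
class LayeredResult:
    tool_call_success: bool   # did each tool run correctly and return usable data?
    workflow_success: bool    # were tools chosen and sequenced correctly?
    task_success: bool        # did the outcome satisfy the user need or rule?

def summarize(results: list[LayeredResult]) -> dict[str, float]:
    """Report a pass rate per layer so failures stay diagnosable."""
    n = len(results)
    return {
        "tool_call": sum(r.tool_call_success for r in results) / n,
        "workflow": sum(r.workflow_success for r in results) / n,
        "task": sum(r.task_success for r in results) / n,
    }
```

A gap between the layer rates is the signal: a high tool-call rate with a low task rate points at sequencing or synthesis, not at the tools.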

A tool call should usually be graded successful only if all of these are true:

  • the tool selected was acceptable for the step;
  • required arguments were present and materially correct;
  • the call completed without invalid side effects;
  • the returned data was actually usable for the next step.

These criteria matter because an HTTP 200 response is not the same thing as a successful tool call, let alone workflow success.
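The four criteria can be checked mechanically. A sketch under assumed field names (`acceptable_tools`, `required_args`, `output_usable` are illustrative, not a real schema):

```python
# Hypothetical grader for a single tool call against per-step ground truth.
# All field names here are illustrative assumptions.
def grade_tool_call(call: dict, truth: dict) -> bool:
    # 1. the tool selected was acceptable for the step
    tool_ok = call["tool"] in truth["acceptable_tools"]
    # 2. required arguments were present and materially correct
    args_ok = all(
        call["args"].get(k) == v for k, v in truth["required_args"].items()
    )
    # 3. the call completed without invalid side effects
    no_bad_effects = not call.get("invalid_side_effects", False)
    # 4. the returned data was actually usable for the next step
    output_usable = truth["output_usable"](call["output"])
    # An HTTP 200 that fails any of these is still a failed call.
    return tool_ok and args_ok and no_bad_effects and output_usable
```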

| Layer | What ground truth should describe |
| --- | --- |
| Tool layer | Which tool should have been used, with what argument expectations |
| Workflow layer | What sequence, approvals, retries, or fallbacks were acceptable |
| Outcome layer | What answer, state change, or artifact was the real desired result |

Teams often have outcome labels but no tool-layer or workflow-layer ground truth. That makes diagnosis weak: failures can be counted, but not attributed.
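A ground-truth record that covers all three layers might look like the following. This is a hypothetical shape, not a standard schema; every key and value is an illustrative assumption:

```python
# Hypothetical ground-truth record spanning all three layers.
# Keys and values are illustrative, not a real schema.
ground_truth = {
    "tool_layer": {
        "acceptable_tools": ["order_lookup"],
        "required_args": {"order_id": "A-1043"},
    },
    "workflow_layer": {
        "acceptable_sequences": [["order_lookup", "refund_policy_check"]],
        "max_retries": 2,
        "approval_required": True,
    },
    "outcome_layer": {
        "expected_state_change": "refund_issued",
        "expected_answer_contains": "refund",
    },
}
```

With only the `outcome_layer` block, a failed case says nothing about whether the lookup, the sequencing, or the approval step broke.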

At minimum, split failures into these buckets:

  • wrong tool chosen;
  • correct tool, bad arguments;
  • correct tool and arguments, bad sequencing;
  • approval or permission failure;
  • good trace, weak synthesis or final answer;
  • infrastructure failure such as timeout or stale dependency.

This taxonomy matters because each bucket points to a different owner: prompting or policy, tool schema design, runtime reliability, or evaluation data quality.
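The bucket-to-owner mapping can live directly in triage tooling. A sketch with assumed bucket names and owner labels drawn from the taxonomy above:

```python
# Hypothetical mapping from failure bucket to owning workstream.
# Bucket keys and owner labels are illustrative assumptions.
BUCKET_OWNER = {
    "wrong_tool": "prompting or policy",
    "bad_arguments": "tool schema design",
    "bad_sequencing": "prompting or policy",
    "approval_failure": "prompting or policy",
    "weak_synthesis": "prompting or policy",
    "infrastructure": "runtime reliability",
}

def triage(bucket: str) -> str:
    """Route a labeled failure to its owner; unknown labels flag eval-data gaps."""
    return BUCKET_OWNER.get(bucket, "evaluation data quality: unclassified label")
```

The point is not the specific mapping but that every failure lands with exactly one owner, so fixes do not stall in ambiguity.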

Strong eval sets for tool use usually include:

  • tasks with one clearly right tool choice;
  • tasks with multiple plausible tools but one better path;
  • tasks where the correct behavior is to refuse or escalate;
  • tasks where tool output is noisy or incomplete;
  • tasks where retries should stop instead of continue.
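Coverage of those categories can be enforced in CI rather than trusted to memory. A minimal sketch, with hypothetical category names mirroring the list above:

```python
# Hypothetical coverage check for an eval set.
# Category names are illustrative assumptions.
REQUIRED_CATEGORIES = {
    "single_clear_tool",
    "multiple_plausible_tools",
    "refuse_or_escalate",
    "noisy_tool_output",
    "retry_should_stop",
}

def coverage_gaps(tasks: list[dict]) -> set[str]:
    """Return required categories absent from the eval set."""
    present = {t["category"] for t in tasks}
    return REQUIRED_CATEGORIES - present
```

A non-empty gap set fails the build before a happy-path-only eval set reaches production.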

If the eval set only covers happy-path calls, the team is measuring demo quality, not operating quality.