Tool-call success rates and ground truth for agent evals
Teams often say an agent “worked” because the final answer looked plausible. That is not enough once the system starts using search, retrieval, file access, browser control, or internal APIs. A tool-using workflow can fail long before the last answer. If evals only grade the answer, the team never learns whether the failure came from the wrong tool, the wrong arguments, the wrong sequence, or a weak approval decision.
Quick answer
Tool-using agent evals should measure at least three layers:
- tool-call success: did the tool run correctly and return the needed data?
- workflow success: did the agent choose and sequence tools correctly?
- task success: did the overall outcome satisfy the user need or business rule?
If those layers are collapsed into one pass/fail score, the eval system will hide the exact engineering work that needs to change.
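One way to keep the layers from collapsing is to record them separately per task. A minimal sketch, with hypothetical names (`LayeredScore`, `diagnose` are not from any real library):

```python
from dataclasses import dataclass

@dataclass
class LayeredScore:
    """Per-task eval result, scored at each layer separately (hypothetical shape)."""
    tool_call_ok: bool  # every tool call ran correctly and returned usable data
    workflow_ok: bool   # tools were chosen and sequenced acceptably
    task_ok: bool       # final outcome satisfied the user need or business rule

def diagnose(score: LayeredScore) -> str:
    """Name the first layer that failed, instead of emitting one pass/fail."""
    if not score.tool_call_ok:
        return "tool layer"
    if not score.workflow_ok:
        return "workflow layer"
    if not score.task_ok:
        return "outcome layer"
    return "pass"
```

With this shape, a dashboard can report failure rates per layer rather than a single blended number that hides where the work is.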
What should count as tool-call success
A tool call should usually be graded successful only if all of these are true:
- the tool selected was acceptable for the step;
- required arguments were present and materially correct;
- the call completed without invalid side effects;
- the returned data was actually usable for the next step.
This matters because “HTTP 200” is not the same thing as workflow success.
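The four conditions above can be expressed as one grading predicate. A sketch, assuming hypothetical record shapes for the logged call and the ground truth (neither is a real API):

```python
def tool_call_succeeded(call: dict, expected: dict) -> bool:
    """Grade one tool call against ground truth; all four conditions must hold."""
    return (
        # the tool selected was acceptable for the step
        call["tool"] in expected["acceptable_tools"]
        # required arguments were present (materially-correct checks would go here too)
        and all(k in call["args"] for k in expected["required_args"])
        # the call completed without invalid side effects
        and not call.get("invalid_side_effects", False)
        # the returned data was actually usable for the next step
        and call.get("output_usable", False)
    )
```

Note that an HTTP status alone satisfies none of these checks; a 200 response with unusable data still fails the last condition.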
The three ground-truth layers
| Layer | What ground truth should describe |
|---|---|
| Tool layer | Which tool should have been used, with what argument expectations |
| Workflow layer | What sequence, approvals, retries, or fallbacks were acceptable |
| Outcome layer | What answer, state change, or artifact was the real desired result |
Teams often have outcome labels but no tool-layer or workflow-layer truth. That makes diagnosis weak.
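A ground-truth record that covers all three layers might look like the following. This is a hypothetical shape for a made-up refund task; the field names and tool names (`search_orders`, `refund`, `escalate`) are illustrative, not a standard:

```python
# One task's ground truth, labeled at all three layers rather than outcome only.
ground_truth = {
    "tool_layer": {
        "expected_tool": "search_orders",
        "arg_expectations": {"customer_id": "must match the ticket's customer"},
    },
    "workflow_layer": {
        # more than one sequence can be acceptable; list them all
        "acceptable_sequences": [
            ["search_orders", "refund"],
            ["search_orders", "escalate"],
        ],
        "max_retries": 2,
    },
    "outcome_layer": {
        "desired_result": "refund issued or ticket escalated, with a correct summary",
    },
}
```

With only the `outcome_layer` label, a failed run tells you nothing about which earlier layer broke; the other two layers are what make diagnosis possible.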
The failure taxonomy that matters
At minimum, split failures into these buckets:
- wrong tool chosen;
- correct tool, bad arguments;
- correct tool and arguments, bad sequencing;
- approval or permission failure;
- good trace, weak synthesis or final answer;
- infrastructure failure such as timeout or stale dependency.
This taxonomy matters because each bucket points to a different owner: prompting or policy, tool schema design, runtime reliability, or evaluation data quality.
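Making the taxonomy an explicit type keeps labelers consistent and makes the owner routing mechanical. A sketch; the bucket names come from the list above, but the owner mapping is an example to adjust per team:

```python
from enum import Enum

class Failure(Enum):
    WRONG_TOOL = "wrong tool chosen"
    BAD_ARGS = "correct tool, bad arguments"
    BAD_SEQUENCE = "correct tool and arguments, bad sequencing"
    PERMISSION = "approval or permission failure"
    WEAK_SYNTHESIS = "good trace, weak synthesis or final answer"
    INFRA = "infrastructure failure such as timeout or stale dependency"

# Each bucket routes to a different owner (illustrative mapping).
OWNER = {
    Failure.WRONG_TOOL: "prompting or policy",
    Failure.BAD_ARGS: "tool schema design",
    Failure.BAD_SEQUENCE: "prompting or policy",
    Failure.PERMISSION: "policy and runtime",
    Failure.WEAK_SYNTHESIS: "prompting and evaluation data quality",
    Failure.INFRA: "runtime reliability",
}
```

Tallying failures by `Failure` bucket turns a vague "the agent failed 20% of the time" into a per-owner work queue.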
What high-value eval sets look like
Strong eval sets for tool use usually include:
- tasks with one clearly right tool choice;
- tasks with multiple plausible tools but one better path;
- tasks where the correct behavior is to refuse or escalate;
- tasks where tool output is noisy or incomplete;
- tasks where retries should stop instead of continue.
If the eval set only covers happy-path calls, the team is measuring demo quality, not operating quality.
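One cheap guard against a happy-path-only set is a coverage check over task categories. A sketch, assuming each eval task carries a hypothetical `category` tag matching the five kinds above:

```python
# Category names are illustrative; they mirror the task kinds listed above.
REQUIRED_CATEGORIES = {
    "one_right_tool",
    "multiple_plausible_tools",
    "refuse_or_escalate",
    "noisy_tool_output",
    "retry_should_stop",
}

def coverage_gaps(tasks: list[dict]) -> set[str]:
    """Return the required categories the eval set does not cover yet."""
    covered = {task["category"] for task in tasks}
    return REQUIRED_CATEGORIES - covered
```

Running this in CI against the eval set keeps the team from quietly shipping a suite that only measures demo quality.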