
Trace grading for tool-using AI agents

Trace grading means evaluating the whole agent run:

  • what it planned,
  • which tools it chose,
  • how it used them,
  • where it escalated,
  • and whether the final outcome was acceptable.

If you only score the last answer, you miss the most expensive agent failures.
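The pieces listed above can be captured in a single trace record so the whole run is gradable, not just the final answer. A minimal sketch; the schema and field names are assumptions, not any particular framework's trace format:

```python
from dataclasses import dataclass, field

# Hypothetical trace schema; every name here is illustrative.
@dataclass
class ToolCall:
    tool: str               # which tool the agent invoked
    arguments: dict         # what it passed to the tool
    approved: bool = False  # whether a required approval was granted

@dataclass
class Trace:
    plan: str                                        # what it planned
    tool_calls: list = field(default_factory=list)   # how it used tools
    escalations: list = field(default_factory=list)  # where it escalated
    final_answer: str = ""                           # the outcome to judge

trace = Trace(
    plan="Look up the order, then issue a refund",
    tool_calls=[ToolCall("orders.lookup", {"order_id": "A-123"})],
    final_answer="Refund issued for order A-123",
)
print(len(trace.tool_calls))  # → 1
```

With a record like this, a grader can inspect the plan, the calls, and the escalations alongside the answer.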

Tool-using agents can fail in ways that a final-text score hides:

  • wrong plan but lucky final answer,
  • right plan but wrong tool,
  • right tool with wrong arguments,
  • no approval when approval was required,
  • expensive or unnecessary tool use,
  • failure to stop when evidence was insufficient.

Those are system failures, not wording failures.
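Several of these failure classes can be flagged mechanically from a trace. A rough sketch with hypothetical rules; the tool names, approval list, and call-count threshold are all assumptions:

```python
# Flag system-level failures in a trace represented as a plain dict.
# Rule names, tool names, and thresholds are illustrative assumptions.
APPROVAL_REQUIRED = {"payments.refund", "db.delete"}
MAX_CALLS = 10

def failure_classes(trace):
    flags = []
    calls = trace["tool_calls"]
    for call in calls:
        # "no approval when approval was required"
        if call["tool"] in APPROVAL_REQUIRED and not call.get("approved"):
            flags.append("missing_approval")
    # "expensive or unnecessary tool use"
    if len(calls) > MAX_CALLS:
        flags.append("excessive_tool_use")
    # "failure to stop when evidence was insufficient"
    if not calls and trace.get("final_answer"):
        flags.append("answered_without_evidence")
    return flags

trace = {
    "tool_calls": [{"tool": "payments.refund", "approved": False}],
    "final_answer": "Refund issued",
}
print(failure_classes(trace))  # → ['missing_approval']
```

A final-text score would pass this run; the trace check catches the missing approval.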

A practical trace-grading rubric should usually cover:

  • Plan quality: Did the agent choose a reasonable approach for the task?
  • Tool selection: Were the right tools used and the wrong ones avoided?
  • Tool arguments: Were inputs specific and correct enough to trust the call?
  • Approval behavior: Did the agent pause, escalate, or seek review when required?
  • Outcome quality: Did the final result solve the task acceptably?

Grading along these dimensions brings evaluation much closer to the real operating risk.
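Encoding the rubric as data helps graders apply it consistently. A minimal sketch, assuming a simple pass/fail scale per dimension (the scale is an assumption; the dimension names follow the rubric above):

```python
# The five rubric dimensions as data, so every grader scores the same things.
RUBRIC = [
    "plan_quality",
    "tool_selection",
    "tool_arguments",
    "approval_behavior",
    "outcome_quality",
]

def score_rubric(scores):
    """scores: dict mapping each dimension to 0 (fail) or 1 (pass)."""
    missing = [d for d in RUBRIC if d not in scores]
    if missing:
        raise ValueError(f"ungraded dimensions: {missing}")
    failed = [d for d in RUBRIC if scores[d] == 0]
    return {"passed": not failed, "failed_dimensions": failed}

result = score_rubric({
    "plan_quality": 1, "tool_selection": 1, "tool_arguments": 0,
    "approval_behavior": 1, "outcome_quality": 1,
})
print(result["failed_dimensions"])  # → ['tool_arguments']
```

Raising on missing dimensions forces graders to score every dimension, which keeps runs comparable.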

Trace grading is especially important when:

  • the agent can call multiple tools,
  • tool usage carries real cost,
  • approvals are part of the workflow,
  • or a bad decision can still produce a superficially plausible answer.

That is why trace grading matters more as agents become more capable.

Ask:

  1. Did the agent understand what kind of task this was?
  2. Did it choose the correct tool path?
  3. Did it stop or escalate when uncertainty increased?
  4. Did it make unnecessary calls that increased spend without value?
  5. Did the trace reveal a repeatable failure class?
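Question 4 in particular can be answered with simple cost accounting over the trace. A sketch; the per-call prices are made-up assumptions:

```python
# Rough cost accounting over a trace's tool calls.
# Tool names and per-call prices are illustrative assumptions.
TOOL_COST = {"search": 0.002, "browser": 0.01, "payments.refund": 0.0}

def run_cost(tool_calls):
    # Unknown tools are priced at zero rather than guessed.
    return sum(TOOL_COST.get(call["tool"], 0.0) for call in tool_calls)

calls = [{"tool": "search"}, {"tool": "search"}, {"tool": "browser"}]
print(round(run_cost(calls), 4))  # → 0.014
```

Comparing this number against the value the run delivered makes "unnecessary calls that increased spend" a measurable finding instead of a judgment call.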

Good trace grading should explain why a run failed, not just that it failed.

Do not create a rubric so detailed that graders cannot use it consistently. A good trace rubric is:

  • tight,
  • repeatable,
  • and connected to real deployment risk.

Too many categories create noise. Too few hide system behavior.

Use output grading to decide whether the user-facing result is acceptable. Use trace grading to decide whether the agent behavior is safe, efficient, and governable. Teams that separate those two layers usually improve faster.
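The separation of the two layers can be made concrete by running two independent graders over the same run. A minimal sketch; both graders here are hypothetical placeholder checks, not real evaluators:

```python
# Two grading layers kept separate: one judges the user-facing result,
# the other judges agent behavior. Both checks are placeholder assumptions.
def grade_output(final_answer):
    # Output grading: is the user-facing result acceptable?
    return bool(final_answer.strip())

def grade_behavior(trace):
    # Trace grading: was the behavior safe, efficient, and governable?
    calls = trace["tool_calls"]
    all_approved = all(call.get("approved", True) for call in calls)
    return all_approved and len(calls) <= 10

def evaluate(trace):
    return {
        "output_ok": grade_output(trace["final_answer"]),
        "behavior_ok": grade_behavior(trace),
    }

run = {
    "final_answer": "Refund issued",
    "tool_calls": [{"tool": "payments.refund", "approved": True}],
}
print(evaluate(run))  # → {'output_ok': True, 'behavior_ok': True}
```

Reporting the two verdicts separately is what lets a team see runs that look fine to the user but were unsafe to produce, and vice versa.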