What Should an Agent Eval Scorecard Actually Measure?
Most bad scorecards fail in one of two ways. They either measure only text quality and ignore workflow reality, or they collect every possible metric and drown the team in numbers nobody uses during release decisions.
A good scorecard should make one thing easier: deciding whether the system is healthy enough to ship, expand, or keep behind review.
Start with the workflow outcome
The first score on the page should answer the obvious question: did the system actually complete the task it was supposed to complete?
That sounds simple, but many teams substitute easier proxies:
- the answer sounded plausible,
- the tool call succeeded,
- the trace was short,
- or a grader gave the final text a decent score.
Those are useful signals. None of them is the outcome.
Four metric families matter most
1. Task success
Did the workflow end in a usable result? Not “did the model respond?” but “did the user or operator get the right outcome with acceptable effort?”
2. Tool behavior
Did the agent choose the right tool, call it correctly, avoid unnecessary calls, and recover when the tool failed? Tool success is not the same as workflow success, but it often explains where workflow quality is leaking.
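As a rough illustration, the tool questions above can be turned into three rates that sit next to the outcome score without being confused with it. This is a minimal sketch: the `ToolCall` record, its field names, and the idea that a labeler supplies `expected_tool` and `necessary` are all assumptions made for the example, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool call in a trace (hypothetical record shape)."""
    chosen_tool: str         # tool the agent actually called
    expected_tool: str       # tool a labeler judged correct for this step
    succeeded: bool          # did the call return without error?
    necessary: bool          # did a labeler judge the call needed at all?
    recovered: bool = False  # after a failure, did the agent retry or escalate sensibly?


def tool_metrics(calls: list[ToolCall]) -> dict[str, float]:
    """Summarize tool behavior separately from workflow outcome."""
    total = len(calls)
    if total == 0:
        return {"selection_accuracy": 0.0, "unnecessary_call_rate": 0.0, "recovery_rate": 0.0}
    failures = [c for c in calls if not c.succeeded]
    return {
        "selection_accuracy": sum(c.chosen_tool == c.expected_tool for c in calls) / total,
        "unnecessary_call_rate": sum(not c.necessary for c in calls) / total,
        # Recovery is only defined over failed calls; with no failures we report 1.0.
        "recovery_rate": sum(c.recovered for c in failures) / len(failures) if failures else 1.0,
    }


calls = [
    ToolCall("search", "search", succeeded=True, necessary=True),
    ToolCall("search", "lookup", succeeded=True, necessary=False),
    ToolCall("lookup", "lookup", succeeded=False, necessary=True, recovered=True),
    ToolCall("lookup", "lookup", succeeded=True, necessary=True),
]
metrics = tool_metrics(calls)
```

Keeping these rates in their own block makes it easier to see when a drop in task success traces back to tool selection rather than to the final answer.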
3. Review and escalation burden
How much human work did the system create? Many agents look good on outcome rate while quietly driving up approval load, rescue work, and reviewer fatigue.
4. Risk and policy behavior
Did the system respect approval boundaries, data restrictions, and escalation rules? A workflow with high completion and bad policy behavior should still fail the release gate.
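One way to encode that rule is to check policy before completion, so a strong completion rate can never offset a violation. The sketch below assumes a hypothetical `GateInput` record and illustrative thresholds (`min_completion=0.90`, zero tolerated violations); real thresholds depend on the workflow.

```python
from dataclasses import dataclass


@dataclass
class GateInput:
    """Aggregates for one candidate version (hypothetical shape)."""
    completion_rate: float   # share of runs that finished the task
    policy_violations: int   # runs that broke an approval, data, or escalation rule


def release_gate(
    result: GateInput,
    min_completion: float = 0.90,     # illustrative threshold, not a recommendation
    max_policy_violations: int = 0,   # illustrative threshold, not a recommendation
) -> tuple[bool, str]:
    """Return (passed, reason). Policy is checked first and acts as a veto."""
    if result.policy_violations > max_policy_violations:
        return False, "blocked: policy violations"
    if result.completion_rate < min_completion:
        return False, "blocked: completion below threshold"
    return True, "pass"


high_completion_bad_policy = release_gate(GateInput(completion_rate=0.98, policy_violations=2))
clean_run = release_gate(GateInput(completion_rate=0.95, policy_violations=0))
```

The design choice here is that policy is a separate gate rather than a weighted term in a blended score, which keeps a policy failure from being averaged away.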
What teams often miss
Review cost
If one version raises reviewer load by 40 percent to gain a small outcome improvement, that is not always a win. Review burden belongs on the scorecard.
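A simple way to make that trade-off visible is to normalize review effort by successful outcomes. The numbers below are invented purely for illustration (1,000 runs, a two-point success gain, and 40 percent more review minutes), and `review_minutes_per_success` is a hypothetical helper, not a standard metric.

```python
def review_minutes_per_success(runs: int, success_rate: float, review_minutes: float) -> float:
    """Human review minutes spent per successfully completed task."""
    successes = runs * success_rate
    if successes == 0:
        return float("inf")  # no successes: every review minute was spent for nothing
    return review_minutes / successes


# Illustrative inputs only: the candidate gains 2 points of success
# but needs 40 percent more reviewer time (500 -> 700 minutes).
baseline = review_minutes_per_success(runs=1000, success_rate=0.80, review_minutes=500)
candidate = review_minutes_per_success(runs=1000, success_rate=0.82, review_minutes=700)

print(f"baseline:  {baseline:.3f} review minutes per success")
print(f"candidate: {candidate:.3f} review minutes per success")
```

Under these assumed inputs the candidate costs more reviewer time per successful task than the baseline, which is the kind of result that should be visible on the scorecard before anyone calls the new version a win.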
Recovery quality
Systems should be judged on how they behave when tools fail, inputs are missing, or ambiguity rises. Graceful escalation can be more valuable than brittle autonomy.
Consistency across slices
Average performance hides ugly clusters. The scorecard should break results by task type, customer segment, workflow path, or failure class.
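A slice breakdown can be as small as a grouped success rate. The sketch below assumes each run is a plain dict with a `success` flag and a slice key such as `task_type`; both the record shape and the example task names are hypothetical.

```python
from collections import defaultdict


def success_by_slice(runs: list[dict], key: str) -> dict[str, float]:
    """Success rate per value of `key` (for example task type or customer segment)."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run[key]] += 1
        wins[run[key]] += int(run["success"])
    return {name: wins[name] / totals[name] for name in totals}


# Made-up runs: one slice is healthy, the other is not.
runs = [
    {"task_type": "refund", "success": True},
    {"task_type": "refund", "success": True},
    {"task_type": "refund", "success": True},
    {"task_type": "address_change", "success": False},
    {"task_type": "address_change", "success": True},
    {"task_type": "address_change", "success": False},
]

overall = sum(r["success"] for r in runs) / len(runs)
by_type = success_by_slice(runs, "task_type")
```

In this toy data the overall rate is two in three, while one task type succeeds every time and the other only one time in three. The average alone would not show where the failures cluster.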
A practical scorecard structure
For most teams, one page is enough:
- primary outcome rate
- severe-failure rate
- tool-selection accuracy
- approval or escalation rate
- reviewer override rate
- time-to-resolution or time-to-completion
- slice notes for the most important failure clusters
That gives leaders enough signal to judge release risk without pretending the workflow is fully captured by one blended number.
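The one-page list above could be computed from per-run records along these lines. Treat it as a sketch under assumptions: the `Run` fields are hypothetical, tool selection is simplified to one flag per run, time-to-completion is summarized as a median, and slice notes are left as free text outside the function.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """One evaluated run (hypothetical record shape)."""
    succeeded: bool
    severe_failure: bool
    tool_selection_correct: bool  # simplification: one judgment per run
    escalated: bool               # needed approval or escalation
    reviewer_overrode: bool       # a reviewer overruled the agent's result
    seconds_to_completion: float


def build_scorecard(runs: list[Run]) -> dict[str, float]:
    """Compute each scorecard line separately; nothing is blended into one number."""
    if not runs:
        raise ValueError("scorecard needs at least one run")
    n = len(runs)

    def rate(field: str) -> float:
        return sum(getattr(r, field) for r in runs) / n

    return {
        "primary_outcome_rate": rate("succeeded"),
        "severe_failure_rate": rate("severe_failure"),
        "tool_selection_accuracy": rate("tool_selection_correct"),
        "escalation_rate": rate("escalated"),
        "reviewer_override_rate": rate("reviewer_overrode"),
        "median_seconds_to_completion": median(r.seconds_to_completion for r in runs),
    }


runs = [
    Run(True, False, True, False, False, 30.0),
    Run(True, False, True, True, False, 50.0),
    Run(False, True, False, True, True, 120.0),
    Run(True, False, True, False, False, 40.0),
]
scorecard = build_scorecard(runs)
```

Returning separate keys rather than a weighted total keeps each line individually readable during a release decision.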
When the scorecard is bad
You probably have the wrong scorecard if:
- nobody can explain why a high score should permit release,
- the team argues about the rubric every week,
- policy failures are hidden inside general quality scores,
- or reviewers keep overruling “passing” runs.