What Should an Agent Eval Scorecard Actually Measure?

Most bad scorecards fail in one of two ways. They either measure only text quality and ignore workflow reality, or they collect every possible metric and drown the team in numbers nobody uses during release decisions.

A good scorecard should make one thing easier: deciding whether the system is healthy enough to ship, expand, or keep behind review.

Start with the workflow outcome

The first score on the page should answer the obvious question: did the system actually complete the task it was supposed to complete?

That sounds simple, but many teams substitute easier proxies:

the answer sounded plausible,
the tool call succeeded,
the trace was short,
or a grader gave the final text a decent score.

Those are useful signals. None of them is the outcome.

Four metric families matter most

1. Task success

Did the workflow end in a usable result? Not “did the model respond?” but “did the user or operator get the right outcome with acceptable effort?”

2. Tool behavior

Did the agent choose the right tool, call it correctly, avoid unnecessary calls, and recover when the tool failed? Tool success is not the same as workflow success, but it often explains where workflow quality is leaking.

3. Review and escalation burden

How much human work did the system create? Many agents look good on outcome rate while quietly driving up approval load, rescue work, and reviewer fatigue.

4. Risk and policy behavior

Did the system respect approval boundaries, data restrictions, and escalation rules? A workflow with high completion and bad policy behavior should still fail the release gate.
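One way to keep policy behavior from being averaged away is to check it as its own condition in the gate. The sketch below is illustrative only: the RunResult fields, the release_gate function, and both thresholds are hypothetical names and values, not a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    task_succeeded: bool    # workflow ended in a usable result
    policy_violation: bool  # crossed an approval, data, or escalation boundary


def release_gate(runs, min_success_rate=0.90, max_policy_violations=0):
    """Return (passed, reasons). Thresholds are placeholder assumptions."""
    if not runs:
        return False, ["no runs evaluated"]

    success_rate = sum(r.task_succeeded for r in runs) / len(runs)
    violations = sum(r.policy_violation for r in runs)

    reasons = []
    if success_rate < min_success_rate:
        reasons.append(
            f"success rate {success_rate:.2f} below {min_success_rate:.2f}"
        )
    # Policy is checked independently, so a high success rate cannot offset it.
    if violations > max_policy_violations:
        reasons.append(
            f"{violations} policy violation(s), limit {max_policy_violations}"
        )
    return (not reasons), reasons


# Every run completes, but one crosses a policy boundary.
runs = [RunResult(True, False)] * 19 + [RunResult(True, True)]
passed, reasons = release_gate(runs)
print(passed, reasons)
```

In this made-up batch every task completes, yet the gate still fails because the single violation is evaluated on its own line rather than folded into a blended quality score.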

What teams often miss

Review cost

If one version raises reviewer load by 40 percent to gain a small outcome improvement, that is not always a win. Review burden belongs on the scorecard.
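A rough way to put review burden next to outcome is to express it as reviewer minutes per successful task. The numbers below are invented for illustration (1,000 tasks, a two-point outcome gain, and a 40 percent increase in review minutes), and review_minutes_per_success is a hypothetical helper.

```python
def review_minutes_per_success(tasks, success_rate, review_minutes):
    """Reviewer minutes spent for each task that ended successfully."""
    successes = tasks * success_rate
    if successes == 0:
        raise ValueError("no successful tasks to divide by")
    return review_minutes / successes


# Illustrative values only, not measurements.
baseline = review_minutes_per_success(tasks=1000, success_rate=0.80, review_minutes=500)
candidate = review_minutes_per_success(tasks=1000, success_rate=0.82, review_minutes=700)

print(f"baseline:  {baseline:.3f} reviewer minutes per success")
print(f"candidate: {candidate:.3f} reviewer minutes per success")
```

Under these assumed numbers the candidate completes slightly more tasks but costs noticeably more reviewer time per success, which is the tradeoff the scorecard should make visible.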

Recovery quality

Systems should be judged on how they behave when tools fail, inputs are missing, or ambiguity rises. Graceful escalation can be more valuable than brittle autonomy.

Consistency across slices

Average performance hides ugly clusters. The scorecard should break results by task type, customer segment, workflow path, or failure class.
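A minimal sketch of a slice breakdown, assuming each run is recorded as a dict with a success flag and a slice label. The field names, the success_by_slice helper, and the task types are hypothetical, and the data is invented to show how an average can hide a weak cluster.

```python
from collections import defaultdict


def success_by_slice(runs, key):
    """Success rate per distinct value of runs[i][key]."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [successes, count]
    for run in runs:
        bucket = totals[run[key]]
        bucket[0] += 1 if run["success"] else 0
        bucket[1] += 1
    return {name: s / n for name, (s, n) in totals.items()}


# Invented example data: refunds go well, address changes do not.
runs = (
    [{"task_type": "refund", "success": True}] * 9
    + [{"task_type": "refund", "success": False}] * 1
    + [{"task_type": "address_change", "success": True}] * 2
    + [{"task_type": "address_change", "success": False}] * 3
)

overall = sum(r["success"] for r in runs) / len(runs)
by_type = success_by_slice(runs, "task_type")

print(f"overall: {overall:.2f}")
for name, rate in by_type.items():
    print(f"{name}: {rate:.2f}")
```

The same function can be called with a different key, such as a customer segment or failure class field, if the run records carry one.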

A practical scorecard structure

For most teams, one page is enough:

primary outcome rate
severe-failure rate
tool-selection accuracy
approval or escalation rate
reviewer override rate
time-to-resolution or time-to-completion
slice notes for the most important failure clusters

That gives leaders enough signal to judge release risk without pretending the workflow is fully captured by one blended number.
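The list above could be computed from per-run records along these lines. This is a sketch under assumptions: the Run fields and build_scorecard are hypothetical, tool-selection accuracy is taken as the share of tool calls labeled correct, time-to-completion is summarized by the median, and what counts as a severe failure is left to the team's own labeling.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    succeeded: bool
    severe_failure: bool         # labeled by the team's own definition
    tool_calls: int
    correct_tool_calls: int      # calls a reviewer or grader marked as the right tool
    escalated: bool              # needed approval or escalation
    reviewer_overrode: bool      # a reviewer overruled the agent's result
    seconds_to_completion: float


def build_scorecard(runs):
    """Keep each metric separate instead of blending them into one score."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    total_calls = sum(r.tool_calls for r in runs)
    correct_calls = sum(r.correct_tool_calls for r in runs)
    return {
        "primary_outcome_rate": sum(r.succeeded for r in runs) / n,
        "severe_failure_rate": sum(r.severe_failure for r in runs) / n,
        "tool_selection_accuracy": correct_calls / total_calls if total_calls else None,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        "reviewer_override_rate": sum(r.reviewer_overrode for r in runs) / n,
        "median_seconds_to_completion": median(r.seconds_to_completion for r in runs),
    }


# Invented runs for illustration.
runs = [
    Run(True, False, 3, 3, False, False, 40.0),
    Run(True, False, 2, 1, True, False, 95.0),
    Run(False, True, 4, 2, True, True, 210.0),
    Run(True, False, 1, 1, False, False, 30.0),
]
scorecard = build_scorecard(runs)
for metric, value in scorecard.items():
    print(f"{metric}: {value}")
```

Slice notes are deliberately left out of the dict; they are usually short written observations about the worst clusters rather than a single number.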

When the scorecard is bad

You probably have the wrong scorecard if:

nobody can explain why a high score should permit release,
the team argues about the rubric every week,
policy failures are hidden inside general quality scores,
or reviewers keep overruling “passing” runs.

Compare next

EvalOps release gates and scorecard ownership: use this page when the scorecard now needs owners and release consequences.

LLM graders vs human review: use this page when the team needs to decide which scorecard inputs can be automated safely.

Evaluation stacks vs manual review: use this page when the question is whether the team needs tooling around the scorecard yet.

Reader value check

This page should help a reader decide whether the eval, trace, scorecard, or monitoring signal is strong enough to support a release decision. It is not finished if it only explains vocabulary; it should change what the team approves, measures, routes, buys, logs, or refuses to automate.

Before applying the guidance, bring production traces, labeled failure examples, reviewer notes, and the exact workflow step being evaluated. Those inputs keep the decision anchored in real operating conditions instead of a generic best-practice list.

Check	What the reader should be able to answer
Signal quality	Can the team explain what behavior the signal proves, and what it does not prove?
Release use	Does the page help decide whether to ship, hold, roll back, or collect more evidence?
Failure learning	Does each miss become a reusable eval case instead of a one-off complaint?
Owner	Is there a named person or team responsible for maintaining the scorecard or review loop?

Use the page as a working review artifact: compare the current workflow against the table, mark the missing evidence, and assign an owner for the next change. If the page exposes a gap but no one owns that gap, the correct next step is not broader rollout; it is a smaller pilot, a clearer gate, or a better measurement loop.

For EvalOps pages, the useful outcome is a sharper release conversation. A reader should leave knowing which evidence belongs in the gate, which evidence belongs in incident review, and which metric is too vague to trust.