Eval-driven development for agentic products

Eval-driven development means:

  • writing or refining eval cases before changing prompts, tools, or workflows;
  • using those evals to decide whether the change should ship;
  • and keeping the eval set aligned with the failures the product actually sees.

If evaluation only happens after launch, it is not driving development. It is documenting drift.

Agentic products change along more dimensions than traditional prompt apps:

  • tool contracts,
  • approval behavior,
  • runtime orchestration,
  • retrieval and search paths,
  • and model behavior itself.

That makes “it looked good in staging” a weak quality bar. Evals need to become part of implementation, not just reporting.

| Official source | Current signal | Why it matters |
| --- | --- | --- |
| Agent evals | OpenAI now frames agent evals around end-to-end agent performance, tools, and outcomes | Evaluation is moving closer to real workflow behavior |
| Graders | OpenAI positions graders as part of a structured evaluation workflow, not only ad hoc review | Teams can operationalize evaluation earlier in the development loop |
| Agents SDK | The SDK includes tracing and evaluation-oriented workflow support | Runtime instrumentation is now tightly connected to eval practice |

What eval-driven development actually changes

Without eval-driven development, teams usually:

  • tweak prompts or tool behavior,
  • run a few hand-picked examples,
  • and ship if the demo still looks good.

With eval-driven development, teams instead:

  1. define the behavior change they want;
  2. add or update eval cases for that behavior;
  3. run the change against those cases;
  4. decide release readiness from the eval result and human review where needed.

That is a different operating rhythm.
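The four steps above can be sketched as a small release gate. This is a minimal illustration, not a real framework: `EvalCase`, `run_evals`, `release_ready`, and the stub agent are all hypothetical names, and the checks are placeholder string matches standing in for real graders.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical eval case: an input plus a check on the agent's output.
@dataclass
class EvalCase:
    name: str
    input: str
    check: Callable[[str], bool]

def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Step 3: run the candidate change against the eval cases."""
    return {c.name: c.check(agent(c.input)) for c in cases}

def release_ready(results: dict, required_pass_rate: float = 1.0) -> bool:
    """Step 4: decide release readiness from the eval result."""
    return sum(results.values()) / len(results) >= required_pass_rate

# Steps 1-2: define the behavior change and encode it as eval cases.
cases = [
    EvalCase("refuses_without_approval",
             "delete all customer records",
             lambda out: "approval" in out.lower()),
    EvalCase("answers_simple_lookup",
             "what is the refund window?",
             lambda out: "30 days" in out),
]

def stub_agent(prompt: str) -> str:
    # Stand-in for the real agent under test.
    if "delete" in prompt:
        return "This action requires approval."
    return "Refunds are accepted within 30 days."

results = run_evals(stub_agent, cases)
print(release_ready(results))  # True: every case passes
```

In practice the pass-rate threshold would be set per suite, and borderline results would route to human review rather than auto-ship.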

Eval sets serve three different jobs:

  • Exploration evals help decide whether an approach is promising enough to keep building.
  • Release evals block or allow production changes. They should be stable, owned, and hard to game.
  • Monitoring evals watch for drift, new failure modes, and regression against real traffic patterns.

The mistake is trying to make one eval set do all three jobs.
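One way to keep the three jobs separate is to tag each case with its tier and select suites by tier. A minimal sketch, assuming illustrative tier names (`EXPLORATION`, `RELEASE`, `MONITORING`) and hypothetical case names:

```python
from enum import Enum

class EvalTier(Enum):
    EXPLORATION = "exploration"   # is the approach promising enough to keep building?
    RELEASE = "release"           # gate production changes; stable, owned, hard to game
    MONITORING = "monitoring"     # watch real traffic for drift and new failure modes

# Each case carries its tier, so one suite never quietly does another's job.
cases = [
    {"name": "tool_choice_basic", "tier": EvalTier.EXPLORATION},
    {"name": "approval_boundary", "tier": EvalTier.RELEASE},
    {"name": "prod_traffic_sample_0417", "tier": EvalTier.MONITORING},
]

def suite(tier: EvalTier) -> list[str]:
    """Select only the cases that belong to one job."""
    return [c["name"] for c in cases if c["tier"] is tier]

print(suite(EvalTier.RELEASE))  # ['approval_boundary']
```

The point of the tagging is governance: release cases get owners and change control, while exploration cases can churn freely.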

The first useful release eval set usually covers:

  • representative happy-path tasks,
  • known high-cost failures,
  • approval-boundary behavior,
  • tool-choice correctness,
  • and a few difficult edge cases that product owners care about.

That is enough to shape development without creating an unmaintainable eval program on day one.
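A first release eval set along those lines might be organized by coverage area, so gaps are visible at a glance. Everything here is illustrative: the area keys mirror the list above, and the case names are hypothetical.

```python
# Hypothetical starter release eval set, grouped by coverage area.
RELEASE_EVALS = {
    "happy_path": [
        "summarize_ticket_basic",
        "answer_faq_refund",
    ],
    "high_cost_failures": [
        "no_fabricated_order_id",
    ],
    "approval_boundary": [
        "destructive_action_requires_approval",
    ],
    "tool_choice": [
        "uses_search_tool_for_prices",
    ],
    "hard_edge_cases": [
        "ambiguous_multi_account_request",
    ],
}

def coverage_report(evals: dict) -> dict:
    """Count cases per coverage area to spot thin spots before they matter."""
    return {area: len(case_names) for area, case_names in evals.items()}

print(coverage_report(RELEASE_EVALS))
```

A handful of cases per area is enough to start; the report makes it obvious when one area has quietly fallen behind the others.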