Eval datasets for coding agents and repository tasks
Quick answer
Section titled “Quick answer”Coding-agent eval datasets should look like repository work, not like generic coding questions.
That means examples should encode:
- realistic file scope,
- repository constraints,
- approval boundaries,
- tests or checks that matter,
- and the kinds of ambiguity engineers actually face in real change requests.
Why benchmark-style prompts are not enough
Section titled “Why benchmark-style prompts are not enough”A toy coding prompt may show whether a model can generate code. It says much less about whether an agent can:
- navigate a real repository,
- stay inside the allowed scope,
- choose the right files,
- avoid touching risky paths,
- and stop when the task needs human judgment.
That is why repository-aware eval data matters.
What a healthy coding-agent eval example includes
Section titled “What a healthy coding-agent eval example includes”Each example should usually define:
- the task request,
- the allowed write scope,
- the forbidden paths,
- the expected verification or checks,
- and the acceptable end state.
Without those, the eval mostly measures coding fluency rather than operational discipline.
The most valuable example classes
Section titled “The most valuable example classes”High-yield dataset classes usually include:
Local bounded change
Section titled “Local bounded change”A small feature or fix in one narrow scope.
Ambiguous repository navigation
Section titled “Ambiguous repository navigation”The agent must find the right place to work without expanding scope unnecessarily.
Approval-sensitive change
Section titled “Approval-sensitive change”The task appears small but touches CI, dependencies, or another sensitive file class that should trigger stronger approval.
Test-alignment task
Section titled “Test-alignment task”The task requires updating or adding tests in a way that reflects the change instead of editing snapshots blindly.
Refusal or escalation case
Section titled “Refusal or escalation case”The correct behavior is not to proceed automatically.
These classes reveal whether the agent behaves like a safe repository operator, not just a code generator.
The dataset design rule
Section titled “The dataset design rule”Use examples pulled from real engineering work whenever possible:
- anonymized past tasks,
- representative bug classes,
- common refactor requests,
- and real review comments that caused production confusion.
That produces a dataset with much higher operational value than synthetic prompt-only tasks.
What to score
Section titled “What to score”For coding-agent eval datasets, score at least:
- file/path selection,
- scope discipline,
- policy compliance,
- verification behavior,
- and final correctness.
The change itself matters, but so does the path the agent used to get there.
What teams usually miss
Section titled “What teams usually miss”Teams often omit negative examples:
- tasks that should be refused,
- tasks that should be escalated,
- or tasks where the request is underspecified.
Those are crucial because repository safety depends on restraint as much as capability.
Implementation checklist
Section titled “Implementation checklist”Your coding-agent eval dataset is probably healthy when:
- examples look like real repository work;
- path and policy constraints are explicit;
- escalation cases are included;
- and success requires more than producing plausible code.