May 13, 2026

Evals are the spec

Why we write evals before the agent — and how the eval set replaces the design document.

On every agent project, we now write the eval set before the agent. Not after the prototype, not in parallel with the prototype — first. The eval set is the spec, and the prompt is one of many implementations that might pass it.

The reason is boring and practical. Without an eval set, every product conversation about whether the agent is “good enough” turns into a vibes argument. With one, the conversation is short: it passes 86% of the suite, you said 90%, we ship when we get there. Stakeholders relax. Engineers stop guessing what stakeholders want.
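That conversation is short enough to write down as a gate. Here is a minimal sketch, assuming a hypothetical `ship_gate` helper and the 86% and 90% figures from the example above; the names are illustrative, not taken from any real project.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of eval cases that passed."""
    if total <= 0:
        raise ValueError("eval suite is empty")
    return passed / total


def ship_gate(passed: int, total: int, target: float = 0.90) -> tuple[bool, str]:
    """Compare the suite pass rate with the agreed target (hypothetical helper)."""
    rate = pass_rate(passed, total)
    ready = rate >= target
    verdict = "ship" if ready else "keep iterating"
    return ready, f"{rate:.0%} of suite passing, target {target:.0%}: {verdict}"


# Numbers from the example: 86 of 100 cases pass, the agreed target is 90%.
ready, message = ship_gate(passed=86, total=100)
print(message)
```

The target is an argument rather than a constant because it is the number stakeholders agreed to, and it may differ from project to project.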

What a useful eval set actually contains

  • Golden cases — 30 to 200 real inputs from production traffic, each labelled with the expected outcome by a domain expert (not by the engineering team alone).
  • Adversarial cases — the inputs that broke previous prototypes, the prompts that try to break new ones, the data that turned out to be wrong in the warehouse.
  • Stratified samples — broken out by customer segment, language, or query type, so a 90% average does not hide a 30% failure rate in one slice.
  • A grading rubric — written down, agreed in a room, stored next to the cases. Anyone can re-grade and get the same score.
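One way these four ingredients might fit together in code is sketched below. It rests on stated assumptions: `EvalCase`, `grade`, and `pass_rates_by_stratum` are hypothetical names, the exact-match rubric is a placeholder for whatever the room actually agrees on, and the sample data is synthetic, built only to illustrate the stratification point.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    kind: str        # "golden" or "adversarial"
    stratum: str     # e.g. customer segment, language, or query type
    input_text: str
    expected: str    # outcome labelled by a domain expert


def grade(case: EvalCase, actual: str) -> bool:
    """Placeholder rubric: normalized exact match.

    It is deterministic, so anyone who re-grades gets the same score.
    """
    return actual.strip().lower() == case.expected.strip().lower()


def pass_rates_by_stratum(cases, outputs):
    """Return (overall pass rate, {stratum: pass rate})."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        total[case.stratum] += 1
        passed[case.stratum] += int(grade(case, outputs[case.case_id]))
    overall = sum(passed.values()) / sum(total.values())
    return overall, {name: passed[name] / total[name] for name in total}


# Synthetic data: 80 "en" cases (76 pass) and 20 "de" cases (14 pass).
cases = []
outputs = {}
for stratum, n_cases, n_pass in [("en", 80, 76), ("de", 20, 14)]:
    for i in range(n_cases):
        case = EvalCase(f"{stratum}-{i}", "golden", stratum, "refund request", "approve")
        cases.append(case)
        # Stand-in for the agent's answer: the first n_pass cases match the label.
        outputs[case.case_id] = "approve" if i < n_pass else "deny"

overall, by_stratum = pass_rates_by_stratum(cases, outputs)
print(f"overall: {overall:.0%}")
for name, rate in by_stratum.items():
    print(f"{name}: {rate:.0%}")
```

With this synthetic data the overall pass rate is 90%, yet the `de` stratum passes only 70% of its cases. That 30% failure rate is the kind a single average would hide.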

It costs a week to build the first version. It saves several weeks of arguing later. We have not regretted it on any project.