[ PLAYBOOK · 08 ] · MAY 11, 2026 · 7 min

Evals that catch real regressions.

You do not need a 500-row eval. You need a 50-row eval that you actually run on every change. Small, versioned, binary, and grown one production failure at a time.


The take

Most teams either ship without evals or treat them as one-off scripts. Both fail in different ways. The shape that pays in production is small (around 50 rows), versioned (lives next to the prompt in git), automated (runs in CI on every change), and binary (pass or fail per row). Anything bigger is hard to maintain. Anything looser misses regressions until a customer reports them. The teams we see catching the most real bugs are not running the largest evals. They are running the smallest evals that hurt to ignore.

Why most evals fail

Three failure modes show up in almost every engagement we audit.

The first is vibes-based testing. A team lead spot-checks twenty outputs after a prompt change, decides "looks good," and ships. This works until the underlying model version changes (Anthropic releases a new Sonnet, OpenAI deprecates a snapshot, your provider rotates routing) and the spot-checks are not reproducible. There is no baseline. There is no diff. The team learns about the regression from a support ticket.

The second is the 5,000-row dump. Someone exports two months of production traces, scores them once with an LLM judge, ships a PDF to the team, and never touches the set again. The set is too large to maintain, too noisy to read, and too detached from the prompt to update. Six weeks later it represents a system that no longer exists.

The third is LLM-as-judge alone, with no human calibration. Modern judge models have a documented bias toward leniency: borderline outputs get marked as passing more often than human raters would mark them. The standard inter-rater reliability threshold (Krippendorff's alpha of roughly 0.8) is rarely met by uncalibrated judges, even when re-scoring the same outputs run to run. If the judge is the only signal, the baseline moves under the team's feet.

The shape of an eval that earns its keep

The eval that survives in production is shaped by five constraints. Each of them is a refusal of a tempting alternative.

50 rows, not 500. Fifty is large enough to span the failure modes you actually see in production and small enough that one engineer can read every row in an afternoon. At 500 rows nobody reads them; the set decays into trust-the-number territory and stops catching new failure modes. We aim for 5 to 10 rows per failure mode, with 5 to 7 modes covered.

Real production traces, not synthetic prompts. Every row is sampled from a real customer interaction that already happened. Synthetic test cases miss the long tail. The phrasing your users actually use, the malformed inputs, the trailing whitespace, the Unicode, the half-finished sentences: these are where regressions hide. An eval full of cleanly formatted English is an eval that lies.

Binary rubric, not 1-to-10 scoring. Boolean pass-fail per row is more reliable than fine-grained scoring. Pointwise scales drift between runs and between judges. The empirical finding is consistent: binary scoring reduces judge variance and produces stable cross-time comparisons. If you cannot decide pass-fail, the rubric is not specific enough.
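
For concreteness, here is a minimal sketch of one row and its rubric as code. The field names and the wrong-format category are illustrative, not a required schema:

```python
import json

# One eval row: a (redacted) production trace plus the data its rubric needs.
row = {
    "id": "trace-0042",
    "category": "wrong_format",
    "input": "Summarize this ticket as JSON with keys `title` and `severity`.",
    "expected": {"keys": ["title", "severity"]},
}

def passes(row: dict, output: str) -> bool:
    """Binary rubric for wrong_format: PASS if the output is a JSON
    object with exactly the expected keys, FAIL on anything else."""
    try:
        parsed = json.loads(output)
    except ValueError:
        return False
    return isinstance(parsed, dict) and sorted(parsed) == sorted(row["expected"]["keys"])
```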

Versioned alongside the prompt. The eval lives in the same repo as the prompt, in the same commit, behind the same code review. When the prompt moves, the eval moves with it. Any other arrangement leads to drift between what the prompt does and what the eval measures. The cost is real: the repo carries eval fixtures and possibly redacted production traces. Budget the redaction time.
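
One layout that satisfies this constraint, with illustrative names:

```
support-agent/
  prompts/
    triage.md            # the prompt under test
  evals/
    triage/
      dataset.jsonl      # ~50 redacted production traces
      rubrics.py         # one binary check per category
      baseline.json      # pass rates from the last accepted run
      run.py             # the CI gate
```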

Runs in CI on every change. Not weekly. Not when someone remembers. On every pull request that touches the prompt, the model client, or the tools the agent calls. The eval is a release gate, not a quarterly health check.

Tooling decision

The 2026 eval-tooling landscape has converged on a small set of credible options. Pick by what you actually need to measure.

Hand-rolled (pytest plus a small runner). Right for teams with under five evals, no need for eval-drift dashboards, and no stakeholder review loop. A 100-line runner that loads the dataset, calls the model, applies the rubric, and prints a diff is enough. Most teams overshoot this stage and pay for SaaS they do not yet need.
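
A sketch of that runner as a standalone script (a pytest parametrization works just as well), matching the layout and rubric examples above. `call_model` is a placeholder you wire to your provider's SDK; everything else is standard library:

```python
# evals/triage/run.py -- hand-rolled eval gate, run from the repo root.
import json
import sys
from collections import defaultdict
from pathlib import Path

from rubrics import passes  # binary check per row; dispatches on category

DATASET = Path("evals/triage/dataset.jsonl")
BASELINE = Path("evals/triage/baseline.json")

def call_model(prompt: str) -> str:
    """Placeholder: wire this to your provider's SDK of choice."""
    raise NotImplementedError

def main() -> int:
    rows = [json.loads(line) for line in DATASET.read_text().splitlines() if line.strip()]
    results: dict[str, list[bool]] = defaultdict(list)
    for row in rows:
        results[row["category"]].append(passes(row, call_model(row["input"])))

    rates = {cat: sum(r) / len(r) for cat, r in results.items()}
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}

    regressed = False
    for cat, rate in sorted(rates.items()):
        prev = baseline.get(cat)
        note = f" (was {prev:.0%})" if prev is not None else ""
        if prev is not None and rate < prev:
            note += "  <-- REGRESSION"
            regressed = True
        print(f"{cat:28s} {rate:5.0%}{note}")

    return 1 if regressed else 0  # nonzero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(main())
```

Gate the pull request on the exit code and you have the whole constraint from the previous section: every change to the prompt runs this before merge.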

Promptfoo. YAML-driven, runs in CI, free and open source, results stay local. Strong in security and red-teaming scenarios. The right second step once the team outgrows a hand-rolled runner.

Braintrust. Built around release-level enforcement: dataset management, scoring, regression tracking, CI integration. The right pick when evaluation has become a cross-team release concern rather than a single engineer's tool.

LangSmith. Framework-agnostic, but the integration is tightest if your stack is centered on LangChain or LangGraph. Auto-tracing and prompt management work out of the box there. Outside that ecosystem it works through the SDK with more setup.

Langfuse. Open-source, self-hostable in under thirty minutes, OpenTelemetry-compatible. The right call when data sovereignty or self-hosting is the constraint that drives the decision.

The pattern we see in mature teams is not "pick one." It is two tools: a lightweight runner gating CI, paired with a platform for human annotation, dataset growth, and stakeholder dashboards. Engineers gate releases on the runner; PMs and reviewers grow the set in the platform. That division of labor is what keeps evaluation sustainable as the team scales.

Anti-patterns to avoid

A few patterns look like good evaluation hygiene and are not.

LLM-as-judge alone. Calibrate against at least three human raters on a shared sample before trusting any judge. If judge-human agreement is below a Krippendorff's alpha of 0.7, short of even the 0.8 standard cited earlier, the judge is not measuring what you think.
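
A sketch of that calibration check, using the open-source krippendorff package (pip install krippendorff). The rating data is invented for illustration:

```python
import numpy as np
import krippendorff

# Rows are raters (three humans, then the LLM judge); columns are the same
# 20 outputs. 1 = pass, 0 = fail. All values here are invented for illustration.
humans = np.array([
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
])
judge = np.array([[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0]])

# First question: do the humans agree with each other? If not, fix the
# rubric before blaming the judge.
human_alpha = krippendorff.alpha(reliability_data=humans,
                                 level_of_measurement="nominal")

# Second question: does adding the judge drag agreement down?
joint_alpha = krippendorff.alpha(reliability_data=np.vstack([humans, judge]),
                                 level_of_measurement="nominal")

print(f"human-human alpha: {human_alpha:.2f}")
print(f"with judge added:  {joint_alpha:.2f}")
```

In this invented data the judge flips several borderline fails to passes, which is exactly the leniency bias described earlier; the joint alpha drops accordingly.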

100% pass rate. A perfect score does not mean the system is good. It means the eval is too easy. Add the next failure mode you saw in production. The pass rate should sit at 80% to 95%, with movement after every meaningful prompt change.

Pointwise 1-10 scoring. Looks more nuanced. Drifts more between runs. Use binary unless you have a specific cross-team reason to surface a number.

Eval set that never changes. Every regression you ship is a row you should add. The set stays sharp by growing with the system, not by sitting still.

Eval decoupled from prompt change. If the prompt moves and the eval does not, both rot. Same repo, same review, same release.

A 5-day rollout

If you are starting from no evals at all, this is the sequence we run with new clients.

Day 1. Pull 100 production traces from the last week. If you do not have logging, that is your real day-1 problem; fix it before the rest of this matters. Sort the 100 traces into "right," "wrong," and "unclear." Write down the count.

Day 2. For each "wrong" trace, label the failure mode in one sentence. Cluster the labels into 5 to 7 categories. Common categories: hallucinated fact, ignored constraint, wrong format, missed edge case, refused legitimate request, leaked unrelated context.

Day 3. Sample 5 to 10 traces per category for the eval set. You should land between 30 and 60 rows. Write a binary rubric per category, in plain English, specific enough that two engineers reading the same output would agree on pass-fail.
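
Two illustrative rubric lines, to calibrate the level of specificity (the categories and thresholds are examples, not prescriptions). Wrong format: PASS if the reply is a single JSON object that validates against the response schema, with no prose outside it. Ignored constraint: PASS if the reply respects the length limit stated in the prompt and never names a competitor. If applying a line requires a judgment call, it is not done yet.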

Day 4. Wire the set to CI. Pick one of the tools above. The first integration should fail loudly on regression and produce a diff against the previous run. Nothing fancy.

Day 5. Run the baseline. Document the pass rate per category. This is your floor. Every prompt change from now on either holds the floor or explains why it moved.
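
That floor can be the same baseline.json the runner diffs against: per-category pass rates, nothing more. Numbers illustrative:

```json
{
  "hallucinated_fact": 0.88,
  "ignored_constraint": 0.90,
  "wrong_format": 0.86,
  "missed_edge_case": 0.80,
  "refused_legitimate_request": 0.83,
  "leaked_unrelated_context": 0.90
}
```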

What to measure after week one

Once the eval is running, four metrics tell you whether it is earning its keep.

Pass rate per category over time. Track which categories drift first when prompts change or models upgrade. Drift in one category is signal; drift across all of them is noise.

Drift across model versions. When Claude Sonnet 4.6 becomes 4.8, when GPT routing changes, the eval should tell you what shifted. Without this, you are flying blind every time a vendor pushes.

Cost per eval run. Keeps the set sustainable. If a full run costs more than a coffee, you will run it less often, and it will catch fewer regressions. This is the constraint that keeps the row count honest.
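
As a rough worked example, with illustrative prices rather than anyone's current rate card: 50 rows at roughly 2,000 input and 500 output tokens each, priced at $3 per million input tokens and $15 per million output tokens, comes to 50 × ($0.006 + $0.0075), about $0.68 per run. That is cheap enough to run on every push. Ten times the rows, or a reasoning-heavy judge model in the loop, and the gate quietly stops being run.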

Time from production regression to row added. Closing this loop is the entire point. The faster regressions become rows, the faster the next regression of the same shape gets caught before users see it.

Evals are not a separate workstream. They are the part of an LLM system that knows whether the system still works. Build the smallest one that hurts to ignore. Run it on every change. Grow it from real failures. That is the whole job.