Evals — knowing your AI actually works
How to test AI output beyond 'looks fine to me'. The boring practice that separates hobby from production.
"The model usually gets it right" is fine for a chat with yourself. It's not fine when the model is writing emails to customers, classifying medical notes, or routing payments. Evals — short for evaluations — are how you find out, with numbers, whether the model does what you actually need.
What an eval actually is
An eval is three things:
- A dataset of inputs your real users send (or close to it).
- A grader that decides whether each output is good — code, a rubric, or another model.
- A score — pass rate, F1, average grade, whatever your stakeholders care about.
Run the model on the dataset. Run the grader on the outputs. Look at the score. Change the prompt / model / pipeline. Run again. Did the score go up or down?
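The loop above fits in a few lines of Python. This is a minimal sketch, not a framework: `EvalCase`, `run_eval`, `toy_model`, and `exact_match` are names made up for illustration, and `toy_model` is a stand-in for whatever real model call you would make.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    input: str
    expected: str


def run_eval(
    cases: List[EvalCase],
    model: Callable[[str], str],
    grader: Callable[[str, str], bool],
) -> float:
    """Run the model on every case, grade each output, return the pass rate."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if grader(model(case.input), case.expected))
    return passed / len(cases)


def toy_model(text: str) -> str:
    # Hypothetical stand-in for a real model call: a naive keyword classifier.
    return "positive" if "love" in text.lower() else "negative"


def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()


cases = [
    EvalCase("I love this product", "positive"),
    EvalCase("Worst purchase ever", "negative"),
    EvalCase("Not bad at all, loved it", "positive"),
    EvalCase("I would love a refund", "negative"),  # the keyword trick fails here
]

score = run_eval(cases, toy_model, exact_match)
print(f"pass rate: {score:.2f}")
```

Swap `toy_model` for a different prompt or model, re-run, and compare the two numbers. That comparison is the whole point.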
The four grader types, in increasing cost
| Grader | When to use | Cost |
|---|---|---|
| Exact match | "Did it return JSON with the right keys?" "Is the answer in this set of allowed values?" | ~0 |
| Pattern / regex | "Does the email contain a subject line?" "Is there a phone number in the output?" | ~0 |
| LLM-as-judge | "Rate this reply 1–5 for warmth and accuracy." | ~1 model call per item |
| Human review | Subjective, high-stakes, or first-time-seen tasks | The most expensive, but the most trustworthy |
Use the cheapest grader that works. Most teams start with exact-match (was the JSON valid?) before reaching for LLM-as-judge. Always sample a few items for human review — graders themselves can be wrong.
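Here is roughly what the three automated grader types from the table could look like. Treat it as a sketch under assumptions: the function names are invented for this example, the `Subject:` convention is just one way an email might mark its subject line, and `judge` is a hypothetical callable (prompt in, reply text out) standing in for a real model client.

```python
import json
import re
from typing import Callable, Optional, Set


def has_required_keys(output: str, required: Set[str]) -> bool:
    """Exact-match style check: a valid JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


# Pattern check: some line starts with "Subject: " followed by non-blank text.
SUBJECT_LINE = re.compile(r"^Subject: \S.*$", re.MULTILINE)


def has_subject_line(output: str) -> bool:
    return SUBJECT_LINE.search(output) is not None


def judge_score(output: str, rubric: str, judge: Callable[[str], str]) -> Optional[int]:
    """LLM-as-judge: ask `judge` for a 1-5 grade and parse it out of the reply.

    Returns None when no standalone digit from 1 to 5 is found, so an
    unparseable reply is not silently counted as a grade.
    """
    prompt = (
        f"{rubric}\n\nReply to grade:\n{output}\n\n"
        "Answer with a single digit from 1 to 5."
    )
    match = re.search(r"\b[1-5]\b", judge(prompt))
    return int(match.group()) if match else None
```

The first two cost nothing to run on every commit. The judge costs one model call per item, so it is worth logging the cases where it returns `None` and reviewing those by hand.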
How to assemble a good dataset
- Start with 20 examples. Real ones, from logs if possible. Twenty is usually enough to catch the obvious regressions; grow the set later as failures come in.
- Include edge cases. Empty inputs. Adversarial inputs. Multilingual. Sarcasm. The longest one a user actually sent.
- Include positives and negatives. "Should refuse" cases matter as much as "should comply" cases.
- Hand-label expected outputs. 30 minutes of careful labelling beats 10,000 auto-generated examples.
- Version it. Treat the eval set like code. Diff it. Review changes. Don't silently edit the ground truth when the model fails.
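One way to make "treat the eval set like code" concrete is to keep cases in a JSONL file under version control and record a fingerprint of the set next to every score. The sketch below assumes a made-up schema (`id`, `input`, `expected`, `tags`) and made-up example cases; the JSONL is inlined as a string only so the example is self-contained.

```python
import hashlib
import json

# Hypothetical eval cases; in practice this would be a .jsonl file in the repo.
JSONL = """\
{"id": "refund-001", "input": "Where is my refund?", "expected": "route:billing", "tags": ["short"]}
{"id": "empty-001", "input": "", "expected": "refuse", "tags": ["edge", "should_refuse"]}
"""


def load_cases(text):
    """Parse one JSON object per non-blank line and reject duplicate ids."""
    cases = [json.loads(line) for line in text.splitlines() if line.strip()]
    ids = [case["id"] for case in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case ids in eval set")
    return cases


def fingerprint(cases):
    """Short hash of the eval set, independent of case order and key order."""
    canonical = json.dumps(
        sorted(cases, key=lambda case: case["id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


cases = load_cases(JSONL)
version = fingerprint(cases)
print(f"{len(cases)} cases, eval set version {version}")
```

If a score changes and the fingerprint changed too, you know to diff the eval set before crediting or blaming the prompt.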
The flywheel
- User complaint or logged failure → add as eval case.
- Change the prompt / model / tool to fix it.
- Re-run the eval set. Score must go up without regressing other cases.
- Ship. Watch new logs. Add new failures. Repeat.
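The "score must go up without regressing other cases" step can be checked per case rather than by eyeballing two averages. A minimal sketch, assuming each run is stored as a dict of case id to pass/fail; `compare_runs` and `safe_to_ship` are hypothetical helpers, and "safe" here is defined narrowly as: nothing that used to pass now fails, and at least one case was fixed.

```python
from typing import Dict, List, Tuple


def compare_runs(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Tuple[List[str], List[str]]:
    """Return (fixed, regressed) case ids between two eval runs.

    A case missing from a run is treated as failing in that run, so a newly
    added case that passes counts as fixed.
    """
    fixed = sorted(
        cid for cid, ok in candidate.items() if ok and not baseline.get(cid, False)
    )
    regressed = sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )
    return fixed, regressed


def safe_to_ship(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    fixed, regressed = compare_runs(baseline, candidate)
    return bool(fixed) and not regressed


baseline = {"refund-001": True, "empty-001": True, "sarcasm-001": False}
good_change = {"refund-001": True, "empty-001": True, "sarcasm-001": True}
bad_change = {"refund-001": True, "empty-001": False, "sarcasm-001": True}

print(compare_runs(baseline, good_change))
print(compare_runs(baseline, bad_change))
```

Note that `bad_change` has the same pass rate as the baseline (two of three) while breaking a case that used to work, which is exactly what a single average hides.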
Tools to know (2026)
| Tool | What it is |
|---|---|
| OpenAI Evals | An early open-source framework and benchmark registry. Lots of patterns for offline batch evals. |
| Anthropic Workbench | Compare prompts head-to-head against test cases, in-console. |
| PromptLayer · Langfuse · LangWatch | SaaS observability + eval platforms. |
| promptfoo | Open-source CLI for prompt regression testing. Easy to drop into CI. |
| Phoenix · Braintrust | Trace + eval, end-to-end for agent pipelines. |
The eval mindset, in one paragraph
Common mistakes
- Testing on the examples you wrote into the prompt. They will pass. Test on examples you've never shown the model.
- One pass/fail score. Slice by input type — short / long / edge-case — or you'll miss a regression on one slice while another improves.
- LLM judge that's the same model as the producer. Self-grading is biased toward the model's own style; use a different model for judging, ideally from a different model family, and spot-check its grades against human labels.
- Drifting ground truth. When the model fails a case, don't "fix" the eval by relaxing the expected output. Fix the prompt.
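The "one pass/fail score" mistake is cheap to avoid if each case carries tags. A small sketch, assuming results arrive as `(tags, passed)` pairs; the tag names and the sample results are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pass_rate_by_slice(results: Iterable[Tuple[List[str], bool]]) -> Dict[str, float]:
    """Compute a pass rate per tag; a case with several tags counts in each slice."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for tags, passed in results:
        for tag in tags:
            totals[tag][0] += int(passed)
            totals[tag][1] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}


results = [
    (["short"], True),
    (["short"], True),
    (["long"], True),
    (["long"], False),
    (["edge"], False),
    (["edge"], False),
]

overall = sum(passed for _, passed in results) / len(results)
by_slice = pass_rate_by_slice(results)
print(f"overall: {overall:.2f}")
for tag, rate in by_slice.items():
    print(f"{tag}: {rate:.2f}")
```

An overall 0.50 looks like a mediocre model. The slices tell a different story: short inputs are solved, and edge cases fail every time.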