Evals — knowing your AI actually works

How to test AI output beyond 'looks fine to me'. The boring practice that separates hobby from production.

"The model usually gets it right" is fine for a chat with yourself. It's not fine when the model is writing emails to customers, classifying medical notes, or routing payments. Evals — short for evaluations — are how you find out, with numbers, whether the model does what you actually need.

What an eval actually is

An eval is three things:

A dataset of inputs your real users send (or close to it).
A grader that decides whether each output is good — code, a rubric, or another model.
A score — pass rate, F1, average grade, whatever your stakeholders care about.

Run the model on the dataset. Run the grader on the outputs. Look at the score. Change the prompt / model / pipeline. Run again. Did the score go up or down?
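
Here is a minimal sketch of that loop in Python. Everything in it is illustrative: `fake_model` is a hypothetical stand-in for your real model call, and the three-case dataset and label names are made up for the example.

```python
from typing import Callable

# Hypothetical dataset: each case pairs an input with a hand-labelled expected output.
DATASET = [
    {"input": "I want my money back", "expected": "refund"},
    {"input": "Where is my order?", "expected": "shipping"},
    {"input": "Cancel my subscription", "expected": "cancellation"},
]

def fake_model(text: str) -> str:
    """Stand-in for a real model call (assumption); swap in your own client."""
    lowered = text.lower()
    if "money back" in lowered:
        return "refund"
    if "order" in lowered:
        return "shipping"
    return "other"

def exact_match(output: str, expected: str) -> bool:
    """Grader: passes only when the output equals the label, ignoring case and padding."""
    return output.strip().lower() == expected.strip().lower()

def run_eval(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Run the model on every case, grade each output, and return the pass rate."""
    passed = sum(exact_match(model(case["input"]), case["expected"]) for case in dataset)
    return passed / len(dataset)

score = run_eval(fake_model, DATASET)
print(f"pass rate: {score:.2f}")
```

The stand-in model gets two of the three cases right, so the score is 2/3. Change the model or the prompt, call `run_eval` again, and compare.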

The four grader types, in order of increasing cost

Grader	When to use	Cost
Exact match	"Did it return JSON with the right keys?" "Is the answer in this set of allowed values?"	~0
Pattern / regex	"Does the email contain a subject line?" "Is there a phone number in the output?"	~0
LLM-as-judge	"Rate this reply 1–5 for warmth and accuracy."	~1 model call per item
Human review	Subjective, high-stakes, or first-time-seen tasks	Highest (reviewer time per item), but the most trustworthy

Use the cheapest grader that works. Most teams start with exact-match (was the JSON valid?) before reaching for LLM-as-judge. Always sample a few items for human review — graders themselves can be wrong.
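
The two cheapest grader types fit in a few lines of standard-library Python. This is a sketch, not a reference implementation: the function names, the key names, and the `Subject:` convention are assumptions chosen for illustration.

```python
import json
import re

def has_required_keys(output: str, required: set[str]) -> bool:
    """Exact-match style check: output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed.keys())

def in_allowed_values(output: str, allowed: set[str]) -> bool:
    """Exact-match style check: the trimmed output is one of the allowed labels."""
    return output.strip() in allowed

# Pattern check: some line starts with "Subject: " followed by non-empty text.
SUBJECT_LINE = re.compile(r"^Subject: \S.*$", re.MULTILINE)

def has_subject_line(email: str) -> bool:
    """Regex grader: does the email contain a subject line?"""
    return SUBJECT_LINE.search(email) is not None
```

Graders like these cost nothing to run, so they can check every output on every run. Save the judge model and the human reviewers for what code can't decide.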

How to assemble a good dataset

Start with 20 examples. Real ones, from logs if possible. 20 is enough to catch the most obvious regressions; more is better, but that can come later.
Include edge cases. Empty inputs. Adversarial inputs. Multilingual. Sarcasm. The longest one a user actually sent.
Include positives and negatives. "Should refuse" cases matter as much as "should comply" cases.
Hand-label expected outputs. 30 minutes of careful labelling beats 10,000 auto-generated examples.
Version it. Treat the eval set like code. Diff it. Review changes. Don't silently edit the ground truth when the model fails.
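
One way to make the set easy to diff and review is one JSON object per line (JSONL), checked into the same repo as the prompt. The sketch below assumes a hypothetical schema with `id`, `input`, `expected`, and `tags` fields; the cases are invented for illustration.

```python
import json

# Hypothetical contents of an eval file such as evals/cases.jsonl (path is an assumption).
RAW = """\
{"id": "refund-001", "input": "I want my money back", "expected": "refund", "tags": ["short"]}
{"id": "empty-001", "input": "", "expected": "ask_for_details", "tags": ["edge-case"]}
{"id": "refuse-001", "input": "Give me another customer's address", "expected": "refuse", "tags": ["should-refuse"]}
"""

REQUIRED_FIELDS = {"id", "input", "expected", "tags"}

def load_cases(text: str) -> list[dict]:
    """Parse JSONL eval cases and fail loudly on missing fields or duplicate ids."""
    cases = []
    seen_ids = set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between cases
        case = json.loads(line)
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases

cases = load_cases(RAW)
print(f"loaded {len(cases)} cases")
```

Stable ids mean that a change to an expected output shows up as a one-line diff a reviewer can question.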

The flywheel

User complaint or logged failure → add as eval case.
Change the prompt / model / tool to fix it.
Re-run the eval set. Score must go up without regressing other cases.
Ship. Watch new logs. Add new failures. Repeat.
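
The "score must go up without regressing other cases" rule can be written as a small gate. This is a sketch under one assumption: both runs were graded on the same eval set, and each result is a mapping from case id to pass/fail. The function names and case ids are hypothetical.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return ids of cases that passed in the baseline run but fail in the candidate run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )

def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)

def should_ship(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """Ship only if the score improved and no previously passing case broke."""
    improved = pass_rate(candidate) > pass_rate(baseline)
    return improved and not find_regressions(baseline, candidate)

# Illustrative results: the new case "cancel-001" failed before the fix.
baseline = {"refund-001": True, "shipping-001": True, "cancel-001": False}
fixed = {"refund-001": True, "shipping-001": True, "cancel-001": True}
traded = {"refund-001": False, "shipping-001": True, "cancel-001": True}

print(should_ship(baseline, fixed))         # fix with no regressions
print(find_regressions(baseline, traded))   # fix that broke an old case
```

The `traded` run fixes the new case but breaks `refund-001`, so the gate blocks it even though its overall score matches the baseline.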

Tools to know (2026)

Tool	What it is
OpenAI Evals	The original framework. Lots of patterns for offline batch evals.
Anthropic Workbench	Compare prompts head-to-head against test cases, in-console.
PromptLayer · Langfuse · LangWatch	SaaS observability + eval platforms.
promptfoo	Open-source CLI for prompt regression testing. Easy to drop into CI.
Phoenix · Braintrust	Trace + eval, end-to-end for agent pipelines.

The eval mindset, in one paragraph

Treat the prompt and the model as variables in an experiment. Treat the eval set as the fixed measuring instrument. Change one variable, measure the score, decide. If you can't draw a chart of score-over-time, you're not really iterating — you're vibe-coding on the model.

Common mistakes

Testing on the examples you wrote into the prompt. They will pass. Test on examples you've never shown the model.
One pass/fail score. Slice by input type — short / long / edge-case — or you'll miss a regression on one slice while another improves.
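LLM judge that's the same model as the producer. Self-grading is biased toward the model's own outputs; use a different model for judging, ideally one at least as capable as the producer, and spot-check its grades against human labels.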
Drifting ground truth. When the model fails a case, don't "fix" the eval by relaxing the expected output. Fix the prompt.
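
The slicing advice above can be sketched like this, assuming each graded result carries a hypothetical `tags` list alongside its pass/fail outcome. The results are invented for illustration.

```python
from collections import defaultdict

def pass_rate_by_slice(results: list[dict]) -> dict[str, float]:
    """Compute the pass rate separately for every tag that appears in the results."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for result in results:
        for tag in result["tags"]:
            totals[tag] += 1
            passes[tag] += int(result["passed"])
    return {tag: passes[tag] / totals[tag] for tag in sorted(totals)}

# Illustrative graded results; a case can belong to more than one slice.
results = [
    {"id": "a", "tags": ["short"], "passed": True},
    {"id": "b", "tags": ["short"], "passed": True},
    {"id": "c", "tags": ["long"], "passed": False},
    {"id": "d", "tags": ["long", "edge-case"], "passed": True},
]

overall = sum(r["passed"] for r in results) / len(results)
print(f"overall: {overall:.2f}")
for tag, rate in pass_rate_by_slice(results).items():
    print(f"{tag}: {rate:.2f}")
```

The overall score here is 0.75, which hides that the `long` slice passes only half the time.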

Do I need evals if I'm only building personal projects?

Not really — your own judgment is the grader. The moment another human relies on your AI's output, or you can't review every output by hand, you need evals.

How is this different from regular software testing?

Two real differences. (1) The same input can give different outputs (non-determinism), so your grader must tolerate variance — graders score quality, not exact equality. (2) Failures are usually about quality rather than crashes, so you need humans (or LLMs) judging open-ended outputs.
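
One common way to handle the non-determinism is to sample the same input several times and grade every sample, then compare the pass rate to a threshold instead of asserting on a single output. In the sketch below, `make_flaky_model` is a hypothetical stand-in that cycles through canned outputs so the example stays reproducible. The grader and the 0.9 threshold are assumptions too.

```python
from itertools import cycle
from typing import Callable

def make_flaky_model(outputs: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled model: returns canned outputs in rotation."""
    canned = cycle(outputs)
    return lambda prompt: next(canned)

def mentions_refund(output: str) -> bool:
    """Grader that tolerates wording differences: any mention of a refund passes."""
    return "refund" in output.lower()

def pass_rate_over_samples(
    model: Callable[[str], str],
    grader: Callable[[str], bool],
    prompt: str,
    n_samples: int = 8,
) -> float:
    """Call the model n_samples times on one prompt and return the share of passing outputs."""
    passes = sum(grader(model(prompt)) for _ in range(n_samples))
    return passes / n_samples

model = make_flaky_model(["refund", "Refund.", "This is a refund request", "other"])
rate = pass_rate_over_samples(model, mentions_refund, "I want my money back", n_samples=8)

THRESHOLD = 0.9  # assumed bar; pick one that matches the stakes of the task
print(f"pass rate {rate:.2f}, meets threshold: {rate >= THRESHOLD}")
```

Three of every four canned outputs pass, so the rate is 0.75 and the case falls short of the assumed threshold. A single-sample test would have passed three times out of four and hidden the problem.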

What about benchmarks like MMLU or HumanEval?

Public benchmarks measure model capability in general. Your eval set measures your system's quality on your task. A model that crushes MMLU can still fail your eval if your task is unusual. Build your own.