Learn AI
    navigate Enter open Esc close Open with K or /

    5 min

    Evals — knowing your AI actually works

    How to test AI output beyond 'looks fine to me'. The boring practice that separates hobby from production.

    "The model usually gets it right" is fine for a chat with yourself. It's not fine when the model is writing emails to customers, classifying medical notes, or routing payments. Evals — short for evaluations — are how you find out, with numbers, whether the model does what you actually need.

    What an eval actually is

    An eval is three things:

    1. A dataset of inputs your real users send (or close to it).
    2. A grader that decides whether each output is good — code, a rubric, or another model.
    3. A score — pass rate, F1, average grade, whatever your stakeholders care about.

    Run the model on the dataset. Run the grader on the outputs. Look at the score. Change the prompt / model / pipeline. Run again. Did the score go up or down?

    The four grader types, in increasing cost

    GraderWhen to useCost
    Exact match "Did it return JSON with the right keys?" "Is the answer in this set of allowed values?" ~0
    Pattern / regex "Does the email contain a subject line?" "Is there a phone number in the output?" ~0
    LLM-as-judge "Rate this reply 1–5 for warmth and accuracy." ~1 model call per item
    Human review Subjective, high-stakes, or first-time-seen task The most expensive but the truest

    Use the cheapest grader that works. Most teams start with exact-match (was the JSON valid?) before reaching for LLM-as-judge. Always sample a few items for human review — graders themselves can be wrong.

    How to assemble a good dataset

    • Start with 20 examples. Real ones, from logs if possible. 20 is enough to spot 80% of regressions; more is better but later.
    • Include edge cases. Empty inputs. Adversarial inputs. Multilingual. Sarcasm. The longest one a user actually sent.
    • Include positives and negatives. "Should refuse" cases matter as much as "should comply" cases.
    • Hand-label expected outputs. 30 minutes of careful labelling beats 10,000 auto-generated examples.
    • Version it. Treat the eval set like code. Diff it. Review changes. Don't silently edit the ground truth when the model fails.

    The flywheel

    1. User complaint or logged failure → add as eval case.
    2. Change the prompt / model / tool to fix it.
    3. Re-run the eval set. Score must go up without regressing other cases.
    4. Ship. Watch new logs. Add new failures. Repeat.

    Tools to know (2026)

    ToolWhat it is
    OpenAI EvalsThe original framework. Lots of patterns for offline batch evals.
    Anthropic WorkbenchCompare prompts head-to-head against test cases, in-console.
    PromptLayer · Langfuse · LangWatchSaaS observability + eval platforms.
    promptfooOpen-source CLI for prompt regression testing. Easy to drop into CI.
    Phoenix · BraintrustTrace + eval, end-to-end for agent pipelines.

    The eval mindset, in one paragraph

    Treat the prompt and the model as variables in an experiment. Treat the eval set as the fixed measuring instrument. Change one variable, measure the score, decide. If you can't draw a chart of score-over-time, you're not really iterating — you're vibe-coding on the model.

    Common mistakes

    • Testing on the examples you wrote into the prompt. They will pass. Test on examples you've never shown the model.
    • One pass/fail score. Slice by input type — short / long / edge-case — or you'll miss a regression on one slice while another improves.
    • LLM judge that's the same model as the producer. Self-grading is biased; use a different model (and ideally a smaller one) for judging.
    • Drifting ground truth. When the model fails a case, don't "fix" the eval by relaxing the expected output. Fix the prompt.
    Do I need evals if my use case is just my personal projects?
    Not really — your own judgment is the grader. The moment another human relies on your AI's output, or you can't review every output by hand, you need evals.
    How is this different from regular software testing?
    Two real differences. (1) The same input can give different outputs (non-determinism), so your grader must tolerate variance — graders score quality, not exact equality. (2) Failures are usually about quality rather than crashes, so you need humans (or LLMs) judging open-ended outputs.
    What about benchmarks like MMLU or HumanEval?
    Public benchmarks measure model capability in general. Your eval set measures your system's quality on your task. A model that crushes MMLU can still fail your eval if your task is unusual. Build your own.