Evals Are Unit Tests for Non-Deterministic Systems

Every engineer accepts that production code needs tests. Yet the same engineers will tweak a prompt, eyeball three examples in a playground, and ship it to millions of users. The reason is a category error: they treat prompts as configuration rather than code. That is a mistake. A prompt is a program written in English for a non-deterministic interpreter, and the only responsible way to change it is to measure the change against a suite of cases.

That suite is an eval. It is the unit test of LLM engineering — except instead of asserting add(2, 2) == 4, you assert that a probabilistic system produces acceptable output across a representative set of inputs. Get evals right and you can refactor prompts, swap models, and tune retrieval with the same confidence you'd refactor a function under a green test bar. Skip them and every deploy is a vibe.

Why "looks good to me" doesn't scale

Manual spot-checking fails for three structural reasons. First, non-determinism: the same prompt yields different outputs across runs, so a single pass tells you nothing about the distribution. Second, regression invisibility: a prompt edit that fixes one case silently breaks five you weren't looking at. Third, the long tail: your demo inputs are clean and friendly; production inputs are adversarial, malformed, and weird.

Evals convert subjective "feels better" into a number you can put on a dashboard and gate a deploy on. That number is what turns prompt engineering from craft into engineering.

The three kinds of eval

Not every eval needs an LLM judge. Reach for the cheapest mechanism that captures what you care about.

Deterministic / code-based checks. When correctness is verifiable, assert it directly. Does the output parse as valid JSON? Does it match the schema? Does the extracted date fall in a valid range? Does the SQL the model wrote actually execute? These are fast, free, and flake-free — write them first.
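As a rough illustration of the checks listed above, here is a minimal sketch using only the Python standard library. The function names, the ISO date format, and the choice of an in-memory SQLite database as the execution target are assumptions made for this example; substitute whatever schema and database your application really uses.

```python
import json
import sqlite3
from datetime import date

def is_valid_json_object(output: str) -> bool:
    """True if the output parses as a JSON object (not a list or scalar)."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False

def date_in_range(output: str, earliest: date, latest: date) -> bool:
    """True if the output is an ISO date (YYYY-MM-DD) within [earliest, latest]."""
    try:
        parsed = date.fromisoformat(output.strip())
    except ValueError:
        return False
    return earliest <= parsed <= latest

def sql_executes(query: str, schema_ddl: str) -> bool:
    """True if the query runs against an empty in-memory SQLite database
    created from schema_ddl. This checks validity, not result correctness."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(query)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()
```

Each function has the same shape: take the model's raw output, return a boolean. That makes them easy to mix into a single suite.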

Reference-based metrics. When you have a known-good answer, compare against it: exact match for classification, or embedding similarity for fuzzy semantic equivalence.
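A minimal sketch of both comparisons follows. It assumes you already have embedding vectors from whichever embedding model you use; the function names and the 0.85 similarity threshold are illustrative assumptions, and a sensible threshold has to be chosen by looking at your own data.

```python
import math
from typing import Sequence

def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after trimming and lowercasing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    if not references:
        return 0.0
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def semantically_close(output_vec: Sequence[float],
                       reference_vec: Sequence[float],
                       threshold: float = 0.85) -> bool:
    """Pass if the two embeddings are at least `threshold` similar.
    The default threshold is an assumption, not a recommended value."""
    return cosine_similarity(output_vec, reference_vec) >= threshold
```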

LLM-as-judge. When quality is genuinely subjective — tone, helpfulness, faithfulness to retrieved context — use a strong model to grade outputs against a rubric. Powerful, but the judge is itself a non-deterministic system you must validate. Both OpenAI's evals guidance and Anthropic's docs on testing and evaluation walk through these patterns.

Default to deterministic checks. Every assertion you can express in code instead of an LLM judge is faster, cheaper, and not subject to a judge model's biases. Save the judge for the genuinely subjective dimensions.

Start with deterministic assertions

Here's the shape of a code-based eval harness. No framework is required to begin: a list of cases and a runner get you most of the value.

import json
from dataclasses import dataclass
from typing import Callable
 
@dataclass
class Case:
    name: str
    inputs: dict
    check: Callable[[str], bool]
 
def extracts_valid_json(output: str) -> bool:
    try:
        data = json.loads(output)
        return "amount" in data and isinstance(data["amount"], (int, float))
    except json.JSONDecodeError:
        return False
 
cases = [
    Case("simple_invoice", {"text": "Total due: $420.00"}, extracts_valid_json),
    Case("messy_invoice",  {"text": "amt ~ 420 bucks, ish"}, extracts_valid_json),
    Case("empty_input",    {"text": ""},                     extracts_valid_json),
]
 
def run_suite(model_fn):
    passed = 0
    for c in cases:
        out = model_fn(c.inputs)
        ok = c.check(out)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {c.name}")
    print(f"\n{passed}/{len(cases)} passed")
    return passed / len(cases)

Run that on every prompt change. The moment a "harmless" edit drops you from 3/3 to 2/3, you've caught a regression that spot-checking would have shipped.
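The runner above calls the model once per case, which does not address the non-determinism point made earlier: one pass says little about the distribution of outputs. One possible extension, sketched here under the assumption that repeated calls are affordable, is to sample each case several times and report a per-case pass rate. The `pass_rate` helper and the `flaky_model` stub are hypothetical names for this illustration.

```python
import itertools
from typing import Callable

def pass_rate(model_fn: Callable[[dict], str],
              inputs: dict,
              check: Callable[[str], bool],
              trials: int = 10) -> float:
    """Call the model `trials` times on one input and return the fraction
    of outputs that pass the check."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(bool(check(model_fn(inputs))) for _ in range(trials))
    return passes / trials

# Stand-in for a real model call: alternates between a good and a bad answer.
_answers = itertools.cycle(['{"amount": 420}', "about 420"])

def flaky_model(inputs: dict) -> str:
    return next(_answers)

rate = pass_rate(flaky_model, {"text": "Total due: $420.00"},
                 lambda out: out.startswith("{"), trials=10)
print(f"simple_invoice pass rate: {rate:.0%}")
```

A case that passes only half the time would show up as PASS or FAIL at random in a single-run suite; the rate makes the flakiness visible.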

Building an LLM-as-judge that you can trust

For subjective dimensions, write a rubric, not a vibe. Give the judge a narrow question, a scale, and explicit criteria. Ask for a structured verdict so you can aggregate it.

JUDGE_PROMPT = """You are grading whether an answer is faithful to the provided context.
Faithful means: every factual claim in the ANSWER is supported by the CONTEXT.
Do not reward fluency. Penalize any claim not grounded in CONTEXT.
 
CONTEXT:
{context}
 
ANSWER:
{answer}
 
Respond with JSON: {{"faithful": true|false, "unsupported_claims": [..]}}"""
 
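def judge_faithfulness(client, context: str, answer: str) -> dict:
    # `client` is assumed to be an Anthropic SDK client, e.g. anthropic.Anthropic().
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
    )
    # json.loads raises json.JSONDecodeError if the judge wraps the JSON in extra prose.
    return json.loads(resp.content[0].text)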

The critical, often-skipped step: validate the judge against human labels. Hand-grade 50–100 outputs, run the judge on the same set, and measure agreement. If the judge disagrees with you 30% of the time, its scores are too noisy to gate anything on. Tune the rubric until agreement is high before you trust its numbers to gate deploys.
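Measuring agreement can be as simple as the sketch below, which assumes binary labels (faithful or not) stored as parallel lists. Raw agreement is the number referred to above. Cohen's kappa is included as an optional extra, not something the workflow requires; it corrects for the agreement two raters would reach by chance, which matters when one label dominates the set.

```python
from typing import Sequence

def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of items where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Chance-corrected agreement for binary labels."""
    n = len(human)
    observed = agreement_rate(human, judge)
    p_human_true = sum(human) / n
    p_judge_true = sum(judge) / n
    # Probability the two raters agree if each labeled at random
    # with their own observed label frequencies.
    expected = (p_human_true * p_judge_true
                + (1 - p_human_true) * (1 - p_judge_true))
    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
print(agreement_rate(human_labels, judge_labels))
print(cohens_kappa(human_labels, judge_labels))
```

Reading through the individual disagreements is usually more useful than the summary number: each one points at either an ambiguous rubric or a mistaken human label.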

Curate the dataset like it's the product

Your eval is only as good as its cases. A suite of ten happy-path examples gives false confidence. Build the dataset deliberately:

Seed from real production traffic — anonymized real inputs beat synthetic ones.
Mine your failures — every bug report and bad output becomes a permanent regression case. This is how the suite compounds in value.
Cover the adversarial tail — empty inputs, injection attempts, wrong-language text, enormous inputs, ambiguous requests.
Keep it versioned in git alongside the code, reviewed in PRs.

Frameworks like LangChain's evaluation tooling help you scale this once the suite grows, but the dataset itself is your moat, not the framework.

Treat every production incident as a test case. The first time a bad output reaches a user may be unavoidable. The second time is a process failure, because an eval case would have caught it.
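One way to keep the dataset versioned and growing is a plain JSONL file checked into the repository, one case per line, so each new case shows up as a one-line diff in review. The sketch below assumes that layout; the field names (`name`, `inputs`, `source`) and both helper functions are hypothetical, not part of any particular framework.

```python
import json
from pathlib import Path

def load_cases(path: Path) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def add_regression_case(path: Path, name: str, inputs: dict, source: str) -> None:
    """Append a case to the dataset, refusing duplicate names.

    `source` records where the case came from (e.g. an incident or bug
    report reference) so reviewers can see why it exists.
    """
    existing = load_cases(path) if path.exists() else []
    if any(case["name"] == name for case in existing):
        raise ValueError(f"case {name!r} already exists")
    record = {"name": name, "inputs": inputs, "source": source}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending, not rewriting, keeps the git history of the file aligned with the history of your failures.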

Wire evals into CI

The payoff is automation. Run the suite on every pull request that touches a prompt, a model version, or retrieval logic. Print a scoreboard, and fail the build if the pass rate drops below a threshold or regresses against the base branch.

# In CI: run evals, fail the build on regression
python -m evals.run --suite extraction --min-pass-rate 0.95 \
  || { echo "Eval regression — blocking merge"; exit 1; }

Now a prompt change is reviewable like any other diff: the reviewer sees the score moved from 94% to 97%, not just that a string changed. That is the whole point — restoring engineering discipline to a probabilistic system.
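The gating logic itself is small. The following is a sketch of what a script behind such a command might do, not the actual `evals.run` module, which is not shown in this article. It assumes the pass rate has already been computed and that the base branch's score is supplied by the CI job; the `--pass-rate` and `--baseline` flags and the `gate` function are hypothetical.

```python
import argparse
import sys
from typing import Optional, Sequence

def gate(pass_rate: float, min_pass_rate: float,
         baseline: Optional[float] = None, tolerance: float = 0.0) -> list:
    """Return the reasons to block the merge; an empty list means the gate passes."""
    reasons = []
    if pass_rate < min_pass_rate:
        reasons.append(
            f"pass rate {pass_rate:.1%} is below the threshold {min_pass_rate:.1%}")
    if baseline is not None and pass_rate < baseline - tolerance:
        reasons.append(
            f"pass rate {pass_rate:.1%} regressed from the baseline {baseline:.1%}")
    return reasons

def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Gate a merge on eval results.")
    parser.add_argument("--pass-rate", type=float, required=True)
    parser.add_argument("--min-pass-rate", type=float, default=0.95)
    parser.add_argument("--baseline", type=float, default=None)
    args = parser.parse_args(argv)

    reasons = gate(args.pass_rate, args.min_pass_rate, args.baseline)
    for reason in reasons:
        print(f"Eval gate failed: {reason}", file=sys.stderr)
    # A non-zero exit code is what makes the CI step fail.
    return 1 if reasons else 0
```

In a real script you would end with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard. A small `tolerance` can help avoid blocking merges on run-to-run noise when the suite includes non-deterministic cases.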

Takeaways

A prompt is code; changing it without evals is shipping untested code.
Prefer deterministic, code-based assertions; reserve LLM-as-judge for genuinely subjective quality.
Validate any LLM judge against human labels before trusting its scores.
Your eval dataset is the asset — seed it from real traffic and grow it from every failure.
Run evals in CI and gate merges on the pass rate so regressions can't ship.
