// vibes are not evals

How to Actually Evaluate an LLM Application

"Looks good to me" is the most expensive line in AI engineering. It's how a feature that works on five demo questions makes it to production, where it fails on the sixth one and your support inbox explains why.

Here are the five evaluation mistakes I see most often, and the fixes that have actually held up.

1. Evaluating on the prompts you wrote

Your eval set should not be the examples you used to design the prompt. They will pass. They have to. You wrote the prompt to make them pass.

The fix is to build a holdout set from real production traffic (or synthetic equivalents) before you start prompt engineering. Lock it. Don't peek. Run it again at the end.
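One way to make "lock it" concrete is a deterministic split plus a checksum. The sketch below is a minimal illustration, not a prescribed tool: the function names, the `id` field, and the 20% holdout fraction are all assumptions. It hashes each example's ID so the same examples land in the holdout on every run, and it fingerprints the set so a quiet edit shows up as a changed checksum.

```python
import hashlib
import json


def in_holdout(example_id: str, fraction: float = 0.2) -> bool:
    """Assign an example to the holdout by hashing its ID (stable across runs)."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # value in [0, 1)
    return bucket < fraction


def split_examples(examples: list[dict], fraction: float = 0.2):
    """Return (dev, holdout). Tune prompts against dev only."""
    dev = [e for e in examples if not in_holdout(e["id"], fraction)]
    holdout = [e for e in examples if in_holdout(e["id"], fraction)]
    return dev, holdout


def fingerprint(examples: list[dict]) -> str:
    """Order-independent checksum of the set; store it next to the data."""
    canonical = json.dumps(sorted(examples, key=lambda e: e["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Record the fingerprint when you freeze the set and compare it before the final run. If it changed, someone peeked or edited.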

2. Using accuracy when you should use cost-weighted accuracy

Not all wrong answers are equal. A wrong answer that quietly hallucinates a citation is much worse than a wrong answer that says "I don't know." Your eval should reflect that.

In practice: bucket failures into silent and loud. Weight each silent failure 5x or 10x as heavily as a loud one. The number you optimize for changes accordingly.
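Here is a rough sketch of what that weighting could look like. The `Result` type, the `silent` flag, and the function names are hypothetical, and the 5x weight is just the lower of the two multipliers mentioned above. It reports the mean failure cost per example (lower is better) alongside plain accuracy, so you can see two systems tie on one and diverge on the other.

```python
from dataclasses import dataclass


@dataclass
class Result:
    correct: bool
    silent: bool = False  # wrong AND confident, e.g. an invented citation


def accuracy(results: list[Result]) -> float:
    return sum(r.correct for r in results) / len(results)


def mean_failure_cost(results: list[Result],
                      silent_weight: float = 5.0,
                      loud_weight: float = 1.0) -> float:
    """Average cost per example: 0 if correct, else the weight of the failure type."""
    total = 0.0
    for r in results:
        if not r.correct:
            total += silent_weight if r.silent else loud_weight
    return total / len(results)


# Two systems with identical accuracy (8 of 10 correct) but different failure types.
loud_system = [Result(True)] * 8 + [Result(False, silent=False)] * 2
silent_system = [Result(True)] * 8 + [Result(False, silent=True)] * 2
```

Both systems score 80% on accuracy. Under the weighted metric the second one costs five times as much, which is the difference the plain number was hiding.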

3. LLM-as-judge without spot checks

Using GPT-4 to grade GPT-4 is great, until you find out the judge consistently approves a failure mode that your humans would catch. Always sample 50 graded examples per release and re-grade them by hand. If your judge agrees with you less than 90% of the time, your eval scores are fiction.
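The spot check is simple enough to script. The following is a sketch under assumptions: labels are plain strings like "pass" and "fail", the helper names are invented for illustration, and the sample size and 90% threshold are the ones from the paragraph above. It draws a reproducible sample, computes how often the judge and the human agree, and pulls out the cases where the judge approved something a human rejected.

```python
import random


def sample_for_review(graded: list[dict], k: int = 50, seed: int = 0) -> list[dict]:
    """Pick up to k judge-graded examples for a human to re-grade."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(graded, min(k, len(graded)))


def agreement_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs holds (judge_label, human_label) for each re-graded example."""
    if not pairs:
        raise ValueError("no re-graded examples to compare")
    return sum(judge == human for judge, human in pairs) / len(pairs)


def false_approvals(pairs: list[tuple[str, str]]) -> int:
    """Count cases where the judge said pass and the human said fail."""
    return sum(judge == "pass" and human == "fail" for judge, human in pairs)


def judge_is_trustworthy(pairs: list[tuple[str, str]], threshold: float = 0.90) -> bool:
    return agreement_rate(pairs) >= threshold
```

The false-approval count matters more than the headline rate. A judge that disagrees with you at random is noisy; one that disagrees in the same direction every time is biased.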

4. One score, no breakdowns

"75% accuracy" tells you almost nothing. Per-intent, per-language, per-difficulty, per-customer-segment breakdowns tell you everything. The first time you split a number, you'll find a regression hiding in plain sight.

Averages are where bugs go to hide.
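To see how an average hides a bug, here is a small sketch. The record layout (`correct` plus a `language` field), the function name, and the numbers are made up for illustration: 30 English examples with 27 correct and 10 German examples with 3 correct.

```python
from collections import defaultdict


def accuracy_by(results: list[dict], key: str) -> dict[str, float]:
    """Accuracy per distinct value of `key` (e.g. language, intent, segment)."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for r in results:
        bucket = counts[r[key]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {segment: c / n for segment, (c, n) in counts.items()}


# Illustrative data only.
results = (
    [{"language": "en", "correct": True}] * 27
    + [{"language": "en", "correct": False}] * 3
    + [{"language": "de", "correct": True}] * 3
    + [{"language": "de", "correct": False}] * 7
)

overall = sum(r["correct"] for r in results) / len(results)
per_language = accuracy_by(results, "language")
```

Overall accuracy here is 75%. Split by language, it is 90% for English and 30% for German. Same data, very different conversation.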

5. No regression tracking

If your CI doesn't run the eval set on every prompt change and block on regressions, you are evaluating once and then drifting forever. The hard part is not running the eval. It's setting a threshold the team will honor when a PR misses it.
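A regression gate can be as small as the sketch below. Everything in it is an assumption for illustration: the metric names, the idea of storing a baseline as a dict of scores, and the 0.02 tolerance, which is a placeholder rather than a recommendation. It compares the current run to the baseline metric by metric and returns a nonzero exit code if anything dropped by more than the tolerance.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     max_drop: float = 0.02) -> list[str]:
    """Return one message per metric that dropped by more than max_drop."""
    problems = []
    for name, base in baseline.items():
        if name not in current:
            problems.append(f"{name}: missing from current run")
            continue
        drop = base - current[name]
        if drop > max_drop:
            problems.append(f"{name}: {base:.3f} -> {current[name]:.3f}")
    return problems


def gate(baseline: dict[str, float], current: dict[str, float],
         max_drop: float = 0.02) -> int:
    """Exit code for CI: 0 if clean, 1 if any metric regressed."""
    problems = find_regressions(baseline, current, max_drop)
    for p in problems:
        print("REGRESSION", p)
    return 1 if problems else 0


# In a CI script you might end with: raise SystemExit(gate(baseline, current))
```

Gating on per-segment metrics as well as the overall score means a segment can't regress while the average stays flat.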

What good looks like

A solid LLM evaluation setup, in order of how much you'll regret skipping each:

  1. A frozen holdout set (~100–500 examples) built before tuning.
  2. Cost-weighted scoring that penalizes silent failures more heavily than loud ones.
  3. Per-segment breakdowns reported in CI.
  4. A spot-check ritual: 50 examples re-graded by hand, every release.
  5. A regression gate that fails the build on a measurable drop.

None of this is exciting. All of it pays for itself the first time you avoid a bad ship.
