Explore
Can you tell these two models apart?
Two models scored differently on your eval. Move the sliders to change the gap between them and the number of test examples. Watch what happens to the confidence intervals.
↓ what's actually happening here
The Problem
Point estimates hide how little you know
You run your eval. 68.2% accuracy. You change something, run it again. 67.9%. You pick the higher number.
Both numbers are estimates from a sample. If you ran a different sample of the same size, you'd get different numbers. The true performance of each variant is somewhere in a range around what you measured—and with typical eval set sizes, that range is much wider than people expect.
Version A
68.2%
vs
Version B
67.9%
On 200 examples, A's 95% CI is roughly 62–74%. B's is roughly 61–74%. These ranges almost completely overlap.
In Practice
This comes up constantly in agentic work
Any time you change a component and re-run evals, this applies. Three common cases:
Switching from GPT-5.4 to Sonnet 4.6, or comparing two checkpoints of the same model. The eval set is finite. A 0.5% difference could just be which examples happened to be in your test split.
You rewrote the system prompt or added chain-of-thought. The score moved a point. On 100 examples, the 95% margin of error for a 68% score is ±9 percentage points. One point is well inside the noise.
You gave the agent web search or a new API. Task completion went up 2 points on 150 examples. The margin of error at that sample size is about ±7.5%. You can't distinguish that 2-point gain from noise.
Same underlying issue every time: a single number from a finite sample doesn't carry enough information to compare two things.
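The margins quoted in the three cases above come from the normal approximation to the binomial. A minimal sketch in Python (the function name is mine, not a library API):

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a pass rate p measured on n examples
    (normal approximation to the binomial)."""
    return z * math.sqrt(p * (1 - p) / n)

# The scenarios above:
print(round(margin_of_error(0.68, 100), 3))  # prompt change, n=100 -> 0.091 (±9 pts)
print(round(margin_of_error(0.65, 150), 3))  # new tool, n=150 -> 0.076 (±7.5 pts)
```

Both margins dwarf the 1- and 2-point differences being measured.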
↓ it's not just individual decisions
Beyond One-Off Comparisons
This applies to your eval pipeline too
If you're running evals on a regular cadence—on every commit, nightly, or as part of a release gate—the same thing applies. Each run produces a score from a finite sample, and the variance between runs can easily mask or manufacture trends.
A regression dashboard that shows score dropping from 71.3% to 69.8% across two nightly runs looks like something broke. But if the 95% interval on both runs is ±6 points, that movement is well within normal fluctuation. Without intervals, you're either chasing false regressions or missing real ones.
Whether you already have this infrastructure or you're building it, baking confidence intervals into the pipeline output is what turns eval scores into something you can actually make decisions on.
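One way to sketch that pipeline logic, with a regression flagged only when the two runs' intervals are actually separated (function names, threshold, and the sample size of 230 are my assumptions for illustration):

```python
import math

def interval(p, n, z=1.96):
    """95% confidence interval for a pass rate p on n examples."""
    m = z * math.sqrt(p * (1 - p) / n)
    return (p - m, p + m)

def flag_regression(prev_score, curr_score, n):
    """Flag only when the current run's interval sits entirely below the previous one."""
    prev_lo, _ = interval(prev_score, n)
    _, curr_hi = interval(curr_score, n)
    return curr_hi < prev_lo

# The nightly-run example above: 71.3% -> 69.8%, assuming ~230 examples
# (which gives roughly the ±6-point intervals described)
print(flag_regression(0.713, 0.698, 230))  # -> False: within normal fluctuation
```

A gate like this trades a little sensitivity for a large reduction in false alarms, which is usually the right trade for a nightly dashboard.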
What To Do
Report confidence intervals
Two things make this tractable: enough eval examples (200–500+ for typical pass/fail evals) and reporting the interval instead of just the point estimate.
A confidence interval gives you the range where the true performance likely falls. If two models' intervals overlap, the measured difference is within the noise—you don't have evidence that one is better. If the intervals are separated, you do.
How It Works
Three lines of math
Step 1 — Standard Error
SE = std / √n
How much the measured score would vary if you re-ran the eval on a different sample of the same size. For binary (pass/fail) evals, std = √(p·(1−p)), so this is fully determined by the score and sample size.
Step 2 — Margin of Error
margin = 1.96 × SE
Multiplying by 1.96 gives a 95% confidence interval—if you repeated this eval 100 times, roughly 95 of those intervals would contain the true score. Standard choice; nothing magic about it.
Step 3 — The Interval
CI = score ± margin
The score becomes a range. Compare the ranges, not the point estimates.
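The three steps, applied to the Version A numbers from earlier. A sketch assuming a binary pass/fail eval, so std = √(p·(1−p)):

```python
import math

def confidence_interval(score, n, z=1.96):
    # Step 1: standard error (binary eval: std = sqrt(p * (1 - p)))
    se = math.sqrt(score * (1 - score) / n)
    # Step 2: margin of error
    margin = z * se
    # Step 3: the interval
    return (score - margin, score + margin)

low, high = confidence_interval(0.682, 200)  # Version A: 68.2% on 200 examples
print(f"{low:.1%} to {high:.1%}")  # -> 61.7% to 74.7%
```

Run the same function on Version B (0.679, 200) and the two ranges land almost on top of each other, which is exactly the overlap described above.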
Model | Score | Low | High | Range
These values update live with the explorer above. Scroll back up and move the sliders to see the table change.
↓ why agents make this harder
For Agentic Systems
Higher variance means wider intervals
Classification tasks have relatively low per-example variance. Agentic tasks don't: a coding agent might solve the same task via three different paths, and a research agent might get lucky on its first search. Run-to-run variance is higher, which means confidence intervals are wider at the same sample size.
Most agentic benchmarks report results on 50–200 examples with no error bars. At n=100 with a 65% score, the 95% CI is roughly ±9 points. That's a range from 56% to 74%. Claiming one system beats another by 2 points in that regime is not a meaningful statement.
The practical takeaway: report intervals, not points. If your intervals overlap, either collect more data or accept that you can't distinguish the two variants yet. Calibrated confidence beats false precision.
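For the "collect more data" branch, inverting the margin formula gives a rough sense of how much data. A sketch under the same normal approximation (function name mine); distinguishing two systems properly needs a two-sample comparison, but this already shows the order of magnitude:

```python
import math

def n_for_margin(p, target_margin, z=1.96):
    """Examples needed so the 95% margin of error is at most target_margin."""
    return math.ceil(z**2 * p * (1 - p) / target_margin**2)

# Near a 65% score, shrinking the margin to just 2 points (let alone
# small enough to resolve a 2-point gap) already requires:
print(n_for_margin(0.65, 0.02))  # -> 2185 examples
```

That's an order of magnitude more than the 50-200 examples most agentic benchmarks use, which is why overlapping intervals are the normal case there.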