Explore
Can you tell these two models apart?
Two models scored differently on your eval. Move the sliders to change the gap between them and the number of test examples. Watch what happens to the confidence intervals.
↓ what's actually happening here
The Problem
Point estimates hide how little you know
You run your eval. 68.2% accuracy. You change something, run it again. 67.9%. You pick the higher number.
Both numbers are estimates from a sample. If you ran a different sample of the same size, you'd get different numbers. The true performance of each variant is somewhere in a range around what you measured—and with typical eval set sizes, that range is much wider than people expect.
Version A
68.2%
vs
Version B
67.9%
On 200 examples, A's 95% CI is roughly 62–74%. B's is roughly 61–74%. These ranges almost completely overlap.
In Practice
This comes up constantly in agentic work
Any time you change a component and re-run evals, this applies. Three common cases:
Switching from GPT-5.4 to Sonnet 4.6, or comparing two checkpoints of the same model. The eval set is finite. A 0.5% difference could just be which examples happened to be in your test split.
You rewrote the system prompt or added chain-of-thought. The score moved a point. On 100 examples, the 95% margin of error for a 68% score is ±9 percentage points. One point is well inside the noise.
You gave the agent web search or a new API. Task completion went up 2 points on 150 examples. The margin of error at that sample size is about ±7.5%. You can't distinguish that 2-point gain from noise.
Same underlying issue every time: a single number from a finite sample doesn't carry enough information to compare two things.
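The margins quoted in the three cases above come from the normal approximation to the binomial. A minimal sketch in Python (the function name is mine, not a library API):

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a pass rate p measured on n examples
    (normal approximation to the binomial)."""
    return z * math.sqrt(p * (1 - p) / n)

# The scenarios above:
print(round(margin_of_error(0.68, 100), 3))  # prompt change, n=100 -> 0.091 (±9 pts)
print(round(margin_of_error(0.65, 150), 3))  # new tool, n=150 -> 0.076 (±7.5 pts)
```

Both margins dwarf the 1- and 2-point differences being measured.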
↓ it's not just individual decisions
Beyond One-Off Comparisons
This applies to your eval pipeline too
If you're running evals on a regular cadence—on every commit, nightly, or as part of a release gate—the same thing applies. Each run produces a score from a finite sample, and the variance between runs can easily mask or manufacture trends.
A regression dashboard that shows score dropping from 71.3% to 69.8% across two nightly runs looks like something broke. But if the 95% interval on both runs is ±6 points, that movement is well within normal fluctuation. Without intervals, you're either chasing false regressions or missing real ones.
Whether you already have this infrastructure or you're building it, baking confidence intervals into the pipeline output is what turns eval scores into something you can actually make decisions on.
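One way to sketch that pipeline logic, with a regression flagged only when the two runs' intervals are actually separated (function names, threshold, and the sample size of 230 are my assumptions for illustration):

```python
import math

def interval(p, n, z=1.96):
    """95% confidence interval for a pass rate p on n examples."""
    m = z * math.sqrt(p * (1 - p) / n)
    return (p - m, p + m)

def flag_regression(prev_score, curr_score, n):
    """Flag only when the current run's interval sits entirely below the previous one."""
    prev_lo, _ = interval(prev_score, n)
    _, curr_hi = interval(curr_score, n)
    return curr_hi < prev_lo

# The nightly-run example above: 71.3% -> 69.8%, assuming ~230 examples
# (which gives roughly the ±6-point intervals described)
print(flag_regression(0.713, 0.698, 230))  # -> False: within normal fluctuation
```

A gate like this trades a little sensitivity for a large reduction in false alarms, which is usually the right trade for a nightly dashboard.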
What To Do
Report confidence intervals
Two things make this tractable: enough eval examples (200–500+ for typical pass/fail evals) and reporting the interval instead of just the point estimate.
A confidence interval gives you the range where the true performance likely falls. If two models' intervals overlap, the measured difference is within the noise—you don't have evidence that one is better. If the intervals are separated, you do.
How It Works
Three lines of math
Step 1 — Standard Error
SE = std / √n
How much the measured score would vary if you re-ran the eval on a different sample of the same size. For binary (pass/fail) evals, std = √(p·(1−p)), so this is fully determined by the score and sample size.
Step 2 — Margin of Error
margin = 1.96 × SE
Multiplying by 1.96 gives a 95% confidence interval—if you repeated this eval 100 times, roughly 95 of those intervals would contain the true score. Standard choice; nothing magic about it.
Step 3 — The Interval
CI = score ± margin
The score becomes a range. Compare the ranges, not the point estimates.
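The three steps, applied to the Version A numbers from earlier. A sketch assuming a binary pass/fail eval, so std = √(p·(1−p)):

```python
import math

def confidence_interval(score, n, z=1.96):
    # Step 1: standard error (binary eval: std = sqrt(p * (1 - p)))
    se = math.sqrt(score * (1 - score) / n)
    # Step 2: margin of error
    margin = z * se
    # Step 3: the interval
    return (score - margin, score + margin)

low, high = confidence_interval(0.682, 200)  # Version A: 68.2% on 200 examples
print(f"{low:.1%} to {high:.1%}")  # -> 61.7% to 74.7%
```

Run the same function on Version B (0.679, 200) and the two ranges land almost on top of each other, which is exactly the overlap described above.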
Model | Score | Low | High | Range
These values update live with the explorer above. Scroll back up and move the sliders to see the table change.
↓ why agents make this harder
For Agentic Systems
Higher variance means wider intervals
Classification tasks have relatively low per-example variance. Agentic tasks don't: a coding agent might solve the same task via three different paths, and a research agent might get lucky on its first search. Run-to-run variance is higher, which means confidence intervals are wider at the same sample size.
Most agentic benchmarks report results on 50–200 examples with no error bars. At n=100 with a 65% score, the 95% CI is roughly ±9 points. That's a range from 56% to 74%. Claiming one system beats another by 2 points in that regime is not a meaningful statement.
The practical takeaway: report intervals, not points. If your intervals overlap, either collect more data or accept that you can't distinguish the two variants yet. Calibrated confidence beats false precision.
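For the "collect more data" branch, inverting the margin formula gives a rough sense of how much data. A sketch under the same normal approximation (function name mine); distinguishing two systems properly needs a two-sample comparison, but this already shows the order of magnitude:

```python
import math

def n_for_margin(p, target_margin, z=1.96):
    """Examples needed so the 95% margin of error is at most target_margin."""
    return math.ceil(z**2 * p * (1 - p) / target_margin**2)

# Near a 65% score, shrinking the margin to just 2 points (let alone
# small enough to resolve a 2-point gap) already requires:
print(n_for_margin(0.65, 0.02))  # -> 2185 examples
```

That's an order of magnitude more than the 50-200 examples most agentic benchmarks use, which is why overlapping intervals are the normal case there.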