The Quiet Cost of Average Accuracy
A model with 92% accuracy on a balanced test set sounds impressive. The same model, in production, can break for 40% of one customer segment while looking healthy on the dashboard. Average accuracy is the most common metric in ML and the one that hides the most failures.
The math is innocent
Average accuracy weighs every example the same. Real users don't. Your top 100 customers might generate 5% of your eval examples and 60% of your revenue. A model that fails on their segment but does well everywhere else still posts a fine average, because a segment that makes up 5% of the examples can move the average by at most five points.
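To make the arithmetic concrete, here is a small sketch. The segment names, sizes, and correct counts are invented for illustration; only the 5% share of examples comes from the paragraph above.

```python
# Hypothetical eval set: every number here is made up for illustration.
segments = {
    "top_100_customers": {"n": 100, "correct": 40},     # 5% of examples, 40% accurate
    "everyone_else":     {"n": 1900, "correct": 1805},  # 95% of examples, 95% accurate
}

total = sum(s["n"] for s in segments.values())
average_accuracy = sum(s["correct"] for s in segments.values()) / total

# Per-segment accuracy, which the single average never shows.
per_segment = {name: s["correct"] / s["n"] for name, s in segments.items()}

print(f"average: {average_accuracy:.2%}")
for name, acc in per_segment.items():
    print(f"{name}: {acc:.2%}")
```

With these made-up numbers the average lands a little above 92%, even though the model is wrong on most examples from the smaller segment.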
Three things to track instead
Worst-segment accuracy. Split your eval set by every meaningful dimension you have (customer tier, language, document type, time of day) and report the worst score. That number won't lie to you.
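One way to compute this, as a minimal sketch: it assumes each eval record is a dict with a boolean "correct" field plus one key per dimension. The function name, the record layout, and the sample data are all hypothetical.

```python
from collections import defaultdict

def worst_segment_accuracy(records, dimensions):
    """Return (accuracy, dimension, value) for the lowest-scoring segment.

    records: list of dicts with a boolean "correct" plus one key per dimension.
    dimensions: the keys to slice by, e.g. ["tier", "language"].
    """
    # (dimension, value) -> [number correct, number seen]
    tallies = defaultdict(lambda: [0, 0])
    for row in records:
        for dim in dimensions:
            tally = tallies[(dim, row[dim])]
            tally[0] += int(row["correct"])
            tally[1] += 1
    scored = [
        (correct / seen, dim, value)
        for (dim, value), (correct, seen) in tallies.items()
    ]
    return min(scored, key=lambda item: item[0])

# Hypothetical eval records, invented for illustration.
records = [
    {"tier": "enterprise", "language": "en", "correct": False},
    {"tier": "enterprise", "language": "de", "correct": False},
    {"tier": "enterprise", "language": "en", "correct": True},
    {"tier": "free", "language": "en", "correct": True},
    {"tier": "free", "language": "en", "correct": True},
    {"tier": "free", "language": "de", "correct": True},
]

overall = sum(r["correct"] for r in records) / len(records)
worst = worst_segment_accuracy(records, ["tier", "language"])
print(f"overall: {overall:.2f}, worst: {worst[0]:.2f} ({worst[1]}={worst[2]})")
```

In this toy data, four of six predictions are correct overall, while the enterprise tier gets one of three. On a real eval set, very small segments give noisy scores, so it can help to report the segment size next to the number.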
Cost-weighted error. Define what a failure costs in business terms. A wrong recommendation that leads to a refund costs more than one the user shrugs at. Weight failures accordingly.
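A minimal sketch of the idea, assuming every prediction has already been labeled with an outcome. The outcome labels ("ok", "refund", "shrug") and the dollar costs are hypothetical placeholders; real costs have to come from your own business data.

```python
def cost_weighted_error(outcomes, cost_by_failure):
    """Average cost per example. Correct predictions ("ok") cost nothing."""
    total_cost = sum(cost_by_failure[o] for o in outcomes if o != "ok")
    return total_cost / len(outcomes)

# Hypothetical costs per failure type, in dollars.
COSTS = {"refund": 25.0, "shrug": 0.5}

# Two made-up models evaluated on 100 examples each.
model_a = ["ok"] * 90 + ["shrug"] * 10   # 90% accurate, cheap mistakes
model_b = ["ok"] * 94 + ["refund"] * 6   # 94% accurate, expensive mistakes

cost_a = cost_weighted_error(model_a, COSTS)
cost_b = cost_weighted_error(model_b, COSTS)
print(f"model A: ${cost_a:.2f} per example")
print(f"model B: ${cost_b:.2f} per example")
```

Under these assumed costs, the model with the higher accuracy is the more expensive one to run, which is the kind of reversal plain accuracy cannot show.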
Calibration. When your model says it's 90% sure, is it right 90% of the time? Models can be accurate on average and badly mis-calibrated, which makes them dangerous to gate decisions on.
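One common way to put a number on this is expected calibration error (ECE): bucket predictions by confidence, then compare each bucket's average confidence to its actual accuracy. The sketch below assumes equal-width bins and a per-prediction confidence score between 0 and 1. The post does not prescribe a specific calibration metric, so treat this as one option, with invented sample data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical model that always claims 90% confidence.
confs = [0.9] * 10
overconfident = [True] * 6 + [False] * 4    # right 60% of the time
well_calibrated = [True] * 9 + [False] * 1  # right 90% of the time

ece_bad = expected_calibration_error(confs, overconfident)
ece_good = expected_calibration_error(confs, well_calibrated)
print(f"overconfident: {ece_bad:.2f}, well calibrated: {ece_good:.2f}")
```

A score near zero means stated confidence tracks real accuracy. The overconfident example shows a gap of about 0.3, meaning a threshold like "act when confidence is at least 0.9" would let through far more errors than the score suggests.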
The conversation it changes
Once you start reporting worst-segment numbers next to averages, two things happen in stakeholder meetings:
- You stop celebrating accuracy gains that come from overfitting the easy segments.
- You start having useful arguments about whether a model is good enough for the segment where it has to be.
A 95% average is exciting. A 60% on the segment that matters is the actual story.
Average accuracy is fine as a sanity check. As the headline metric, it's mostly an answer to a question your users aren't asking.