How I Design Confidence Scores for AI Products

In 2018 I watched a media buyer look at a recommendation that was, statistically, better than her own judgment — and override it anyway. That half-second of doubt is the whole problem. It's also the only thing I was hired to design.

The number that said nothing

The platform had built a recommendation engine, and on paper it was very good: across more than 10,000 campaigns, its recommendations were 60% more profitable than buyer intuition. The buyers rejected them 47% of the time anyway. I sat next to one of them for an afternoon to understand why. She wasn't fighting the math. She'd glance at the recommendation, see a bare number next to a "buy" button, and quietly do what she'd already decided to do.

That afternoon reframed the whole brief for me. Media buyers are trained skeptics — they'll take data over instinct, but only when they can see which way the data is pointing and why. A system that says "buy this" and offers nothing else isn't a tool to them; it's a threat to the judgment they're paid for. So they reach for the thing that feels safer: their own call. The model wasn't being out-argued. It was being ignored.

The fix everyone reached for first was a better number — recalibrate, add a decimal, make it look more authoritative. That's the trap. The buyer didn't distrust the score because it was imprecise. She distrusted it because it arrived as a verdict with no way in. What she needed wasn't a more confident number; it was a number she could interrogate and then choose to act on. The moment I started treating the score as the start of a conversation instead of the end of one, rejection started to fall.

A confidence score is a contract. The model commits to a number. The user commits to acting on it. My job is to make both halves of that bargain visible — so the person can see what they're agreeing to before they agree.

What I actually changed

The thing I'm proudest of on that project is the smallest: below a certain confidence, we stopped showing a number at all. A 42% score is worse than useless — it's an invitation to freeze. Replacing it with a plain "not confident enough to recommend — flag for review" let the buyer move on instead of staring at a figure she'd never act on. We spent the rest of the engagement on the same instinct, again and again: don't dress up uncertainty, name it, and always leave a door open for the expert to walk through and disagree.
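For readers who think in code, here is a minimal sketch of that suppression rule in Python. It is an illustration under stated assumptions, not the project's actual implementation: the essay only says "below a certain confidence", so the 0.60 cutoff, the names `render_confidence` and `ScoreDisplay`, and the exact wording of the high-confidence label are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the essay says only "below a certain confidence".
DISPLAY_THRESHOLD = 0.60

# Plain-language replacement for a low score, paraphrasing the essay's wording.
LOW_CONFIDENCE_MESSAGE = "Not confident enough to recommend. Flag for review."


@dataclass(frozen=True)
class ScoreDisplay:
    """What the interface shows in place of a raw model score."""
    label: str          # text shown to the buyer
    show_number: bool   # whether a numeric score appears at all
    needs_review: bool  # whether the item is routed to a human for review


def render_confidence(score: float, threshold: float = DISPLAY_THRESHOLD) -> ScoreDisplay:
    """Turn a model confidence in [0, 1] into a display decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score < threshold:
        # Below the cutoff: name the uncertainty instead of showing a figure.
        return ScoreDisplay(LOW_CONFIDENCE_MESSAGE, show_number=False, needs_review=True)
    # At or above the cutoff: show the number as a whole percentage.
    return ScoreDisplay(f"{score:.0%} confident", show_number=True, needs_review=False)


if __name__ == "__main__":
    for s in (0.42, 0.85):
        print(s, "->", render_confidence(s).label)
```

The design choice worth noticing is that the low-confidence branch returns a different kind of output, a message plus a review flag, and not a smaller or greyed-out number. Where the cutoff sits is a product decision to make with the people using the tool.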

That's the counter-intuitive part. The honest interface — the one that admits where the model is shaky — earned more trust than the optimistic one ever did. An "85% confident!" banner buys you exactly one mistake before a skeptic writes the whole system off. A score that already told you where it might be wrong survives the day it actually is.

What I believe now

Confidence scoring isn't a data-visualization problem. It's a trust problem wearing a number's clothes. The model is rarely the bottleneck in adoption; the half-second of doubt in front of the screen is. Design for that half-second and the rest of the model's value finally gets used. Ignore it, and you've built something statistically excellent that nobody bets on.

The reference version

Want the playbook, not the story? The five confidence-score patterns I lean on — with the trade-offs, the do's and don'ts, and worked examples.

See the pattern reference: Confidence Score Patterns
