Facial Matches Aren't Yes or No. They're Scores.
Here's something that should stop you cold: two facial recognition systems can look at the exact same pair of photographs, use the exact same underlying math, and one will say "same person" while the other says "different person." Not because one is broken. Not because one is better. Because someone made a decision — a quiet, technical, almost invisible decision — about where to draw a line. And that line is everything.
A facial "match" isn't a binary verdict — it's a distance score in 128-dimensional space, and the threshold separating "same person" from "different person" is a human judgment call with massive consequences for reliability.
Most investigators, attorneys, and even technologists treat a facial match like a light switch: on or off, yes or no, match or no match. That mental model feels intuitive. It's also completely wrong. Under the hood, modern facial recognition isn't flipping a switch — it's measuring distance. And understanding that distance, and what you're willing to call "close enough," is the difference between evidence you can stand behind and a result you'd never want to defend in a deposition.
Your Face as a Point in Space
Start with the basics, because the basics are genuinely fascinating. When a modern face recognition model analyzes a photograph, it doesn't "see" a face the way you do. It converts that face into a vector — a list of roughly 128 numerical values that, taken together, encode the geometry of the face. The distance between your pupils, the curvature of your jaw, the ratio of your nose length to your forehead height — relationships like these get compressed into a string of numbers.
That string of numbers is called a face embedding. And here's where it gets interesting: every face becomes a single point in a 128-dimensional space. You can't visualize 128 dimensions (nobody can, and anyone who claims otherwise is lying to you), but the math works exactly the same way it does in two or three dimensions. To compare two faces, the system calculates the straight-line distance between their two points in that space. This is Euclidean distance — the same geometry from your high school math class, just applied across 128 axes simultaneously instead of two.
Small distance? The faces are similar. Large distance? They're different. Simple, elegant, and deeply non-binary. The output isn't "match" or "no match." The output is a number. A score. A measurement.
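The comparison step really is that small. Here is a minimal NumPy sketch; the random vectors below are stand-ins for real embeddings, which would come from a trained model:

```python
import numpy as np

# Hypothetical 128-dimensional face embeddings. A real system produces
# these with a trained neural network; random vectors stand in here
# purely to show the comparison step.
rng = np.random.default_rng(0)
face_a = rng.normal(size=128)
face_b = rng.normal(size=128)

def euclidean_distance(u, v):
    """Straight-line distance between two points, across all axes at once."""
    return float(np.linalg.norm(u - v))

score = euclidean_distance(face_a, face_b)
# The output is a number, not a verdict.
print(f"distance: {score:.3f}")
```

Note what the code does not contain: any notion of "match." That decision happens later, and somewhere else.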
So where does the yes-or-no verdict come from? Someone has to draw a line.
The Threshold: A Philosophy Disguised as a Setting
The threshold is the value you designate as the boundary between "same person" and "different person." Every facial recognition system has one. Most don't advertise it. And almost none of the reports generated from these systems mention it.
Here's what makes this genuinely consequential: the threshold is tunable. Move it lower, and the system becomes more conservative — it only declares a match when two faces are very close together in that 128-dimensional space. You'll miss some real matches, but your false positive rate drops sharply. Move it higher, and you'll catch more true matches, but you'll also start pulling in pairs of faces that aren't the same person at all. Neither setting is "correct." Both are deliberate tradeoffs.
Consider how sensitive this is: shifting the threshold from, say, 0.40 to 0.42 — two hundredths, on a normalized scale where scores typically run from 0 to 1 — can multiply your false positive rate by ten. That's not a bug. That's not a flaw in the algorithm. That is the intended behavior of a system working exactly as designed. The algorithm is doing its job. The question is whether the person who set the threshold understood what they were trading away.
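A toy simulation makes the sensitivity concrete. The impostor-score distribution below is assumed purely for illustration (the exact multiplier depends on the real distribution and where the threshold sits in its tail), but the direction of the effect holds:

```python
import numpy as np

# Synthetic impostor scores: distances between DIFFERENT-person pairs,
# drawn from an assumed distribution for illustration only.
rng = np.random.default_rng(42)
impostor_distances = rng.normal(loc=0.62, scale=0.09, size=100_000)

def false_positives(distances, threshold):
    """Count different-person pairs the system would wrongly call a match."""
    return int(np.sum(distances < threshold))

for t in (0.40, 0.42):
    print(f"threshold {t:.2f}: {false_positives(impostor_distances, t)} false positives")
```

Because the threshold sits in the tail of the impostor distribution, small moves sweep in disproportionately many wrong pairs.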
NIST's Face Recognition Vendor Testing program, which subjects commercial algorithms to rigorous independent testing, has consistently shown that error rates vary dramatically across vendors — not just because of different underlying models, but because of how threshold decisions interact with real-world image quality, demographic variation, and use-case context. Two systems built on identical mathematical foundations can produce opposite verdicts on the same photo pair simply because their thresholds were calibrated for different operating environments.
The BAC Analogy That Should Make You Uncomfortable
Think about blood alcohol content. In most U.S. states, 0.079% BAC is legal. 0.080% is a criminal offense. The biological difference between those two numbers is essentially meaningless — your driving is not measurably safer at 0.079 than at 0.080. But the legal consequence is absolute, because society decided it needed a line, and that line had to live somewhere.
Euclidean distance thresholds work identically. The distance score is a continuum — a smooth, analog measurement of similarity. The threshold is the law. A face pair that scores 0.41 on a system calibrated to flag anything below 0.42 is a "match." The same pair on a system calibrated to 0.39 is "not a match." The faces didn't change. The photographs didn't change. The number changed.
The critical difference from BAC? Blood alcohol thresholds are publicly defined, legally standardized, and disclosed in every DUI case. Facial recognition thresholds are almost never disclosed in reports, rarely standardized across deployments, and frequently unknown even to the investigators relying on them. (That's not an accusation — it's just where the field currently sits, and it matters enormously.)
For investigators and analysts who want to understand how facial comparison actually produces its results, the threshold question is the first place to dig. Not the confidence percentage. Not the match indicator. The threshold.
Why "High Confidence" Means Less Than You Think
Here's the misconception that trips up almost everyone encountering facial recognition output for the first time: a "94% confidence" match sounds more reliable than an "88% confidence" match. It usually isn't — and sometimes the relationship is exactly backwards.
In most systems, confidence scores are just normalized distance values. They describe how far below the threshold a given score landed, expressed as a percentage. A 94% match using an aggressively permissive threshold might represent a face pair sitting at a distance of 0.40, on a system that would flag anything under 0.55. That pair isn't necessarily a strong match — it's just comfortably inside a generous boundary.
An 88% match on a conservative system might represent a face pair at 0.32, on a system that only flags distances under 0.36. That pair is actually much closer together in the underlying space. The more conservative system is working harder to earn its verdict.
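To see why the percentages aren't comparable across systems, here is one plausible normalization — not any particular vendor's formula, just an assumed mapping from distance and threshold to a confidence figure — applied to the same distance under two calibrations:

```python
def confidence_percent(distance, threshold):
    """ASSUMED illustrative formula: how far below the threshold the score
    landed, as a share of the threshold itself. Vendors use different
    mappings; the point is that the result depends on the threshold."""
    if distance >= threshold:
        return 0.0
    return round(100 * (threshold - distance) / threshold, 1)

# The SAME photograph pair, at the SAME underlying distance of 0.32:
permissive = confidence_percent(0.32, threshold=0.55)    # generous boundary
conservative = confidence_percent(0.32, threshold=0.36)  # strict boundary
print(permissive, conservative)  # very different "confidence", same faces
```

Identical faces, identical distance, wildly different percentages. The confidence number is a statement about the threshold as much as about the faces.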
Why the Threshold Question Matters in Practice
- ⚡ Confidence scores aren't standardized — A "94% match" from System A and a "94% match" from System B may represent entirely different levels of actual face similarity depending on each system's threshold calibration.
- 📊 The same image pair can produce opposite verdicts — Threshold differences across vendors and deployments mean a "match" in one context is a "non-match" in another, using identical source photographs.
- 🔮 Threshold disclosure should be standard practice — Any report citing a facial match that doesn't specify the operating threshold is omitting the single most important variable in evaluating that match's reliability.
CaraComp's approach to facial comparison is built around making these underlying scores — and the thresholds applied to them — transparent to analysts rather than hiding them behind a single match indicator. The distance score is real information. Collapsing it into a binary verdict before the analyst ever sees it throws away the most important part.
What You Should Actually Ask When You See a Match
Look, nobody's saying this is simple. Setting a threshold requires genuine expertise, access to validated test data, and a clear understanding of the specific use case. A threshold calibrated for airport screening (where you want to catch everyone, and a false positive just means a second look) is completely wrong for a criminal investigation (where a false positive means implicating the wrong person). One number cannot serve both masters.
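One common way to make that calibration explicit is to derive the threshold from a target false-positive rate on validation data. The sketch below assumes that approach and a synthetic validation set; real calibration would also need genuine-pair scores and data matching the deployment's image conditions:

```python
import numpy as np

def threshold_for_target_fpr(impostor_distances, target_fpr):
    """Pick the threshold whose false-positive rate on a validation set of
    impostor (different-person) distances is approximately target_fpr.
    Sketch only: a full calibration also examines genuine-pair scores."""
    return float(np.quantile(np.asarray(impostor_distances), target_fpr))

# Synthetic validation scores, assumed for illustration.
rng = np.random.default_rng(7)
impostors = rng.normal(loc=0.60, scale=0.08, size=50_000)

# Screening context: tolerate more false positives to catch more true matches.
screening_t = threshold_for_target_fpr(impostors, target_fpr=0.01)
# Investigative context: a false positive implicates the wrong person.
investigative_t = threshold_for_target_fpr(impostors, target_fpr=0.0001)
print(screening_t, investigative_t)
```

The investigative threshold comes out tighter than the screening one, because it is answering a different question: not "who might this be?" but "whom am I prepared to accuse?"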
But the response to complexity isn't to wave it away. The response is to ask better questions. When a facial match shows up in a report or on your screen, the questions that actually matter are not "how confident is the system?" They are: What was the threshold? What was the actual distance score? Was this threshold validated against a dataset that resembles the image conditions I'm working with? Is this threshold calibrated for my use case or someone else's?
A facial match is not a conclusion — it's a measurement. The threshold that converts that measurement into a "yes" or "no" is a human decision, and it should be disclosed, documented, and defensible just like any other methodological choice in a forensic report.
The threshold isn't a feature buried in a settings menu. It's the entire argument. It's the claim that this much similarity is enough to say "same person" — and that claim needs to be made explicitly, not hidden inside a confidence percentage that sounds authoritative but tells you almost nothing about the decision underneath it.
So the next time you see a "high confidence" facial match and feel ready to trust it — ask yourself one question first. Not "is the score high?" Ask: high relative to what threshold, set by whom, calibrated for which conditions, and disclosed where in this report?
If you can't answer that, the number isn't evidence. It's just a very convincing-looking guess.
Ready to try AI-powered facial recognition?
Match faces in seconds with CaraComp. Free 7-day trial.
Start Free Trial
