That Facial Match Score Is Lying to Your Face
Here's something that should stop you cold: the best deepfake detection systems in the world still miss roughly 4 in every 100 synthetic faces. That's not a rounding error. In a high-stakes investigation involving thousands of images, those four misses aren't an acceptable margin — they're the ones that matter most. And the reason they slip through has nothing to do with how sophisticated the deepfake is. It has everything to do with a fundamental misunderstanding of what facial comparison systems are actually doing when they "analyze" a face.
They're not looking at it. Not the way you are. They're doing geometry.
Facial comparison systems convert every face into a string of 128 numbers, then measure the mathematical distance between those numbers — and understanding where that process breaks down is the difference between a defensible match and a catastrophic error.
Your Eye vs. The Algorithm: Two Completely Different Questions
When you look at two photos and decide they show the same person, your brain is doing something remarkably sophisticated and remarkably unreliable. It's pattern-matching on visual texture — the curve of a nose, the spacing between eyes, the particular shadow a jawline casts. You're drawing on years of evolved face-recognition instinct, and you're doing it in milliseconds. It feels certain. It almost never is.
A facial comparison algorithm isn't doing any of that. When it processes a face, it runs the image through a deep neural network — specifically one trained on millions of face pairs — and the output isn't a visual impression. It's a vector: an ordered list of numbers, typically 128 of them, that encodes the geometric relationships between facial features into a fixed-length numerical signature. Two photos of the same person should produce two very similar vectors. Two photos of different people should produce vectors that are far apart. The whole game is in how you define "similar" and "far apart."
This is the foundation of modern facial comparison, and it was formalized in Google's landmark FaceNet research, published on arXiv — a system that achieves state-of-the-art face recognition performance using just 128 bytes per face. That paper established the blueprint that almost every serious facial comparison system today still follows.
The Pipeline: What Actually Happens Between "Face" and "Match Score"
Most people — including many investigators who use facial comparison tools daily — think of a match score as something the algorithm simply "decides." It doesn't. That score is the end product of a four-step pipeline, and understanding each step is the only way to know when to trust the result.
Step one: detection. Before anything else, the system has to find the face in the image. Not as obvious as it sounds. A face at a 45-degree angle, partially behind a door frame, or lit from below challenges detection models in ways that don't affect human perception at all. If the detector crops the face incorrectly, every subsequent step processes corrupted input — and the system has no way of flagging that this happened.
Step two: landmark alignment. Once detected, the face gets geometrically normalized — rotated and scaled so key landmarks (eyes, nose tip, mouth corners) land in consistent positions. This alignment step is what makes the embedding reproducible across different photos of the same person. Skip it or get it wrong, and the vectors you produce are essentially random.
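To make the alignment step concrete, here is a minimal numpy sketch. It assumes a landmark detector has already supplied the pixel coordinates of the two eye centers; the coordinates and the 70-pixel target spacing are illustrative values, not taken from any particular system. The transform rotates and scales so the eyes come out level and a fixed distance apart — the normalization that makes embeddings reproducible.

```python
import numpy as np

def alignment_transform(left_eye, right_eye, target_eye_dist=70.0):
    """Build a 2x2 rotate-and-scale matrix that levels the eyes and
    normalizes the distance between them (translation omitted for brevity)."""
    left = np.asarray(left_eye, dtype=float)
    right = np.asarray(right_eye, dtype=float)
    dx, dy = right - left
    angle = np.arctan2(dy, dx)                  # in-plane tilt of the face
    scale = target_eye_dist / np.hypot(dx, dy)  # normalize inter-eye distance
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    return np.array([[c, s], [-s, c]])

# A face tilted so the right eye sits 40 pixels lower than the left:
M = alignment_transform(left_eye=(100, 120), right_eye=(170, 160))
eye_vector = np.array([70.0, 40.0])  # right eye relative to left, before alignment
aligned = M @ eye_vector             # ~[70, 0]: eyes level, 70 px apart
```

Get the landmark coordinates wrong, and this same matrix cheerfully "normalizes" the face into garbage — which is exactly the silent failure the pipeline can't flag on its own.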
Step three: embedding. The aligned face patch passes through the neural network. Out comes your 128-number vector. Think of it as the face's address in a vast 128-dimensional space. Two photos of Alice land her address at roughly [0.23, -0.45, 0.78...]. Two photos of Bob land somewhere completely different. The network learned how to build this address space during training — by processing millions of face pairs and learning which numerical combinations reliably separate identities.
Step four: distance calculation and threshold comparison. The system computes the Euclidean distance between two embedding vectors. A distance of 0.0 means the faces are mathematically identical. A distance of 4.0 corresponds to two clearly different people. Somewhere in between sits the threshold — the line the system draws between "same person" and "different person." Cross below it: match. Cross above it: no match. That threshold number is not handed down from mathematical heaven. Someone chose it.
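Steps three and four together fit in a few lines of numpy. Everything below is a toy sketch: the random unit vectors stand in for real embeddings, and the 1.1 threshold is an illustrative FaceNet-style cutoff, not a calibrated value for any production system.

```python
import numpy as np

def normalize(v):
    """FaceNet-style embeddings live on the unit hypersphere."""
    return v / np.linalg.norm(v)

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def is_match(a, b, threshold=1.1):
    """The verdict is just a comparison: below the threshold means 'same person'."""
    return euclidean_distance(a, b) < threshold

rng = np.random.default_rng(42)
alice_1 = normalize(rng.normal(size=128))                   # stand-in embedding
alice_2 = normalize(alice_1 + 0.05 * rng.normal(size=128))  # same face, new photo
# Project out alice's direction so this toy identity is unrelated (orthogonal):
bob = rng.normal(size=128)
bob = normalize(bob - np.dot(bob, alice_1) * alice_1)

print(euclidean_distance(alice_1, alice_2))   # small: same neighborhood
print(euclidean_distance(alice_1, bob))       # ~1.414 (sqrt(2)) for unrelated vectors
print(is_match(alice_1, alice_2), is_match(alice_1, bob))
```

Note what the code does not do: nothing in `is_match` checks whether the inputs were clean. Feed it an embedding built from an occluded face and it returns a verdict anyway.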
"Deepfake detection tools spot microscopic giveaways that generative models leave behind — unnatural pixel patterns, bizarre color shifts, and other artifacts that are completely invisible to humans." — ScreenApp, on the limits of human visual inspection
The Threshold Problem: Where Investigator Intuition Goes Wrong
Here's the analogy that makes this click. Imagine face comparison like converting every person into GPS coordinates in a 128-dimensional city. Two photos of the same person drop their coordinates within a few meters of each other. Two different people live on opposite ends of town. The algorithm's job is simply to measure the distance and decide: same neighborhood, or different city?
But here's what the analogy reveals that people miss: if the GPS signal is weak — bad lighting, unusual angle, a hand partially covering the face — the coordinates get scrambled. The algorithm still produces coordinates. It still calculates a distance. It still renders a verdict. It just doesn't know the signal was corrupted. You do, if you understand what causes corruption.
The most common misconception investigators carry is this: a high confidence score means a reliable result. A 95% match sounds like an A. It feels safe. It's the kind of number you'd want to show a jury. The problem is that confidence scores are local measurements — they tell you how close two specific embeddings are to each other, given the specific conditions of those two images. They say nothing about how the algorithm would perform across a database of 100,000 faces, or whether the embedding itself was generated from clean input.
Peer-reviewed research published on ScienceDirect examined FaceNet performance under real-world occlusion conditions — sunglasses, hats, hands obscuring part of the face — and the findings are genuinely alarming. At 30% and 40% facial occlusion rates, recognition accuracy fell below 40%. Below 40%. That means the algorithm was wrong more than half the time. Not slightly degraded. Functionally broken. And yet it was still producing match scores. Still rendering verdicts. Still looking confident.
This is why occlusion is the hidden failure mode that almost nobody talks about in operational briefings. Real surveillance footage is occluded. Suspects wear hats. Witnesses instinctively raise a hand. The scenarios where you most need reliable facial comparison are precisely the scenarios where the embedding pipeline is most likely to produce corrupted output.
What You Just Learned
- 🧠 Facial comparison is geometry, not vision — the algorithm measures distance between 128-number vectors, not visual similarity between faces
- 🔬 Thresholds are human choices, not mathematical facts — the line between "match" and "no match" was set by someone, and it can be set wrong for your specific use case
- ⚠️ Occlusion corrupts the embedding, not just the score — a face obscured at 30-40% can drop algorithm accuracy below 40%, while the system continues producing confident-looking output
- 💡 A confidence score is local, not global — it describes this pair of images, not the algorithm's general reliability under the conditions you're working in
How Training Shapes What the Algorithm "Knows"
One more layer worth understanding: where do those 128 numbers come from in the first place? The network didn't arrive at them by logic. It learned them through a training process called triplet loss — and the mechanics of that process explain a lot about where the system fails.
During training, the network is fed triplets: an anchor face, a positive match (same person, different photo), and a negative (different person entirely). The training objective is simple to state and brutal to execute at scale — push the anchor and positive closer together in 128-dimensional space; push the anchor and negative further apart. Do this millions of times across millions of face triplets, and the network eventually learns an embedding space where identity is encoded as proximity.
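The objective itself is compact enough to write down. Below is a toy numpy version of the triplet loss; the 0.2 margin matches the alpha used in the FaceNet paper, while the tiny 3-number "embeddings" are purely illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a-p||^2 - ||a-n||^2 + margin): the loss is zero only when
    the negative is at least `margin` farther (in squared distance) than
    the positive -- exactly the push/pull described above."""
    d_pos = float(np.sum((anchor - positive) ** 2))  # pull this pair closer
    d_neg = float(np.sum((anchor - negative) ** 2))  # push this pair apart
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([0.0, 0.0, 1.0])
positive = np.array([0.1, 0.0, 1.0])   # same person, different photo
negative = np.array([1.0, 1.0, 0.0])   # different person

print(triplet_loss(anchor, positive, negative))  # 0.0: triplet already satisfied
print(triplet_loss(anchor, negative, positive))  # > 0: gradient would correct this
```

Repeat that correction across millions of triplets and identity becomes proximity — but only for the kinds of faces and conditions the triplets actually contained.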
But here's what that means for deployment: the network only knows what its training data taught it. If those millions of training faces were predominantly well-lit, frontal, high-resolution images — the kind you get from a controlled dataset rather than real-world surveillance — then the embedding space it learned may be genuinely excellent in those conditions and genuinely fragile in others. At CaraComp, this is precisely the problem that drives how we think about model validation: a system that scores beautifully on benchmark datasets can behave unpredictably the moment the real world hands it conditions that weren't well-represented during training.
Modern deepfake generation, meanwhile, is getting very good at producing faces that land in exactly the right neighborhood of embedding space — close enough to a real person's vector to fool a threshold comparison, while being entirely synthetic. As Holistic News put it plainly: visual inspection is no longer a sufficient defense. The era of "spotting a deepfake by eye" isn't ending because humans are bad at it. It's ending because the synthetic faces being generated were specifically optimized to defeat human visual inspection — and increasingly, to defeat threshold comparison too.
A facial comparison match score tells you how close two embeddings are in 128-dimensional space — it does not tell you whether those embeddings were generated from clean, unoccluded input. Before you trust a score, ask what conditions produced it. The algorithm can't ask that question for you.
So here's the question that should follow you out of this article: if you're comparing a grainy surveillance still — face partially obscured by a cap brim, shot at an angle, maybe 30% of the face genuinely missing — against a clean, frontal mugshot, you will get a distance score. The math will be correct. The distance between those two vectors is real. But the embedding generated from that surveillance image? It's corrupted. The algorithm encoded a partial face, normalized it as if it were complete, and produced 128 numbers that don't reliably represent the person's actual identity. The system doesn't know that. The score doesn't show that. The only thing standing between a defensible conclusion and a catastrophic misidentification is whether the investigator reading that score understands what the pipeline actually did to produce it.
Your eye sees a face. The algorithm sees a string of 128 numbers. The question worth asking isn't which one you trust — it's whether you know enough about how those 128 numbers were generated to decide when they're worth trusting at all.
When you're deciding whether two photos show the same person, what's your current process — and at what point would you feel confident enough to defend that conclusion under cross-examination?
Ready for forensic-grade facial comparison?
2 free comparisons with full forensic reports. Results in seconds.
