
A Facial Recognition 'Match' Isn't Evidence Until It Survives These 4 Hidden Steps

Here's a number that should stop you cold: a 50-percentage-point drop in accuracy can happen before the algorithm even gets a fair shot — simply because the inter-eye pixel distance in the probe image falls below 24 pixels. That's a resolution level you'll encounter constantly in real operational footage. The algorithm doesn't warn you. The confidence score doesn't flag it. The system just quietly becomes half as reliable as the benchmark chart promised, and the result lands on your screen looking exactly the same as it would if the image quality were perfect.

TL;DR

A facial comparison result is not a conclusion — it's a signal that needs to survive four distinct human and technical filters before it belongs anywhere near a report, a client, or a courtroom.

This is the gap between what people think facial recognition does and what it actually does. Most people's mental model is: upload photo, algorithm compares, system says "match," case closed. That model is approximately as accurate as thinking your GPS knows where you are because it's connected to satellites — technically true, but missing about six layers of engineering that determine whether the thing is actually right.

In serious investigations, the algorithm's output is where the story starts, not where it ends. And understanding the four hidden steps between a raw similarity score and a result you can put your name on is, increasingly, the difference between rigorous professional practice and expensive mistakes.


Step One: The Quality Check That Has Nothing to Do with the Algorithm

Before any comparison algorithm runs, something more fundamental has to happen: the input image has to be assessed on its own terms. Not "is this face similar to that face?" but "is this image even workable?"

Research published in the International Journal of Legal Medicine found a direct, measurable relationship between image quality scores and match outcomes — high quality scores correlated with correct matches, low quality scores correlated with incorrect matches, and critically, high exposure was linked to false negatives while low exposure was linked to false positives. These aren't edge cases. They're the physics of how light and resolution interact with the feature-extraction process.

What does "image quality" actually mean here? It's not aesthetics. It's a set of measurable parameters: resolution (measured in inter-eye pixel distance), pose angle (algorithms trained on frontal faces start losing accuracy meaningfully when the face rotates beyond 30 degrees), illumination uniformity, occlusion percentage, and compression artifacts from CCTV encoding. Pose angles beyond 30 degrees can reduce match confidence scores by 30–40% even on top-performing algorithms — yet the score appears on screen as if that degradation never happened.

This is why experienced investigators treat the quality check as a first-line gate, not an afterthought. If the probe image fails a quality assessment, the downstream comparison score is not just imprecise — it's potentially misleading in ways that aren't self-announcing.
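A first-line quality gate of this kind can be sketched in a few lines. The specific limits below (24 px inter-eye distance, 30° pose) come from the figures discussed above; the occlusion cap and all names are illustrative assumptions, not any particular vendor's implementation.

```python
# Minimal quality-gate sketch. Thresholds for resolution and pose follow the
# figures cited in this article; the occlusion limit is a made-up placeholder.
from dataclasses import dataclass

MIN_INTER_EYE_PX = 24   # below this, large accuracy drops have been observed
MAX_POSE_DEG = 30       # frontal-trained algorithms degrade beyond this angle
MAX_OCCLUSION = 0.25    # illustrative cap on obscured face area

@dataclass
class ProbeQuality:
    inter_eye_px: float
    pose_deg: float
    occlusion_frac: float

def quality_gate(q: ProbeQuality) -> list[str]:
    """Return failure reasons; an empty list means the probe passes the gate."""
    reasons = []
    if q.inter_eye_px < MIN_INTER_EYE_PX:
        reasons.append(f"resolution: inter-eye distance {q.inter_eye_px:.0f}px "
                       f"below {MIN_INTER_EYE_PX}px")
    if abs(q.pose_deg) > MAX_POSE_DEG:
        reasons.append(f"pose: {abs(q.pose_deg):.0f}° exceeds {MAX_POSE_DEG}° limit")
    if q.occlusion_frac > MAX_OCCLUSION:
        reasons.append(f"occlusion: {q.occlusion_frac:.0%} of face obscured")
    return reasons
```

The point of returning reasons rather than a boolean is that a failed gate should be documented in the case file, not silently swallowed.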

50 points — accuracy drop observed when inter-eye pixel distance falls below 24 pixels, a resolution level common in operational surveillance footage. (Source: CaraComp operational accuracy research)

Step Two: The Score Isn't What You Think It Is

Let's say the image passes quality checks. The algorithm runs. A score appears — say, 0.94. Most people read this as "94% certain this is a match." That reading is understandable. It's also wrong, and understanding exactly why it's wrong is the cognitive shift that separates careful practitioners from overconfident ones.

A recognition confidence score describes the similarity between two templates extracted from the probe and reference images. It does not describe the probability that the two images show the same person. Those are different questions. And here's the kicker: the score is also shaped by image quality itself. A lower score can indicate poor quality images rather than less similarity between the people pictured. The number you're looking at conflates two separate things — actual dissimilarity and extraction degradation — into a single value, without telling you which one is driving it.

Think of it like a thermometer that changes its measurement scale based on ambient conditions. If the temperature reads 98.6°F, you don't know whether that person is healthy or whether the thermometer is running cold because it's been sitting in a drafty room. You need to know the instrument's state to interpret the reading. Same principle here.

Why do people get this wrong? Because numbers that look like percentages trigger a deeply ingrained mental shortcut — we treat them as probability statements. An algorithm that returns "0.94" is doing something technically precise, but not epistemically complete. The precision of the output implies confidence that the process doesn't actually deliver on its own.
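The conflation of dissimilarity and extraction degradation is easy to demonstrate with a toy experiment: compare a template against noisy copies of itself. The embeddings and noise levels below are entirely synthetic, standing in for "good probe" versus "bad probe" conditions — same person in both cases, very different scores.

```python
# Toy demo: identical "identity", different similarity scores, purely because
# of extraction noise (a stand-in for poor image quality). Synthetic data only.
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(128)]  # fake 128-dim template

def degraded(template, noise_sigma):
    """Simulate feature-extraction degradation from a low-quality probe."""
    return [x + random.gauss(0, noise_sigma) for x in template]

clean_score = cosine(reference, degraded(reference, 0.1))  # good probe
poor_score = cosine(reference, degraded(reference, 1.0))   # bad probe, same person
```

Nothing about the subject changed between the two comparisons — only the simulated image quality did. A score read in isolation cannot tell you which of the two effects it is reporting.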

"The conversation cannot stop at whether the system performs well technically; organisations also need to consider how the technology is governed, how data is stored, who has access to it and whether its use can be clearly justified." — Startups Magazine


Step Three: The Threshold Is a Risk Decision, Not a Technical Setting

Here's where it gets genuinely interesting — and where most people's understanding of facial recognition has a blind spot the size of a barn door.

Every facial comparison system operates against a threshold: a cutoff score below which the system doesn't return a match candidate. This threshold appears to be a technical parameter. It is not. It is a risk decision, made by human beings, about which kind of error matters more in your specific context.

Set the threshold too high, and you'll miss genuine matches — legitimate subjects who should appear in your candidate list won't surface because their image quality dragged the score below the cutoff. Set it too low, and you'll be wading through low-quality candidates that waste investigator time and create false leads. As the Bipartisan Policy Center has documented through NIST benchmark data, at least six of the most accurate identification algorithms had higher false-positive rates for one demographic group at one threshold but lower false-positive rates at a different threshold. The same algorithm. Different threshold. Different demographic error profile. The "neutral" setting doesn't exist.

This is precisely why the National Institute of Standards and Technology measures algorithm performance at specifically defined false match rates — 0.001% and 0.0001% — rather than at a single universal threshold. The point is to make the tradeoff explicit. Every threshold is a statement about what your organization is willing to risk. Treating it as a default technical setting is the same as leaving that risk decision to whoever installed the software.
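Setting a threshold to hit a target false match rate, in the spirit of NIST's fixed-FMR reporting, can be sketched from labeled comparison trials. The score lists here are synthetic placeholders; in practice they come from genuine-pair and impostor-pair trials run on your own operational data.

```python
# Sketch: pick the operating threshold empirically from a target false-match
# rate, then report the false-non-match rate you pay for it. Illustrative only.

def threshold_for_fmr(impostor_scores, target_fmr):
    """Smallest observed threshold at which the fraction of impostor scores
    at or above it does not exceed target_fmr."""
    s = sorted(impostor_scores)
    n = len(s)
    for i, t in enumerate(s):
        if (n - i) / n <= target_fmr:
            return t
    return s[-1] + 1e-9  # stricter than any observed impostor score

def fnmr_at(genuine_scores, threshold):
    """Fraction of genuine comparisons that would be missed at this threshold."""
    return sum(g < threshold for g in genuine_scores) / len(genuine_scores)
```

Running both functions side by side makes the tradeoff explicit: tightening the target FMR pushes the threshold up, and the FNMR — the genuine matches you lose — rises with it. That pair of numbers, not the threshold itself, is the risk decision.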


Step Four: The Human Review That Cannot Be Shortcut

The algorithm produces a ranked candidate list. The threshold determines what appears on that list. Then comes the step that no amount of algorithmic sophistication replaces: a trained investigator looks at the images, side by side, and makes a judgment.

This isn't a formality. Research in forensic facial comparison, including peer-reviewed work available via ScienceDirect, establishes that forensic comparison systems need calibrated confidence measures, not raw scores — with the preferred approach being a score-based likelihood ratio that places the algorithm's output within a statistical framework. What this means practically is that translating a similarity score into evidentiary weight requires human expertise in reading the quality factors that shaped that score, not just the score itself.
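The core idea of a score-based likelihood ratio can be shown with a deliberately minimal model: fit simple Gaussians to genuine and impostor score distributions, then express a new score as the ratio of the two densities. Real forensic calibration uses more careful methods (logistic or pool-adjacent-violators calibration, quality-conditioned models); this sketch only illustrates the shape of the reasoning.

```python
# Minimal score-based likelihood ratio (SLR) sketch with Gaussian models.
# Real systems use more robust calibration; this shows the idea only.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def fit(scores):
    """Sample mean and standard deviation of a score list."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return mean, math.sqrt(var)

def score_lr(score, genuine_scores, impostor_scores):
    """LR > 1 supports the same-source hypothesis; LR < 1 supports different-source."""
    gm, gs = fit(genuine_scores)
    im, istd = fit(impostor_scores)
    return gaussian_pdf(score, gm, gs) / gaussian_pdf(score, im, istd)
```

The output is no longer "0.94" floating free of context — it is a statement about how much more likely that score is under one hypothesis than the other, given the distributions the system actually produces.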

At CaraComp, this is the step where the actual investigative value gets realized — where an analyst's understanding of what drove a particular score (image quality degradation? pose angle? partial occlusion?) transforms a raw number into a defensible conclusion. The algorithm's job is to prioritize. The analyst's job is to evaluate. Those are not the same job, and the analyst cannot skip the algorithm any more than the algorithm can replace the analyst.

What You Just Learned

  • 🧠 Image Quality Check — resolution, pose angle, exposure, and occlusion assessment before any comparison runs
  • 🔬 Algorithm Score & Threshold — similarity score generation against a threshold that reflects an explicit risk tolerance decision
  • 👁️ Investigator Visual Review — trained side-by-side comparison that contextualizes the score against quality factors
  • 💡 Report & Risk Decision — calibrated confidence statement, not a binary match/no-match, suitable for a case file or legal context
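The four steps above can be condensed into one pipeline sketch. Every component here is a toy placeholder for the real machinery described in this article — the structure, not the internals, is the point: no score reaches a report without passing the quality gate, the threshold, and the analyst.

```python
# Pipeline sketch: quality gate -> score -> threshold -> human review.
# All components are illustrative placeholders.

def quality_check(probe):
    """Step 1: gate on input quality before any comparison runs."""
    return [] if probe["inter_eye_px"] >= 24 else ["inter-eye distance below 24px"]

def compare(probe, reference):
    """Step 2: placeholder similarity score (a real system extracts templates)."""
    return 0.94

def run_pipeline(probe, reference, threshold, analyst_review):
    issues = quality_check(probe)
    if issues:
        return {"status": "rejected", "reasons": issues}
    score = compare(probe, reference)
    if score < threshold:  # Step 3: the threshold is an explicit risk decision
        return {"status": "no_candidate", "score": score}
    # Step 4: human review turns a raw score into a calibrated conclusion
    return {"status": "candidate", "score": score,
            "assessment": analyst_review(probe, reference, score)}
```

Note that the analyst callback receives the probe and reference alongside the score — the review is of the images and the quality factors, never of the number alone.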
Key Takeaway

A confidence score tells you how well the algorithm extracted features from a specific image. It does not tell you how confident you should be in the match. Those are different questions — and conflating them is where investigations, and boardroom risk decisions, go wrong.

NIST's own face verification testing data shows that recognition accuracy has improved dramatically since 2013 — miss rates averaging 0.1% on high-performing algorithms in controlled conditions, with software from 2018 performing at least 20 times better than 2014 equivalents. But those numbers describe algorithm performance at its ceiling, on structured, high-quality datasets. Every piece of operational footage, every angled surveillance image, every low-light capture sits somewhere below that ceiling. The question is never "is this algorithm good?" The question is always "what's the quality of this specific input, and how far below peak performance is this specific comparison running?"

That question cannot be answered by looking at the score alone. Which is why the next time you see a facial match result — from any system, for any purpose — the first manual check isn't "is the score high enough?" It's "what was the quality of the image that produced this score?" One of those questions has an answer baked into the output. The other one requires you to go looking. That's exactly where rigorous practice lives.

When you get a "match" result in your work, what's the first manual check you run before you're willing to put your name on it? We'd genuinely like to know — the answer varies more across disciplines than most people expect.

Ready for forensic-grade facial comparison?

2 free comparisons with full forensic reports. Results in seconds.

Run My First Search