A Facial Recognition 'Match' Isn't Evidence Until It Survives These 4 Hidden Steps


This episode is based on our article: "A Facial Recognition 'Match' Isn't Evidence Until It Survives These 4 Hidden Steps." Read the full article →

Full Episode Transcript


A confidence score of ninety-five percent sounds rock solid. But according to research published by CaraComp, that same algorithm's accuracy can plummet by fifty percentage points — half its performance, gone — when the image it's working with drops below a specific resolution threshold. The algorithm doesn't warn you. The number on screen just quietly becomes unreliable.



That should matter to you whether you've ever run a facial comparison search or you've simply unlocked your phone with your face this morning. Because facial recognition technology is making decisions that affect criminal investigations, airport security lines, and the photos you post online. And the uncomfortable truth is that most people — including many professionals who use these systems daily — misunderstand what the output actually means. If that feels unsettling, it should. But understanding how the process really works is exactly how you stop feeling powerless about it. There are four hidden steps between a facial recognition algorithm producing a score and that score meaning anything useful. So what are those steps, and why does almost no one talk about them?

The first gate is image quality, and it controls everything that happens next. Before any algorithm runs a comparison, the image itself has to be good enough for the math to work. According to a forensic facial comparison study in the International Journal of Legal Medicine, high image quality scores correlated with correct matches. Low quality scores correlated with incorrect ones. Specifically, low exposure was linked to false positives — the system saying two different people are the same person. High exposure was linked to false negatives — the system missing a real match entirely. For someone investigating a crime, that means the lighting in a gas station camera could be the difference between identifying the right person and accusing the wrong one. For the rest of us, it means a blurry tagged photo of you could behave very differently inside these systems than your driver's license picture would.
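
To make the exposure point concrete, here is a minimal sketch in Python of what a pre-comparison exposure check could look like. The function name and the luminance cutoffs (60 and 190) are illustrative assumptions, not values from the study or from CaraComp's pipeline.

    import numpy as np

    def exposure_risk(gray_image: np.ndarray) -> str:
        """Flag which error direction a poorly exposed image is likely to produce.
        gray_image: 2-D array of pixel intensities in [0, 255].
        The 60/190 cutoffs are illustrative, not published thresholds."""
        mean_luma = float(gray_image.mean())
        if mean_luma < 60:    # underexposed: the study linked this to false positives
            return "low exposure: elevated false-positive risk"
        if mean_luma > 190:   # overexposed: the study linked this to false negatives
            return "high exposure: elevated false-negative risk"
        return "exposure within nominal range"

    # A dark frame like a night-time gas station camera might produce:
    frame = np.full((480, 640), 35, dtype=np.uint8)
    print(exposure_risk(frame))   # -> low exposure: elevated false-positive risk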

So what counts as "too low quality"? According to CaraComp's analysis of benchmark versus operational accuracy, when the distance between a person's eyes in the image falls below twenty-four pixels, accuracy drops by about fifty percentage points compared to high-resolution benchmarks. Twenty-four pixels is tiny. On most surveillance cameras, that threshold gets crossed more often than you'd expect. And once the light source moves past about thirty degrees off center, match confidence scores can fall by thirty to forty percent — even on top-ranked algorithms. The image doesn't have to look terrible to a human eye. It just has to be degraded enough that the algorithm's feature extraction starts falling apart.
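
Those two numbers are easy to turn into a gate. Here is a minimal sketch, assuming a face detector that reports eye centers in pixel coordinates plus an estimated light-source angle; only the two thresholds come from the figures above, and the function itself is hypothetical, not any vendor's implementation.

    import math

    MIN_INTEROCULAR_PX = 24    # below this, ~50-point accuracy drop (CaraComp)
    MAX_LIGHT_ANGLE_DEG = 30   # beyond this, confidence can fall 30-40%

    def passes_quality_gate(left_eye, right_eye, light_angle_deg):
        """left_eye / right_eye: (x, y) pixel coordinates from any face detector."""
        interocular = math.hypot(right_eye[0] - left_eye[0],
                                 right_eye[1] - left_eye[1])
        if interocular < MIN_INTEROCULAR_PX:
            return False, f"inter-eye distance {interocular:.0f}px is below {MIN_INTEROCULAR_PX}px"
        if abs(light_angle_deg) > MAX_LIGHT_ANGLE_DEG:
            return False, f"light source {light_angle_deg:.0f} deg off center exceeds {MAX_LIGHT_ANGLE_DEG} deg"
        return True, "image meets minimum quality thresholds"

    # Typical distant surveillance crop: eyes only 18 pixels apart
    print(passes_quality_gate((100, 120), (118, 121), light_angle_deg=10))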




Now, the article from Startups Magazine makes a point that reframes this entire conversation. The confidence score you see on screen — that number like zero-point-nine-five — isn't telling you how certain the system is about the match. It's telling you how well the algorithm extracted and compared features given that specific input image. Those are fundamentally different things. People assume the score works like a probability statement because it looks like one. A number between zero and one feels like a percentage of certainty. That's a completely reasonable assumption, and it's wrong. If the input image is degraded, the score degrades with it — silently. The system doesn't flag that its own reading became less trustworthy. It's like a thermometer that changes its scale based on the temperature of the room, without telling you it did so. You don't know if the temperature actually dropped or if the instrument just lost accuracy.
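
A toy simulation makes the thermometer problem visible. The vectors below stand in for face embeddings, cosine similarity stands in for the matcher, and added noise stands in for image degradation; none of this is any vendor's real model. It simply shows the score moving while the identity stays fixed.

    import numpy as np

    rng = np.random.default_rng(0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    identity = rng.normal(size=512)                              # stand-in for a face embedding
    clean_probe = identity + rng.normal(scale=0.1, size=512)     # good photo of same person
    degraded_probe = identity + rng.normal(scale=0.8, size=512)  # blurry, low-res photo of same person

    print(f"clean image score:    {cosine(identity, clean_probe):.2f}")
    print(f"degraded image score: {cosine(identity, degraded_probe):.2f}")
    # Same person both times; only input quality changed, yet the
    # "confidence" number drops. Nothing in the score itself says why.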

The third hidden step is the threshold — the cutoff line a system uses to decide what counts as a potential match and what gets discarded. And this is where the technology becomes a business decision wearing a lab coat. According to the Bipartisan Policy Center's review of N.I.S.T. testing data, at least six of the most accurate identification algorithms showed higher false positive rates for one demographic group at one threshold setting, but lower false positive rates for that same group at a different threshold. Swap the threshold, swap which group gets more errors. There's no neutral setting. Set the threshold too high, and the system might not return enough candidates to find the actual person you're looking for. Set it too low, and human reviewers get flooded with bad matches, which buries the real one in noise. Every threshold is a choice about which kind of mistake an organization is willing to tolerate. That's not a technical decision. That's a values decision. And for anyone who's ever worried about being misidentified by one of these systems, this is exactly where that risk lives — not in the algorithm itself, but in who chose the threshold and why.
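
You can watch that threshold flip happen with two made-up score distributions. The numbers below are synthetic, chosen only to reproduce the mechanism the N.I.S.T. data revealed, not to model any real demographic group.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic impostor (different-person) similarity scores for two groups.
    group_a = rng.normal(loc=0.30, scale=0.12, size=100_000)
    group_b = rng.normal(loc=0.38, scale=0.05, size=100_000)

    for threshold in (0.40, 0.55):
        fpr_a = (group_a > threshold).mean()   # false positives: impostor scores over the line
        fpr_b = (group_b > threshold).mean()
        worse = "A" if fpr_a > fpr_b else "B"
        print(f"threshold {threshold:.2f}: FPR A={fpr_a:.3%}  B={fpr_b:.3%}  -> group {worse} gets more false positives")

At a threshold of 0.40, group B sees more false positives; raise it to 0.55 and group A does. Same algorithm, same scores, different choice.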

The fourth step is human review, and research from the forensic science literature makes clear that raw similarity scores need statistical calibration — specifically something called a score-based likelihood ratio — before they carry any evidentiary weight. That's a method for translating a raw number into a statement about how much that score should actually shift your belief. Without that calibration, an investigator is essentially left to interpret by intuition a number that was never designed to be interpreted that way. According to N.I.S.T.'s own testimony, algorithm performance has improved dramatically since twenty-thirteen, with software in twenty-eighteen performing at least twenty times better than twenty-fourteen versions. But those improvements were measured on controlled datasets — structured mugshot photos with good lighting and resolution. The gap between that laboratory ceiling and what happens with real-world surveillance footage is enormous, and no benchmark number can tell you how far below that ceiling your specific image sits.
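
For the curious, here is roughly what a score-based likelihood ratio does: estimate how scores distribute for true matches and for non-matches, then ask which hypothesis better explains the score you observed. This sketch uses synthetic reference scores and SciPy's kernel density estimator; a real forensic calibration would be validated against data matched to the case conditions.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(2)

    # Reference score distributions. In practice these must come from
    # validation data relevant to the case; here they are synthetic.
    same_source_scores = rng.normal(loc=0.75, scale=0.10, size=5_000)
    diff_source_scores = rng.normal(loc=0.35, scale=0.12, size=5_000)

    f_same = gaussian_kde(same_source_scores)   # density of scores for true matches
    f_diff = gaussian_kde(diff_source_scores)   # density of scores for non-matches

    def likelihood_ratio(score: float) -> float:
        """How much more probable this score is under 'same person' than 'different people'."""
        return float(f_same(score)[0] / f_diff(score)[0])

    for s in (0.50, 0.65, 0.80):
        print(f"score {s:.2f} -> LR ~ {likelihood_ratio(s):.1f}")

An LR near 1 means the score barely shifts your belief either way; an LR in the hundreds means the score is far better explained by a true match. That translation, not the raw score, is what belongs in an evidentiary statement.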


The Bottom Line

The real risk with facial recognition isn't that the technology fails. It's that the technology succeeds just enough to look trustworthy — while hiding every factor that determines whether you should actually trust it.

So remember three things. A confidence score reflects image quality as much as it reflects identity. Every threshold setting is a human choice about acceptable risk, not a scientific constant. And between the algorithm's output and a reliable result, there are quality checks, calibration steps, and human judgment calls that most people never see. Whether you're evaluating these systems professionally or you're just someone whose face exists in photos online, knowing those hidden steps exist is how you start asking the right questions instead of accepting the wrong answers. The written version goes deeper — link's below.
