"94% Accurate" Means Nothing — And Europe Just Made It Illegal to Pretend Otherwise
For years, the big question about any AI system was simple: does it work? Give it a test run, check the accuracy score, and if the numbers look good, ship it. That was the whole game. And honestly, it made a certain kind of sense — until you realize that "works great in testing" and "works fairly for everyone, in every condition, without quietly discriminating against certain people" are two completely different things.
A new European law is forcing AI companies to stop just proving their systems are accurate and start proving they were tested responsibly — and if that sounds boring, wait until you see the fines.
Here's the thing that should stop you cold: an AI system can score a 95% accuracy rate and still be systematically wrong about certain groups of people. It can feel confident — even be confident — and still be broken in ways that only show up when something goes badly wrong for a real person. And until very recently, nobody was legally required to check for that. Nobody had to keep records proving they looked. The algorithm decided, the company shrugged, and if you were the person on the wrong end of that decision? Good luck.
That's the thing that's quietly changing right now. Not with fanfare, not with a flashy product launch — with a law. A boring, detailed, enormously consequential European law called the EU AI Act.
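To make the earlier point about a 95% accuracy rate concrete, here is a minimal sketch of how a strong headline number can hide a failing subgroup. The group names, group sizes, and error counts are invented for illustration; they are not data from any real system.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for (group, was_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, was_correct in records:
        totals[group] += 1
        correct[group] += int(was_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group

# Hypothetical test set: 900 people in group A, 100 people in group B.
records = (
    [("A", True)] * 891 + [("A", False)] * 9      # group A: 891 of 900 correct
    + [("B", True)] * 59 + [("B", False)] * 41    # group B: 59 of 100 correct
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.0%}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.0%}")
```

In this made-up case the overall figure is 95%, while group B is right only 59% of the time. The headline number is dominated by the larger group, which is why a per-group breakdown has to be computed and recorded separately.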
Why "It Works" Is No Longer a Good Enough Answer
The EU AI Act doesn't just ask whether an AI system performs well. It asks something harder: how do you know? And more specifically — can you prove how you know, with documentation, test logs, and a clear record of who was responsible at every step?
According to Applause, a software quality and testing firm that works with organizations preparing for compliance, the Act forces a fundamental shift in how AI tools are evaluated — away from "does this produce good results?" and toward "can we show this was built and tested without hidden flaws?" That's not a small ask. It's a total rethink of what "good enough" even means.
The law targets what it calls "high-risk AI": systems that make or influence decisions about your job, your ability to travel, your access to loans or benefits, or your identity. If an AI touches any of that, it now has to meet three concrete requirements before it goes anywhere near a real person.
The Three Things High-Risk AI Must Now Prove
- 🧪 Testing records — documented proof of how the system was tested, including for bias and failure conditions
- ⚠️ Risk controls — evidence that someone actively looked for where it could go wrong before deployment
- 👁️ Human oversight — proof that a real person can understand, question, and override the AI's decisions
None of those things sound glamorous. That's exactly the point. The most important safety systems in your life — the brakes in your car, the circuit breakers in your walls — are also the least exciting. They exist to handle the moments when everything goes sideways.
Breaking Things on Purpose (Before They Break in Real Life)
One of the most interesting requirements under this framework is something called adversarial testing, or "red teaming." The idea is exactly what it sounds like: you hire people — or design automated processes — to try to break your AI system before it ever meets the public.
Not light prodding. Deliberate, creative attacks on the system's weaknesses. Feed it edge-case data. Show it inputs from demographics that were underrepresented in its training data. Throw difficult conditions at it (bad lighting, poor audio quality, ambiguous situations) and watch where it stumbles. For a facial recognition tool, that means testing at extreme angles, with glasses or masks, across a full range of skin tones and ages, in the kind of chaotic real-world conditions the clean training data never captured.
Here's why that matters for you specifically: a system that was only ever tested on the easy cases will fail on the hard ones. And the hard cases — the ones where lighting is bad, or where someone looks different from the demographic the training data skewed toward — are often the cases involving real people with the most at stake. Adversarial testing is how you find those gaps before a person pays the price for them.
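The idea of deliberately testing under hard conditions can be sketched in a few lines. Everything here is an assumption made for illustration: `toy_matcher` is a stand-in that simply fails in poor lighting, and the condition names, brightness ranges, and the 95% pass threshold are invented. A real red-team exercise would use a real model and real degraded inputs.

```python
import random

def toy_matcher(brightness):
    """Stand-in for a real model (hypothetical): only succeeds in decent lighting."""
    return brightness >= 0.4

def stress_test(model, conditions, trials=1000, seed=0):
    """Run the model under each named condition and return its pass rate per condition."""
    rng = random.Random(seed)  # fixed seed so the test log is reproducible
    results = {}
    for name, (low, high) in conditions.items():
        passed = sum(model(rng.uniform(low, high)) for _ in range(trials))
        results[name] = passed / trials
    return results

# Hypothetical brightness ranges, from easy to hard.
conditions = {
    "studio lighting": (0.7, 1.0),
    "dim hallway": (0.3, 0.5),
    "street at night": (0.05, 0.35),
}

report = stress_test(toy_matcher, conditions)
# Flag every condition that falls below an assumed 95% pass threshold.
weak_spots = [name for name, rate in report.items() if rate < 0.95]

for name, rate in report.items():
    print(f"{name}: {rate:.0%} pass rate")
print("needs attention:", weak_spots)
```

A system tested only under the first condition would look flawless. The weak spots only appear once the harder conditions are part of the test plan, and the saved report is the kind of record that shows someone went looking.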
Think of it like the difference between a car that passed one final road test and a car that went through months of crash simulations, extreme weather tests, and brake failure scenarios. Both cars might feel fine on a sunny Tuesday. Only one of them was designed to survive the unexpected.
The penalty for getting this wrong is steep: for the most serious violations, up to €35 million or 7% of a company's total global annual revenue, whichever is larger. That is what turns "responsible testing" from a nice idea into a business survival question. According to the European Commission's official AI regulatory framework, full enforcement for high-risk systems rolls in from late 2026 through mid-2027. That window is closing. Right now, every company building AI tools that affect people's lives is making a choice: document everything and comply, or exit European markets entirely.
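The "whichever is larger" rule is simple arithmetic. A minimal sketch using the two figures quoted above follows; the function name and the revenue amounts are invented for illustration, and this computes only the ceiling, not what a regulator would actually impose in a given case.

```python
def max_fine_eur(global_annual_revenue_eur, fixed_cap_eur=35_000_000, revenue_percent=7):
    """Upper bound on the fine: the larger of the fixed cap and a share of revenue.

    Integer arithmetic (whole euros) avoids floating-point rounding.
    """
    revenue_based = global_annual_revenue_eur * revenue_percent // 100
    return max(fixed_cap_eur, revenue_based)

# Hypothetical companies of different sizes.
for revenue in (100_000_000, 500_000_000, 2_000_000_000):
    print(f"revenue €{revenue:,} -> ceiling €{max_fine_eur(revenue):,}")
```

Below €500 million in revenue the fixed €35 million figure is the larger one; above it, the 7% share takes over, so the ceiling keeps growing with the size of the company.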
The Confidence Score Trap
Here's the misconception that catches almost everyone, including plenty of people who work in tech.
When an AI system tells you it's "94% confident" in a match or a decision, that sounds like good news. Ninety-four percent! That's an A. We've been trained since school to trust high scores. So it's completely understandable that people treat a high confidence number as proof that the system is reliable.
But confidence is what the model thinks about itself. It is not a report card on how the system was built.
A facial comparison algorithm could return 94% confidence on every single match it makes — and still be systematically biased against people with darker skin tones, or people over 60, or anyone photographed in non-standard lighting. The confidence score has no idea. It can't see its own blind spots. It was trained on data that may have skewed toward certain demographics, and it learned to be very sure of itself without ever learning to be fair.
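The gap between what a model says about itself and how often it is actually right can be measured, but only if someone records outcomes per group. The sketch below uses invented numbers (the group labels and counts are assumptions for illustration) in which every prediction carries the same 94% confidence.

```python
from collections import defaultdict

def confidence_vs_accuracy(predictions):
    """For (group, confidence, was_correct) rows, return
    {group: (mean_confidence, accuracy)} so the two can be compared."""
    rows = defaultdict(list)
    for group, confidence, was_correct in predictions:
        rows[group].append((confidence, was_correct))
    summary = {}
    for group, items in rows.items():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        summary[group] = (mean_conf, accuracy)
    return summary

# Hypothetical results: the model reports 0.94 confidence on every single match.
predictions = (
    [("under 60", 0.94, True)] * 188 + [("under 60", 0.94, False)] * 12
    + [("over 60", 0.94, True)] * 35 + [("over 60", 0.94, False)] * 15
)

by_group = confidence_vs_accuracy(predictions)
for group, (conf, acc) in by_group.items():
    print(f"{group}: says {conf:.0%} confident, actually right {acc:.0%} of the time")
```

In this toy data the confidence is identical for both groups, yet the model is right 94% of the time for one and 70% of the time for the other. Nothing in the confidence score itself reveals that difference.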
"Testing for basic technical functionality is no longer enough — the Act explicitly targets algorithmic bias and safety flaws." — Applause, EU AI Act: A Practical Guide for QA Leaders
That's the shift the law is forcing. A confidence score is a number the AI generated about itself. A testing record is evidence that actual humans went looking for the ways it fails — and documented what they found. Those are not the same thing, and they never were.
At CaraComp, this distinction sits at the center of how we think about facial recognition as a tool. A match score tells you what the algorithm decided. A documented testing history tells you whether that algorithm earned the right to decide. Those are two very different conversations.
The Human Override Problem Nobody Talks About
The third requirement, human oversight, sounds obvious. Of course a human should be able to override an AI decision. But the law goes further than just saying "put a human in the loop." It requires that the human actually be able to understand what the AI did and why.
That's a design requirement. It means AI tools can't just spit out a result and expect humans to rubber-stamp it. The system has to communicate its reasoning in a way a real person can evaluate. Not "95% match" and a black box. Something more like: here are the features that drove this result, here's where uncertainty exists, and here's what you should look at before you decide.
Why does this matter? Because "human in the loop" only works if the human isn't just nodding along at a number they can't interrogate. The oversight has to be real — which means the AI has to be built to support it. That's a much harder engineering problem than building something that just gives confident-sounding answers.
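One way to picture a result that supports real oversight, rather than a bare "95% match," is as a small structured record that is rendered for the reviewer. This is only a sketch: the `MatchExplanation` class, its fields, and the example factors are hypothetical, and the Act does not prescribe any particular format.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class MatchExplanation:
    score: float                     # model's similarity score, 0.0 to 1.0
    top_factors: list[str]           # what drove the result, in plain language
    uncertainty_notes: list[str] = field(default_factory=list)  # where the model is on weak ground

def render_for_reviewer(explanation: MatchExplanation) -> str:
    """Turn a model result into text a human reviewer can question."""
    lines = [f"Model score: {explanation.score:.0%} (a suggestion, not a decision)"]
    lines.append("What drove this result:")
    lines += [f"  - {factor}" for factor in explanation.top_factors]
    lines.append("Check before deciding:")
    notes = explanation.uncertainty_notes or ["No specific concerns flagged by the model"]
    lines += [f"  - {note}" for note in notes]
    return "\n".join(lines)

# Hypothetical example result.
example = MatchExplanation(
    score=0.95,
    top_factors=["eye spacing", "jawline shape"],
    uncertainty_notes=["low light in the probe image", "face partly covered by glasses"],
)
summary_text = render_for_reviewer(example)
print(summary_text)
```

The design choice worth noticing is that the uncertainty notes sit next to the score, so the reviewer sees what to double-check at the same moment they see the number.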
The safest AI systems are not the ones with the highest accuracy scores. They're the ones that can show their work — with test logs, bias checks, and human-readable explanations — when those scores affect a real person's job, identity, or freedom.
The risk classification framework built into the EU AI Act scales all of this based on stakes. A recommendation algorithm that suggests movies? Minimal risk, minimal requirements. An AI that influences whether you get a job, cross a border, or receive a benefit? Maximum scrutiny, full documentation, mandatory human oversight design. The higher the stakes for the person on the receiving end, the more the company behind the tool has to prove it did the work.
And that's the real shift — the one that quietly changes everything. For years, companies could say "the algorithm decided" as if that transferred the responsibility to the machine. The EU AI Act says: you chose this algorithm. You trained it on this data. You tested it — or you didn't. You're accountable. The machine doesn't get to take the blame anymore.
- The EU AI Act shifts the focus from "does it work?" to "can you prove how it was tested, including for bias and failure modes?"
- Adversarial testing and red teaming intentionally push AI systems into edge cases so weaknesses show up before they harm real people.
- High confidence scores can hide systematic bias — only documented, diverse testing can reveal who an AI system actually works for.
- Human oversight under the Act means humans must understand and question AI decisions, not just approve whatever the model outputs.
- The Act's steepest fines, up to €35 million or 7% of global revenue, make thorough testing records and audit trails a business necessity.
So next time you hear that an AI system is "highly accurate," ask the follow-up question that actually matters: accurate for whom, under what conditions, and where's the paperwork? Because impressive results and proven testing are not the same credential — and from late 2026 onward, at least in Europe, one of them is required by law.
