"I Saw It on Video" Is Now the Most Dangerous Phrase in Any Investigation

"I Saw It on Video" Is Now the Most Dangerous Phrase in Any Investigation

In February 2024, a finance employee at Arup — one of the world's most respected engineering firms — sat in on a video conference call with the company's CFO and several senior colleagues. The conversation felt normal. The faces were familiar. The voices were right. At the end of the call, he authorized a wire transfer of $25 million.

None of those executives were real. Every face on that call was a deepfake — a real-time synthetic composite built from publicly available footage. The entire "meeting" was a coordinated fraud. And the thing that made it work wasn't sophisticated hacking or elaborate social engineering. It was the simple, ancient human instinct to trust what we see.

TL;DR

Modern deepfakes are good enough to fool trained humans in real time — which means video and audio are now evidence of appearance, not identity, and every investigator needs structured facial comparison to tell the difference.

The Myth That Used to Be a Fact

Here's why this misconception is so forgivable: it was correct for most of human history. For decades, video was expensive to produce, nearly impossible to fake convincingly, and treated by courts as strong corroborating evidence. Investigators built entire methodologies around visual observation. If you saw someone's face on a recording — especially a live feed — that was about as close to certainty as the job allowed.

That era ended quietly, and most people missed it.

The tools required to clone a convincing likeness are now accessible, fast, and cheap. Security researchers at McAfee confirmed that as little as three seconds of audio is enough to produce a voice clone that most listeners cannot distinguish from the real person. Three seconds. That's shorter than a sneeze. And voice is actually the harder problem — faces are easier, because the training datasets are larger and the visual outputs are easier to evaluate during generation.

1,740%
increase in deepfake incidents reported across North America

That number isn't a rounding error. A 1,740% increase means deepfake fraud has gone from a theoretical threat to a primary attack vector within the span of a few years. And according to the Deloitte Center for Financial Services, losses from AI-enabled fraud in the United States alone are projected to reach $40 billion by 2027. This isn't a niche problem for cryptocurrency exchanges or offshore transactions. It's showing up in boardrooms, legal proceedings, and insurance claims.


Why Your Brain Can't Catch This — Even When It's Trying

The uncomfortable truth is that the human visual system was not built for this problem. We evolved to recognize faces in real-world conditions: consistent lighting, three-dimensional depth, micro-expressions tied to genuine emotion and muscle movement. A deepfake exploits the fact that most of those signals, when rendered convincingly in 2D video, become indistinguishable from real footage to an untrained eye.

And "untrained" doesn't mean inexperienced. It means human.

Even investigators who've spent careers reading body language and spotting inconsistencies can't reliably identify high-quality deepfakes in real time. The brain isn't running a pixel-level forensic analysis during a video call — it's pattern-matching against memory, filling in gaps with expectation, and generally trying to keep up with the conversation. Deepfake generators are specifically optimized to stay within the envelope of what looks "normal enough" to pass that fast, intuitive check.

"Employees are often untrained to identify deepfakes, particularly as they become increasingly sophisticated, and traditional verification methods, such as matching a name to a face, fail when the face and voice are synthetic." CompassMSP: AI-Generated Deepfakes Are Here to Stay

Here's an analogy that makes this click: think about how banks used to verify identity by asking customers to answer a security question — your mother's maiden name, your first pet. That worked fine when the threat was a stranger guessing. The moment criminals could look up those answers online, the system collapsed. The verification method hadn't changed. The threat had. We're in exactly that moment right now, with video as the "security question."



What "Detection" Actually Requires

So if human observation fails, can AI detection save us? Partially — but not in the way most people hope.

A Purdue University evaluation of 24 commercial, government, and academic deepfake detection systems found meaningful variation in accuracy across tools — and critically, no single-layer system performed reliably across all attack types. The best-performing systems didn't just analyze pixels. They operated across multiple detection layers simultaneously: behavioral signals (does the blink rate match what's expected?), integrity signals (are compression artifacts consistent with genuine capture?), and perceptual signals (do fine details like skin texture and hair hold up under frame-by-frame analysis?).

That level of checking matters. A deepfake that passes a pixel-level check can fail a behavioral check. One that survives behavioral analysis might collapse under compression artifact inspection. No single test is sufficient — and this is exactly why investigators who rely on a single "does it look real?" judgment are working with an incomplete toolkit.
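
To make the layered idea concrete, here is a minimal Python sketch of how such checks might be combined. The three scoring functions are hypothetical stand-ins for real detectors (none of this is a working deepfake detector), and the 0.70 thresholds are illustrative; the point is the decision rule, not the numbers.

```python
# Minimal sketch of multi-layer deepfake screening.
# The three scoring functions are hypothetical stand-ins for real detectors;
# each returns a confidence in [0, 1] that the clip is genuine.

from dataclasses import dataclass

@dataclass
class LayerResult:
    name: str
    score: float        # confidence the footage is genuine, 0.0-1.0
    threshold: float    # minimum score this layer must reach

def behavioral_score(frames) -> float:
    """Hypothetical: blink rate, head-pose dynamics, lip-sync timing."""
    return 0.92

def integrity_score(frames) -> float:
    """Hypothetical: are compression artifacts consistent with genuine capture?"""
    return 0.41

def perceptual_score(frames) -> float:
    """Hypothetical: skin texture, hair detail, frame-to-frame coherence."""
    return 0.88

def screen_clip(frames) -> tuple[bool, list[LayerResult]]:
    results = [
        LayerResult("behavioral", behavioral_score(frames), 0.70),
        LayerResult("integrity", integrity_score(frames), 0.70),
        LayerResult("perceptual", perceptual_score(frames), 0.70),
    ]
    # A clip is flagged if ANY layer falls below its threshold:
    # passing one check is not evidence of passing the others.
    passes = all(r.score >= r.threshold for r in results)
    return passes, results

if __name__ == "__main__":
    ok, layers = screen_clip(frames=None)
    for r in layers:
        print(f"{r.name:<11} score={r.score:.2f} (min {r.threshold:.2f})")
    print("verdict:", "no flags" if ok else "FLAG FOR FORENSIC REVIEW")
```

Notice that the clip above passes two layers and still gets flagged. That asymmetry is the whole point of running the layers simultaneously rather than stopping at the first "looks fine."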

The academic research reinforces this. A peer-reviewed forensic survey indexed in PubMed Central found that pixel-level analysis alone — the approach that most resembles "looking carefully at the video" — is among the least reliable methods for detecting modern synthetic media. The limitations compound when you factor in real-world variables: compressed video from a messaging app, inconsistent lighting, and low-resolution source material all degrade the signals that detection systems rely on.

What You Just Learned

  • 🧠 Three seconds of audio — that's all a voice clone requires, per McAfee security research. Less time than it takes to say your own name twice.
  • 🔬 Multi-layer detection is mandatory — behavioral, integrity, and perceptual checks must run simultaneously; any single-layer approach can be beaten.
  • 💡 Video is now a claim — not a confirmation. It proves someone appeared on screen, not that the person on screen was who they claimed to be.
  • 🎯 The $25M Arup attack — succeeded not because the victim was careless, but because the deepfakes were credible enough to pass live, real-time human scrutiny.

The New Standard: Treating Video as a Hypothesis

This is where the investigator's job fundamentally shifts — and where the real aha moment lives.

A forensic document examiner doesn't look at a signature and ask "does that look right?" They compare it against multiple exemplars of known provenance, check for pen pressure consistency, examine paper fiber compression, and verify ink composition. The question isn't "does it look like a signature?" — it's "can I establish, through independent evidence, that this specific person made this specific mark at this specific time?"

That's now the standard for video. A recording of someone's face is a starting hypothesis, not a closed finding. The question shifts from "does this look like them?" to "can I verify this is them through structured comparison against reference imagery of known provenance, corroborated by out-of-band evidence?"

Structured facial comparison — the kind that uses Euclidean distance analysis across verified facial landmarks rather than visual impression — becomes essential precisely because it's immune to the cognitive shortcuts that deepfakes exploit. As the team at CaraComp has documented, even a 99% accuracy claim means something very specific and contextual — the threshold settings, reference quality, and dataset composition all determine whether a comparison is actually meaningful. Facial comparison done right isn't eyeballing; it's measurement.
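
For illustration only, here is a small Python sketch of what "measurement rather than impression" looks like in code. The toy vectors and the 0.6 threshold are placeholders, not CaraComp's actual model or settings; in practice both come from a specific embedding model and a validated reference dataset.

```python
# Illustrative sketch of structured facial comparison: measure the Euclidean
# distance between two face representations instead of judging by eye.
# Vectors and threshold below are placeholders for a real model's output.

import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def compare_faces(probe: np.ndarray, reference: np.ndarray,
                  threshold: float = 0.6) -> dict:
    """Return a measured, reportable result rather than an impression."""
    d = euclidean_distance(probe, reference)
    return {
        "distance": round(d, 4),
        "threshold": threshold,
        "consistent_with_same_person": d <= threshold,
    }

# Toy vectors standing in for landmark/embedding output of a face model.
probe_vec = np.array([0.12, -0.31, 0.77, 0.05])
reference_vec = np.array([0.10, -0.29, 0.80, 0.07])

print(compare_faces(probe_vec, reference_vec))
```

The output is a number and a threshold, both of which can be documented, re-run, and challenged. That is what makes it evidence rather than opinion.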

And measurement can be cross-checked. A video call cannot.

The multi-factor framework that emerges from this looks like: structured facial comparison against reference imagery with a verified chain of custody, plus behavioral corroboration (does the communication pattern match known behavior?), plus out-of-band confirmation through a separate, previously established channel. Three independent verification streams; a deepfake cannot defeat all three simultaneously without leaving forensic traces in at least one.
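
As a hedged sketch of how that three-stream framework could be recorded in a case file, consider the following. The field names and the all-three decision rule are illustrative assumptions, not a prescribed standard.

```python
# Sketch of a three-stream verification record; structure is illustrative.

from dataclasses import dataclass

@dataclass
class VerificationRecord:
    facial_comparison_match: bool   # structured comparison vs. verified reference imagery
    behavior_consistent: bool       # communication pattern matches known behavior
    out_of_band_confirmed: bool     # confirmed via a separate, pre-established channel

    def corroborated(self) -> bool:
        # Identity is treated as verified only when all three independent
        # streams agree; any single failure triggers escalation.
        return all((self.facial_comparison_match,
                    self.behavior_consistent,
                    self.out_of_band_confirmed))

record = VerificationRecord(
    facial_comparison_match=True,
    behavior_consistent=True,
    out_of_band_confirmed=False,   # e.g., callback on a known number not yet made
)
print("verified" if record.corroborated() else "escalate: corroboration incomplete")
```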

Key Takeaway

Video and audio now prove that someone appeared to be present — not that they actually were. Every visual in an investigation is a claim that requires corroboration through structured facial comparison and independent verification, not just the judgment that it "looked real."

The skills that experienced investigators built over decades — reading behavior, spotting inconsistencies, building contextual understanding — aren't obsolete. They're just incomplete on their own now. The game didn't get easier. It added a new opponent that your eyes can't see.

So here's the question worth sitting with: When you get a key photo or video in a case, what's the first thing you do to convince yourself it's really the person you think it is? If the answer is "I look at it carefully" — that answer just became the most expensive three seconds in your investigation.

Ready to try AI-powered facial recognition?

Match faces in seconds with CaraComp. Free 7-day trial.

Start Free Trial