The $25M Deepfake Used Three AI Layers at Once — How Each One Fooled a Human

The employee noticed something was off. The CFO looked slightly strange on the video call. The expressions felt slightly delayed, the movements slightly mechanical. He noticed — and then he transferred $25 million anyway.

That detail is the whole story. Not the technology. Not the AI. The fact that a trained professional saw the artifacts, registered the wrongness, and proceeded regardless — because five other senior executives on the same call all agreed the transfer should happen. All of them were fake. Every single participant on that call, other than the victim, was an AI-generated impostor.

TL;DR

The Arup $25M deepfake attack succeeded not because the technology was undetectable, but because three simultaneous AI systems — facial mapping, voice cloning, and behavior synthesis — created just enough convincing pressure to override a human's visual suspicion.

This is what investigators need to understand: modern deepfake fraud doesn't need to be perfect. It needs to be just convincing enough to tip the balance while social pressure does the rest. The technical pipeline behind that call is more specific — and more teachable — than most coverage suggests.


Layer One: The Face Is a Skeleton of 68 Points

Before a single frame of deepfake video gets rendered, the attack begins with source material. In the Arup case, the World Economic Forum reported that attackers scraped LinkedIn profiles, press conference recordings, and YouTube appearances — all publicly available, no hacking required. The raw footage becomes training data.

What the algorithm actually does with that footage is more precise than "learns what the person looks like." It builds a geometric skeleton. Facial landmark detection maps 68 anatomical anchor points across the target face: the corners of each eye, the arcs of the eyebrows, the bridge and base of the nose, the peaks of the Cupid's bow, the outline of the lips, and the jawline traced from ear to ear. These 68 coordinates become a mathematical description of that face's geometry — the proportional distances, the angles, the depths.
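
To make that concrete, here is a minimal sketch of the landmark-extraction step using dlib's pretrained 68-point predictor. The predictor weights file is distributed separately from the library under the name used below, and the frame path is a placeholder.

```python
# Minimal sketch: extract the 68 landmark coordinates from a single video frame
# using dlib's pretrained 68-point predictor (weights ship separately from dlib).
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(frame_bgr):
    """Return 68 (x, y) tuples for the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]

frame = cv2.imread("frame_0001.png")  # placeholder frame pulled from the call recording
landmarks = extract_landmarks(frame)
```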

That skeleton then drives 3D face reconstruction. According to a comprehensive technical review of face deepfakes published on arXiv, the process involves detecting 2D landmarks in both the source and target faces, computing the 3D pose that accounts for viewpoint and expression, segmenting the face from its background using a pre-trained neural network, and then warping the source face onto the target using alignment calculated from those 3D poses. The result: the attacker's head movements drive the CFO's face.

Here's the forensic implication investigators rarely hear: extreme angles break the warping algorithm. When a face tilts beyond roughly 40 degrees, the geometric alignment between the real skull and the reconstructed surface starts to fail. Jawlines drift. Ear geometry becomes inconsistent. The hairline flickers. These are not subtle artifacts — they're structural failures that frame-by-frame analysis can surface. The victim's "something looks off" feeling was almost certainly responding to exactly this.
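
That failure mode suggests a simple triage step: estimate head pose per frame and pull the extreme-angle frames for closer review. The sketch below approximates yaw with OpenCV's solvePnP from six of the 68 landmarks; the generic 3D reference points, the pinhole-camera shortcut, and the 40-degree cutoff are illustrative assumptions rather than calibrated forensic values.

```python
# Minimal sketch: estimate head yaw from six of the 68 landmarks and flag
# frames outside the range in which face warping tends to stay consistent.
import cv2
import numpy as np

# Generic 3D reference positions (arbitrary units) for six stable landmarks.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip                 (landmark 30)
    (0.0, -330.0, -65.0),      # chin                     (landmark 8)
    (-225.0, 170.0, -135.0),   # left eye, outer corner   (landmark 36)
    (225.0, 170.0, -135.0),    # right eye, outer corner  (landmark 45)
    (-150.0, -150.0, -125.0),  # left mouth corner        (landmark 48)
    (150.0, -150.0, -125.0),   # right mouth corner       (landmark 54)
], dtype=np.float64)

def head_yaw_degrees(landmarks, frame_w, frame_h):
    """Rough yaw estimate for one frame, given the 68-point landmark list."""
    image_points = np.array([landmarks[i] for i in (30, 8, 36, 45, 48, 54)],
                            dtype=np.float64)
    # Rough pinhole approximation: focal length ~ frame width, center of frame.
    camera = np.array([[frame_w, 0, frame_w / 2],
                       [0, frame_w, frame_h / 2],
                       [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS, image_points, camera,
                                  np.zeros((4, 1)))
    if not ok:
        return None
    rot, _ = cv2.Rodrigues(rvec)
    euler = cv2.decomposeProjectionMatrix(np.hstack([rot, tvec]))[6]
    return float(euler[1])  # angles in degrees as (pitch, yaw, roll); yaw is index 1

def frames_worth_reviewing(per_frame_landmarks, frame_w, frame_h, limit_deg=40.0):
    """Indices of frames where the head turns past the warping comfort zone."""
    flagged = []
    for idx, lm in enumerate(per_frame_landmarks):
        yaw = head_yaw_degrees(lm, frame_w, frame_h) if lm else None
        if yaw is not None and abs(yaw) > limit_deg:
            flagged.append(idx)
    return flagged
```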


Layer Two: Three Seconds of Audio Is Enough

The voice track runs on a completely separate system — and the training data requirements are shockingly low.

3 seconds of audio is enough to produce an 85% voice match. (Source: McAfee AI Research)

McAfee's research found that just 3 seconds of reference audio produces an 85% voice match. High-fidelity cloning, according to ThreatLocker, requires roughly 30 seconds of clean audio. MIT researchers demonstrated high-quality speech generation from only 15 seconds of training material. The CFO of a major international engineering firm had given dozens of recorded presentations. The attackers had more than enough.

Voice cloning works by extracting a speaker's acoustic fingerprint — the specific harmonic patterns, resonance characteristics, and prosodic rhythms that make a voice recognizable — and encoding it into a neural model. New text is then synthesized in that voice in real time. The output isn't a recording of the real person; it's a mathematical prediction of what that person would sound like saying words they never said.
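
To make "acoustic fingerprint" less abstract, here is a deliberately simplified sketch that summarizes a recording as MFCC statistics and compares two recordings with cosine similarity. Production voice-cloning and speaker-verification systems rely on learned neural embeddings rather than raw MFCCs, and the file names below are placeholders, but the basic shape of the idea is the same: a compact vector in, a similarity score out.

```python
# Simplified illustration of an "acoustic fingerprint": summarize a recording
# as MFCC statistics, then compare two recordings by cosine similarity.
import librosa
import numpy as np

def acoustic_fingerprint(path, sr=16000):
    """Mean and standard deviation of 20 MFCCs: a crude, compact voice signature."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints (1.0 = identical statistics)."""
    return float(np.dot(fp_a, fp_b) / (np.linalg.norm(fp_a) * np.linalg.norm(fp_b)))

# Compare a known-genuine recording against audio captured from the questioned call.
# File names are placeholders.
reference = acoustic_fingerprint("known_genuine_presentation.wav")
questioned = acoustic_fingerprint("video_call_capture.wav")
print(similarity(reference, questioned))
```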

The result, layered over the facial deepfake, creates a perceptual double-bind. The viewer's brain is simultaneously processing visual and auditory signals that both seem to match a known person. Even when one channel feels slightly wrong, the other channel pushes back toward trust. That's not a technical coincidence — it's the attack strategy.


Layer Three: The Behavior Problem (and Why Pre-Rendering Matters)

Here's a detail that doesn't get nearly enough attention: the deepfake in the Arup case almost certainly wasn't being generated dynamically in real time. SoftwareSeni's technical breakdown of the deepfake pipeline notes that typical tools require approximately 30 minutes of processing to generate a few sentences of convincing video. Real-time synthesis at that quality level, in 2024, was not commercially available at consumer price points.

So what the attackers actually did was closer to staging a play than running an AI model. They wrote a script. They pre-rendered a library of video clips — the CFO explaining the transaction, responding to expected questions, showing agreement — and then played those clips back in sequence during the call while a human operator controlled the timing. The "live" video call was a controlled playback, not a live generation.

Think of it like an elaborate puppetry show where the puppeteer has studied their target so thoroughly — voice recordings, video appearances, facial mannerisms — that they can perform a convincing one-person play. The audience isn't looking for puppet strings. They're listening for tone, watching for familiar expressions, and trusting the context. A good puppeteer with rehearsed lines and pre-recorded audio only needs to fool the audience for ten to fifteen minutes. That's exactly the window deepfake attackers require.

"The realistic visuals and audio, combined with the presence of multiple seemingly familiar senior figures discussing the transaction, ultimately convinced the employee of the request's legitimacy." Security Boulevard, analysis of the Arup deepfake attack

Behavioral artifacts are where trained examiners still have purchase. Eyeblink research has shown that real video contains periodic, biologically consistent blinking patterns — deepfakes frequently miss the timing, producing either unnaturally regular blinks or long stretches without any. Micro-expressions, gaze tracking, and the subtle asymmetry of genuine emotional responses all carry signals that pre-rendered deepfakes struggle to replicate consistently across an extended conversation. At CaraComp, frame-level facial comparison against baseline reference footage — not a confidence score, but a methodical landmark-by-landmark forensic comparison — remains one of the few approaches that surfaces these inconsistencies reliably.
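
The blink check in particular is simple enough to sketch. The example below computes the standard eye aspect ratio from the same 68 landmarks (eye points 36 through 47) and reports long stretches with no detected blink; the closure threshold and the gap window are illustrative choices, not validated forensic parameters.

```python
# Minimal sketch of a blink-gap check based on the eye aspect ratio (EAR),
# the measure popularized by Soukupova & Cech (2016). EAR drops sharply when
# the eye closes; long stretches with no drop are worth a closer look.
import numpy as np

def eye_aspect_ratio(eye):
    """eye: six (x, y) points around one eye, in landmark order."""
    eye = np.asarray(eye, dtype=np.float64)
    vertical = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    horizontal = np.linalg.norm(eye[0] - eye[3])
    return vertical / (2.0 * horizontal)

def blink_gaps(per_frame_landmarks, fps=30.0, closed_thresh=0.21, max_gap_s=10.0):
    """Return (start, end) timestamps of stretches longer than max_gap_s with no blink."""
    gaps, last_blink = [], 0
    for i, lm in enumerate(per_frame_landmarks):
        one_eye = eye_aspect_ratio([lm[j] for j in range(36, 42)])
        other_eye = eye_aspect_ratio([lm[j] for j in range(42, 48)])
        if (one_eye + other_eye) / 2.0 < closed_thresh:  # eyes closed this frame
            if (i - last_blink) / fps > max_gap_s:
                gaps.append((last_blink / fps, i / fps))
            last_blink = i
    if (len(per_frame_landmarks) - last_blink) / fps > max_gap_s:
        gaps.append((last_blink / fps, len(per_frame_landmarks) / fps))
    return gaps
```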


The Misconception That Makes This Dangerous

Most people who hear about the Arup case walk away with a specific wrong conclusion: that the deepfake was so perfect, it was indistinguishable. This belief is both understandable and genuinely dangerous.

It's understandable because every news headline emphasizes how realistic modern deepfakes are becoming. "Indistinguishable." "Undetectable." If the story is that a trained professional at a major firm got fooled, surely the technology must be extraordinary.

But the victim explicitly said the CFO looked "a little off." He saw the artifacts. The deepfake was detectable — and it won anyway. That's a completely different threat model, and it demands a completely different response.

The attack succeeded through social engineering, not technical perfection. Five "executives" on the same call all confirming the same high-pressure transaction is not normal behavior — it's a manufactured consensus designed to override individual doubt. The deepfake only needed to be convincing enough to prevent the victim from stopping the call and making a phone call to a known number. That's a much lower bar than "undetectable."

What You Just Learned

  • 🧠 The three-layer architecture — Facial mapping (68 landmarks), voice cloning (as little as 3 seconds of audio), and behavioral pre-rendering work simultaneously, not sequentially
  • 🔬 Extreme angles break the geometry — The facial warping algorithm fails at sharp head angles, producing jaw, ear, and hairline artifacts that frame-level analysis can detect
  • 🎭 It was staged, not live — Pre-rendered clip libraries played back in sequence, not real-time AI generation — which means the "conversation" was scripted and the response range was limited
  • 💡 The deepfake didn't need to be perfect — Social pressure from multiple fake "executives" overrode the victim's visual suspicion; the technology only needed to delay skepticism for 10 minutes

The One Verification Step That Changes Everything

The economics here are worth sitting with for a moment. Deepak Gupta's detailed analysis of the Arup case estimates the attack cost less than $10,000 to execute against a $25 million target — a roughly 2,500-to-1 return ratio. Even if only one in a hundred attempts succeeds, the math works overwhelmingly in the attacker's favor. This isn't a complex nation-state capability anymore. It's commercially available technology with extraordinary ROI.

Which means the response can't be "train people to spot deepfakes." That's a losing arms race against a system specifically designed to defeat human visual judgment. The response has to be procedural friction — and specifically, friction that exploits the one thing deepfake calls cannot easily defeat: multiple independent communication channels.

A video call request for an urgent financial transfer should trigger a callback to a number stored in your contact system before the call happened — not a number provided during the call. An email confirmation to an address in your existing directory. A second approver reached through a separate channel. These steps feel inefficient. That's the point. The Arup attack worked because efficiency was prioritized over verification. Sometimes friction is security.
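
Here is a hypothetical sketch of what that friction looks like when it is enforced by a system rather than left to memory. The roles, contact directory, and dollar threshold are invented for illustration; the only point that matters is that every confirmation channel was registered before the call took place.

```python
# Hypothetical approval gate: a transfer requested on a live call is held until
# it is confirmed through channels that existed before that call. All names,
# numbers, and thresholds below are placeholders.
from dataclasses import dataclass

# Contact details stored *before* any call takes place -- never details
# supplied during the call itself.
PRE_REGISTERED_CONTACTS = {
    "cfo": {"callback_phone": "+44 20 xxxx xxxx", "email": "cfo@example.com"},
}

@dataclass
class TransferRequest:
    requester_role: str
    amount_usd: float
    callback_confirmed: bool = False          # confirmed on the pre-registered number
    second_approver_confirmed: bool = False   # reached through a separate channel

def may_execute(req: TransferRequest, threshold_usd: float = 100_000) -> bool:
    """High-value requests require both out-of-band confirmations, no exceptions."""
    if req.requester_role not in PRE_REGISTERED_CONTACTS:
        return False
    if req.amount_usd < threshold_usd:
        return req.callback_confirmed
    return req.callback_confirmed and req.second_approver_confirmed
```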

Key Takeaway

Modern deepfakes don't need to fool expert forensic analysis — they only need to hold up for ten minutes under social pressure. The defense isn't better visual detection. It's verifying high-stakes requests through a second channel that was established before the suspicious call ever started.

If a key witness or claimant only ever appears to you on video calls, there's one question worth asking yourself right now: do you have a pre-established, out-of-band way to confirm their identity that doesn't run through the same session you're already in? If the answer is no — and for most investigators, it currently is — you're relying entirely on a signal that a $10,000 AI system was specifically built to spoof.

The victim in Hong Kong noticed something was wrong. He just had no protocol for what to do when the CFO looked slightly off but five other executives all said everything was fine. That's the gap. Not the technology. The gap is the absence of a procedure that treats video-only verification as insufficient for high-stakes decisions — because, frame by frame and landmark by landmark, it increasingly is.

Ready for forensic-grade facial comparison?

2 free comparisons with full forensic reports. Results in seconds.

Run My First Search