The $25M Deepfake Used Three AI Layers at Once — How Each One Fooled a Human
Full Episode Transcript
Every person on that video call — the C.F.O., the senior executives, the colleagues nodding along — was fake. Every single one was an A.I. deepfake. And the one real human on the line transferred twenty-five million dollars.
That happened to a finance worker at Arup, the global engineering firm. And the attack probably cost less than ten thousand dollars to pull off. That's a return of roughly twenty-five hundred to one. Even if only one in a hundred attempts like this succeeds, the math still overwhelmingly favors the attacker. Which means deepfake fraud isn't some rare, exotic threat. It's economically inevitable. So how did a single video call fool a trained professional — and what actually happened under the hood?
The deepfake on that call wasn't one trick. It was three separate A.I. layers running at the same time. Layer one is facial mapping and reconstruction. Layer two is voice cloning with real-time audio synthesis. And layer three is behavioral mimicry — lip-sync, expression matching, the subtle stuff that makes a face feel alive. No single detection method catches all three at once. That's why it works.
Start with the face. Deepfake generation begins by mapping sixty-eight anatomical points across a target's face. Eyes, nose, mouth corners, jawline, hairline — each one becomes a coordinate. Those sixty-eight landmarks form a skeleton, and a neural network uses that skeleton to build a three-D reconstruction. It computes the pose of the face — both viewpoint and expression — then warps the source face onto the target. A separate network segments the face from the background and masks out occlusions. But glasses, a hand resting on a chin, a head tilted at a steep angle — those still trip up the warping algorithm. Forensic examiners can sometimes catch inconsistencies at extreme angles because the math behind that warping doesn't handle them cleanly.
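The geometric core of that warping step can be sketched with plain linear algebra. The snippet below is an illustrative toy, not any real deepfake pipeline: it fits a least-squares similarity transform (the Umeyama method) that maps a handful of hypothetical source landmarks onto target landmarks, the same kind of fit used to align one face's landmark skeleton onto another's.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform (scale, rotation, translation)
    aligning source landmarks to target landmarks (Umeyama's method)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)          # cross-covariance of the point sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against a reflection solution
    D = np.diag([1.0, d])
    R = U @ D @ Vt                            # best-fit rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return scale, R, t

# Toy "landmarks": five of the sixty-eight points, with the target set
# being a rotated, scaled, shifted copy of the source set.
rng = np.random.default_rng(0)
src = rng.random((5, 2))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
dst = 1.5 * src @ R_true.T + np.array([2.0, -1.0])

s, R, t = similarity_transform(src, dst)
warped = s * src @ R.T + t
print(np.allclose(warped, dst))   # True: the fitted warp recovers the target
print(round(s, 3))                # 1.5: the true scale is recovered
```

Real pipelines warp pixels, not just points, but the failure mode the narration describes lives here: a similarity or affine fit assumes the face is roughly rigid and frontal, which is exactly why steep head angles and occluding hands break it.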
Now the voice. How much audio do you think an attacker needs to clone someone's voice? According to McAfee's research, just three seconds of audio produces an eighty-five percent voice match. Three seconds. M.I.T. researchers demonstrated high-quality speech generation from only fifteen seconds of training data. Previous systems needed tens of hours. The Arup attackers didn't need to hack anything. They scraped LinkedIn videos, press conferences, and YouTube clips — all publicly available.
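A quick note on what a "percent match" usually means in this context: voice systems compress a recording into a fixed-length speaker embedding and compare two embeddings by cosine similarity. The snippet below is a hedged illustration with made-up random vectors, not McAfee's actual methodology; it only shows why a noisy clone of a voice embedding can still score a very high match.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
target = rng.normal(size=256)                       # stand-in: real voice embedding
clone = target + rng.normal(scale=0.4, size=256)    # stand-in: imperfect cloned voice

score = cosine_similarity(target, clone)
print(f"match score: {score:.2f}")   # well above 0.8 despite the added noise
```

The point of the toy: even a clone that distorts every dimension of the embedding a little still lands close to the target in direction, which is all a similarity score measures.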
So what about the live interaction? Wouldn't a deepfake stumble if the victim asked an unexpected question? This is where a common assumption falls apart. Current tools typically need about thirty minutes of processing to generate just a few sentences of video. That's nowhere near fast enough for a real-time back-and-forth conversation. The attackers almost certainly pre-generated a set of video clips — face, expressions, lip movements all matched to cloned audio — and then played them back in sequence during the call. They controlled the timing. They controlled the script. It was staging, not some magical real-time A.I. conversation.
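The staging described above has a simple structure, sketched below with hypothetical file names and cues (nothing here comes from the actual incident): pre-rendered clips are keyed to expected moments in the script and played back in order, so no real-time generation ever happens. The gap in the scheme is visible in the code too: an unscripted question has no clip.

```python
# Hypothetical clip library for a staged call. Every name here is invented.
PRERENDERED = {
    "greeting":    "clip_01_greeting.mp4",
    "authorize":   "clip_02_authorize_transfer.mp4",
    "reassurance": "clip_03_reassure_doubts.mp4",
    "closing":     "clip_04_wrap_up.mp4",
}

def run_call(script):
    """Play back pre-rendered clips in the order the script calls for them."""
    played = []
    for cue in script:
        clip = PRERENDERED.get(cue)
        if clip is None:
            # An unexpected question has no matching clip: the attack's
            # weak point, papered over with silence or deflection.
            played.append("<<silence / deflection>>")
        else:
            played.append(clip)
    return played

print(run_call(["greeting", "authorize", "unexpected_question", "closing"]))
```

That missing-cue branch is why security guidance keeps coming back to asking something the caller could not have scripted in advance.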
And what about biological tells — the little signals a real face produces that a fake one doesn't? Research has shown that real video contains periodic eye-blink patterns, while many deepfakes don't reproduce them accurately. Heart-rate-linked skin color changes, micro-expressions, gaze direction — all of these can expose a fake under frame-by-frame analysis. But on a live call, in real time, with a stressed employee facing a high-pressure financial decision? Those signals vanish into the noise.
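One concrete blink signal that forensic analysis uses is the eye aspect ratio: the ratio of vertical to horizontal distances between the six eye landmarks in the standard sixty-eight-point layout, which collapses toward zero when the eye closes. The coordinates below are invented purely to show the geometry, not taken from any real footage.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye aspect ratio (EAR) from six eye landmarks p1..p6:
    (|p2-p6| + |p3-p5|) / (2 * |p1-p4|). Near zero for a closed eye."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

# Invented landmark coordinates: a wide-open eye and a nearly closed one.
open_eye = np.array([[0, 0], [1, 1], [2, 1], [3, 0], [2, -1], [1, -1]], float)
closed_eye = np.array([[0, 0], [1, 0.1], [2, 0.1], [3, 0], [2, -0.1], [1, -0.1]], float)

print(round(eye_aspect_ratio(open_eye), 3))    # 0.667: eye open
print(round(eye_aspect_ratio(closed_eye), 3))  # 0.067: eye closed
```

A detector tracks this ratio frame by frame and looks for the periodic dips a real blink produces, which is exactly the pattern many generated faces fail to reproduce.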
The Bottom Line
And that brings us to what actually happened. The finance worker noticed something was off. The C.F.O.'s face looked a little wrong. The employee had doubts. But then the other participants on the call — all of them deepfakes — confirmed the transaction. Authority, urgency, and social proof overrode what the employee's own eyes were telling them. The deepfake didn't win because it was perfect. It won because it was just convincing enough to delay skepticism for about ten minutes.
That's a much lower bar than most people imagine. People believe deepfakes must be nearly flawless to succeed because every headline calls them "indistinguishable" and "hyper-realistic." But the real threshold isn't perfection. It's plausibility — just enough to keep someone from hanging up and calling back on a known number.
So the takeaway is this. A deepfake video call layers fake faces, cloned voices, and scripted behavior all at once. It doesn't need to be perfect — it just needs to be good enough to stop you from double-checking through a separate channel. And the fix isn't training your eyes to spot fakes. It's adding friction — a callback to a known number, a confirmation over email, a second verification step that a pre-rendered video clip can't answer. Sometimes slowing down is the strongest security measure you have. The full story's in the description if you want the deep dive.
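That verification step can be made concrete. The sketch below is a minimal illustration of an out-of-band challenge, with every function name invented for the example: mint a one-time code, deliver it over a separate channel such as a callback to a known number, and require it to be read back live. A clip rendered before the call started cannot know a code minted after it.

```python
import secrets

def mint_challenge():
    """Generate a one-time six-digit code to send over a separate,
    trusted channel (e.g., a callback to a known phone number)."""
    return f"{secrets.randbelow(10**6):06d}"

def verify(expected, spoken):
    """Constant-time comparison of the expected code and the read-back."""
    return secrets.compare_digest(expected, spoken)

code = mint_challenge()
print(verify(code, code))           # True: a live caller can read the code back
print(verify(code, "not-a-code"))   # False: a pre-rendered clip cannot
```

The security comes from the channel separation and the timing, not from the code itself: the challenge must be created after the call begins and delivered somewhere the attacker does not control.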
