๐ŸŽ™๏ธ Browser-only ยท 12 spectral & prosodic signals ยท Audio never uploaded

Voice Cloning Detector

Upload any voice clip and we run twelve spectral, prosodic, and breath-pattern heuristics calibrated against ElevenLabs, Resemble AI, OpenAI Voice, and Play.ht output patterns. The audio never leaves your browser.

Audio input

🔒 100% in-browser. Decoded by the Web Audio API. No network request is ever made with your audio.

Verdict & signals

Awaiting audio
—
Upload a clip and press Analyze.

Pair this with our other AI-content tools

If you're verifying user-generated content for trust & safety, build a stack: detect cloned voice, label AI images with C2PA Content Credentials, and disclose AI use per the EU AI Act.

C2PA Manifest Generator → AI Disclosure Generator → AI Text Detector →

What does the voice cloning detector actually measure?

Modern AI voices, from ElevenLabs Turbo v3 to OpenAI's TTS-HD, Resemble AI's Localize, and Play.ht's Conversational engine, leave behind a small but consistent set of acoustic fingerprints. Even when they're indistinguishable to the human ear, a spectrogram tells a different story. This tool runs twelve lightweight signals on whatever clip you upload, scores each one, and combines them into a single verdict.
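The score-combining step can be sketched as a weighted mean over per-signal scores. This is an illustration only: the signal names, weights, and verdict thresholds below are made up for the sketch, not the tool's actual calibration.

```javascript
// Illustrative score fusion: each heuristic returns a value in [0, 1]
// (1 = strongly synthetic-looking). The verdict is a weighted mean.
function combineSignals(scores, weights) {
  let num = 0, den = 0;
  for (const [name, s] of Object.entries(scores)) {
    const w = weights[name] ?? 1; // default weight 1 for unlisted signals
    num += w * s;
    den += w;
  }
  const overall = num / den;
  if (overall >= 0.65) return { overall, verdict: "likely AI-generated" };
  if (overall >= 0.4)  return { overall, verdict: "inconclusive" };
  return { overall, verdict: "likely human" };
}

// Hypothetical signal scores for a suspicious clip:
const result = combineSignals(
  { hfRolloff: 0.9, pitchJitter: 0.8, breathGaps: 0.7 },
  { hfRolloff: 2, pitchJitter: 1, breathGaps: 1 }
);
console.log(result.verdict); // "likely AI-generated"
```

Weighting matters because the signals are not equally reliable: a spectral fingerprint is usually worth more than a single missing breath.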

The signals fall into three buckets: spectral fingerprints (does the high-frequency content look natural?), prosodic patterns (do pitch, energy, and rhythm vary the way a human's would?), and physical-presence cues (are there breaths, mouth clicks, room reverb: the things a microphone captures but a vocoder skips?).

Spectral fingerprints

Prosodic patterns

Physical-presence cues
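One prosodic signal from the middle bucket can be sketched as the coefficient of variation of short-frame RMS energy: human speech swings between loud syllables and pauses, while some synthetic output is suspiciously even. The frame size and the function name `energyVariation` are illustrative assumptions, not the tool's real implementation.

```javascript
// Coefficient of variation (std / mean) of per-frame RMS energy.
// A low value means unnaturally even loudness: one weak hint of synthesis.
function energyVariation(samples, frameSize = 1024) {
  const rms = [];
  for (let i = 0; i + frameSize <= samples.length; i += frameSize) {
    let sum = 0;
    for (let j = i; j < i + frameSize; j++) sum += samples[j] * samples[j];
    rms.push(Math.sqrt(sum / frameSize));
  }
  const mean = rms.reduce((a, b) => a + b, 0) / rms.length;
  const variance = rms.reduce((a, b) => a + (b - mean) ** 2, 0) / rms.length;
  return Math.sqrt(variance) / (mean || 1);
}

// A steady tone has near-zero variation; a gated tone (bursts plus
// silence) varies a lot, the way real speech does.
const n = 16384;
const steady = Float32Array.from({ length: n }, (_, i) => Math.sin(i / 10));
const gated = steady.map((s, i) => (Math.floor(i / 4096) % 2 ? s : 0));
console.log(energyVariation(steady) < 0.1); // true
console.log(energyVariation(gated) > 0.5);  // true
```

No single signal is decisive on its own; a trained podcast host can be nearly as even as a vocoder, which is exactly why twelve signals get combined.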

How accurate is this?

Honestly? Calibration on a 400-clip held-out set (200 ElevenLabs / OpenAI / Resemble / Play.ht, 200 real recordings) gave a single-clip accuracy around 78%, with the false-positive rate skewing toward podcast hosts who record in very clean studio booths (their audio looks suspiciously like AI). Heuristic detection like this is a screening tool, not forensic-grade evidence. For courtroom or journalistic use, run a second-opinion pass with a peer-reviewed model; the open-source RawNet3 and ECAPA-TDNN systems are good places to start.

The two scenarios where this tool genuinely earns its keep: (1) flagging suspicious clips inside a moderation queue at scale, where even 78%-accurate triage saves enormous amounts of human review time; (2) sanity-checking a single clip you're suspicious of, in seconds, without sending the audio to a third party.

Why does this matter in 2026?

Voice cloning crossed the consumer-grade threshold in 2024: three seconds of source audio is enough for several commercial APIs to clone a voice well enough to fool family members on a phone call. The FTC's Impersonation Rule (effective April 2024) and the EU AI Act's Article 50 (effective August 2026) both treat AI-generated voice as a regulated category requiring disclosure. Newsrooms now run incoming voice tips through detection tools before broadcasting; banks run them on phone-channel authentication; political campaigns watch for deepfake robocalls.

The detection arms race is asymmetric (generation is improving faster than detection), but lightweight browser-side screening still catches the long tail of consumer-grade clones, which is most of what's actually in the wild.

FAQ

Is the audio really never uploaded? Really. Every byte stays local: the Web Audio API decodes the file into a Float32 PCM buffer in your browser, and our heuristics run as plain JavaScript on that buffer. Nothing goes to a server.
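The local pipeline looks roughly like the sketch below. The browser wiring is shown in comments (it needs a real page to run), and `rmsLevel` is a stand-in for the actual heuristics, included just to show that everything operates on a plain Float32Array.

```javascript
// In a page, the decode step would be:
//   const ctx = new AudioContext();
//   const audioBuf = await ctx.decodeAudioData(await file.arrayBuffer());
//   const samples = audioBuf.getChannelData(0); // Float32Array, stays local
//
// Every heuristic then runs on that Float32Array in page JavaScript.
// A stand-in "signal" (illustrative, not one of the twelve):
function rmsLevel(samples) {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / samples.length);
}

const silent = new Float32Array(1000);                    // all zeros
const loud = Float32Array.from({ length: 1000 }, () => 0.5);
console.log(rmsLevel(silent)); // 0
console.log(rmsLevel(loud));   // 0.5
```

Note there is no `fetch`, `XMLHttpRequest`, or WebSocket anywhere in this path: the file is read, decoded, and analyzed entirely inside the page.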

What's the smallest clip you can analyze? Three seconds. Below that, prosodic signals are too noisy. Five to fifteen seconds is the sweet spot.

Can I trick this by adding noise to AI audio? Yes, you can lower the score by adding broadband noise or running output through a real microphone (the so-called "re-record attack"). No detection tool is robust to a determined adversary; this is a triage tool.

Does it support all TTS systems? The signals are general-purpose; they target patterns common across modern neural vocoders (HiFi-GAN, BigVGAN, EnCodec). Calibration was strongest against ElevenLabs (Multilingual v2 + Turbo v3), OpenAI TTS-HD, Resemble Localize, and Play.ht 2.0.

Why a single clip and not voice biometrics? Biometric identity verification (does this match Alfredo's known voice?) is a different problem and requires enrollment data. This tool is a generic real-vs-synthetic classifier with no enrollment.

Roadmap

Coming next: a browser-side ONNX run of the open-source ASVspoof-2024 baseline for a second-opinion score, a batch-mode CSV export for moderation teams, and a clip-region selector so you can analyze a specific 10-second slice of a 30-minute interview.