What does the voice cloning detector actually measure?
Modern AI voices, from ElevenLabs Turbo v3 to OpenAI's TTS-HD, Resemble AI's Localize, and Play.ht's Conversational engine, leave behind a small but consistent set of acoustic fingerprints. Even when they're indistinguishable to the human ear, a spectrogram tells a different story. This tool runs twelve lightweight signals on whatever clip you upload, scores each one, and combines them into a single verdict.
The signals fall into three buckets: spectral fingerprints (does the high-frequency content look natural?), prosodic patterns (do pitch, energy, and rhythm vary the way a human's would?), and physical-presence cues (are there breaths, mouth clicks, and room reverb, the things a microphone captures but a vocoder skips?).
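Once each of the twelve signals has produced a score, the verdict is a simple aggregation step. Here is a minimal TypeScript sketch of what that combination could look like; the signal names, weights, and thresholds are illustrative assumptions, not the tool's actual calibration.

```ts
// Combine per-signal scores (0 = human-like, 1 = synthetic-like) into one verdict.
// Names, weights, and thresholds are illustrative, not the tool's real calibration.
interface SignalScore {
  name: string;   // e.g. "hfRolloff", "pitchEntropy", "breathSpacing"
  score: number;  // 0..1, higher = more synthetic-looking
  weight: number; // relative importance from calibration
}

function combineScores(signals: SignalScore[]): { score: number; verdict: string } {
  const totalWeight = signals.reduce((sum, s) => sum + s.weight, 0);
  const score =
    signals.reduce((sum, s) => sum + s.score * s.weight, 0) / totalWeight;
  const verdict =
    score > 0.65 ? "likely AI-generated" :
    score < 0.35 ? "likely human" :
    "inconclusive";
  return { score, verdict };
}
```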
Spectral fingerprints
- High-frequency rolloff. Most TTS systems generate clean audio with a sharp cutoff around 8–11 kHz, since the underlying mel-spectrogram only extends that far. Real microphones capture content all the way to 20 kHz, plus broadband self-noise.
- Spectral flatness. AI spectra are smoother and more tonal than real ones, because the vocoder optimizes for perceptual quality rather than a natural distribution; this and the rolloff check are sketched in code after the list.
- Quantization artifacts. Some neural vocoders introduce subtle stair-step patterns visible in the bottom 100 Hz when the mel-to-waveform model interpolates poorly.
- Background floor variance. Recorded audio has a noise floor that drifts a little; AI silence is often too silent, with effectively zero variance between speech bursts.
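A minimal sketch of the first two checks, assuming a mono Float32 PCM buffer like the one the Web Audio API produces. The 12 kHz cutoff, single-frame analysis, and naive DFT are simplifications for readability; a real pass would use an FFT library and average over many frames.

```ts
// Power spectrum of one frame via a naive DFT (kept dependency-free for the sketch).
function powerSpectrum(frame: Float32Array): Float64Array {
  const n = frame.length;
  const bins = new Float64Array(n / 2);
  for (let k = 0; k < n / 2; k++) {
    let re = 0, im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (-2 * Math.PI * k * t) / n;
      re += frame[t] * Math.cos(angle);
      im += frame[t] * Math.sin(angle);
    }
    bins[k] = re * re + im * im;
  }
  return bins;
}

// Fraction of spectral energy above `cutoffHz`; near-zero values suggest the
// hard band limit typical of mel-spectrogram vocoders. 12 kHz is an assumption.
function highFrequencyEnergyRatio(
  frame: Float32Array, sampleRate: number, cutoffHz = 12_000
): number {
  const spectrum = powerSpectrum(frame);
  const binHz = sampleRate / 2 / spectrum.length;
  const cutoffBin = Math.floor(cutoffHz / binHz);
  let above = 0, total = 0;
  spectrum.forEach((p, i) => { total += p; if (i >= cutoffBin) above += p; });
  return total > 0 ? above / total : 0;
}

// Spectral flatness: geometric mean / arithmetic mean of the power spectrum.
// Lower values mean a smoother, more tonal spectrum.
function spectralFlatness(frame: Float32Array): number {
  const spectrum = powerSpectrum(frame).map(p => p + 1e-12);
  const logSum = spectrum.reduce((s, p) => s + Math.log(p), 0);
  const mean = spectrum.reduce((s, p) => s + p, 0) / spectrum.length;
  return Math.exp(logSum / spectrum.length) / mean;
}
```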
Prosodic patterns
- Pitch contour entropy. Human pitch wanders: rising at the ends of clauses, drifting mid-word, jittering with vocal-fold biology. AI pitch is too regular, with fewer micro-perturbations per second (a pitch-tracking sketch follows this list).
- Energy envelope. The volume curve of natural speech has bursts and ramps with distinct attack/decay shapes. Synthetic speech tends toward symmetric envelopes.
- Speaking-rate variance. People speed up on familiar phrases and slow on hard ones. TTS keeps a steadier pace unless explicitly prompted.
- Vowel-formant stability. The first two formants in each vowel jitter about ±20 Hz in real speech; AI voices lock them in, producing a "too clean" tone.
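As a concrete illustration of the pitch-contour signal, here is a sketch of frame-level pitch tracking by autocorrelation followed by a simple variability measure. The frame size, the 50–400 Hz search range, and the coefficient-of-variation summary are illustrative assumptions, not the tool's exact parameters.

```ts
// Crude per-frame pitch estimate by autocorrelation; returns null for
// unvoiced/silent frames. Search range and voicing threshold are assumptions.
function estimatePitchHz(frame: Float32Array, sampleRate: number): number | null {
  const minLag = Math.floor(sampleRate / 400); // 400 Hz upper bound
  const maxLag = Math.floor(sampleRate / 50);  // 50 Hz lower bound
  let bestLag = 0, bestCorr = 0;
  for (let lag = minLag; lag <= maxLag && lag < frame.length; lag++) {
    let corr = 0;
    for (let i = 0; i + lag < frame.length; i++) corr += frame[i] * frame[i + lag];
    if (corr > bestCorr) { bestCorr = corr; bestLag = lag; }
  }
  let energy = 0;
  for (const s of frame) energy += s * s;
  if (energy === 0 || bestCorr / energy < 0.3) return null; // no clear periodicity
  return sampleRate / bestLag;
}

// Coefficient of variation of the pitch contour; human speech typically shows
// more frame-to-frame wander than synthetic speech.
function pitchContourVariability(
  samples: Float32Array, sampleRate: number, frameSize = 2048
): number {
  const pitches: number[] = [];
  for (let start = 0; start + frameSize <= samples.length; start += frameSize) {
    const p = estimatePitchHz(samples.subarray(start, start + frameSize), sampleRate);
    if (p !== null) pitches.push(p);
  }
  if (pitches.length < 2) return 0;
  const mean = pitches.reduce((a, b) => a + b, 0) / pitches.length;
  const variance =
    pitches.reduce((a, b) => a + (b - mean) ** 2, 0) / pitches.length;
  return Math.sqrt(variance) / mean;
}
```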
Physical-presence cues
- Breath pause distribution. Humans inhale roughly every 7–15 seconds. AI voices either omit breaths entirely or insert them at suspiciously even intervals (a sketch of this check follows the list).
- Mouth-click density. Lip parts and tongue contacts produce tiny transients in real speech (200 Hz–4 kHz). They're rare in synthetic audio.
- Room reverb consistency. A real room imparts a consistent late-reverb tail; AI-generated speech often has zero reverb, or a synthesized reverb that's too symmetric across frequencies.
- Phase coherence. Some neural vocoders (especially older HiFi-GAN variants) produce phase artifacts visible in the cross-channel correlation when the file is stereo.
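The breath-pause cue is cheap to compute from the energy envelope alone. A minimal sketch, assuming a mono Float32 buffer; the -40 dB gate, 20 ms frames, and 200 ms minimum gap are illustrative thresholds, not calibrated values.

```ts
// Gate the signal on frame RMS to find silent gaps, then measure how evenly
// the gaps are spaced. Gate level, frame size, and minimum gap are assumptions.
function pauseSpacingRegularity(
  samples: Float32Array, sampleRate: number,
  frameMs = 20, gateDb = -40, minGapMs = 200
): { pauseCount: number; spacingCv: number } {
  const frameLen = Math.floor((sampleRate * frameMs) / 1000);
  const gate = Math.pow(10, gateDb / 20);
  // Mark each frame as silent or voiced by RMS level.
  const silent: boolean[] = [];
  for (let start = 0; start + frameLen <= samples.length; start += frameLen) {
    let sum = 0;
    for (let i = start; i < start + frameLen; i++) sum += samples[i] * samples[i];
    silent.push(Math.sqrt(sum / frameLen) < gate);
  }
  // Collect start times (seconds) of silent runs long enough to count as pauses.
  const minGapFrames = Math.ceil(minGapMs / frameMs);
  const pauseStarts: number[] = [];
  let runStart = -1;
  for (let i = 0; i <= silent.length; i++) {
    if (i < silent.length && silent[i]) {
      if (runStart < 0) runStart = i;
    } else if (runStart >= 0) {
      if (i - runStart >= minGapFrames) pauseStarts.push((runStart * frameMs) / 1000);
      runStart = -1;
    }
  }
  // Coefficient of variation of the intervals between pauses: humans are
  // irregular (high CV); synthetic speech has none, or suspiciously even ones.
  const intervals = pauseStarts.slice(1).map((t, i) => t - pauseStarts[i]);
  if (intervals.length < 2) return { pauseCount: pauseStarts.length, spacingCv: 0 };
  const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
  const variance =
    intervals.reduce((a, b) => a + (b - mean) ** 2, 0) / intervals.length;
  return { pauseCount: pauseStarts.length, spacingCv: Math.sqrt(variance) / mean };
}
```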
How accurate is this?
Honestly? Calibration on a 400-clip held-out set (200 ElevenLabs / OpenAI / Resemble / Play.ht, 200 real recordings) gave a single-clip accuracy around 78%, with the false positives skewing toward podcast hosts who record in very clean studio booths (their audio looks suspiciously like AI). Heuristic detection like this is a screening tool, not forensic-grade evidence. For courtroom or journalistic use, run a second-opinion pass with a peer-reviewed model; the open-source RawNet3 and ECAPA-TDNN systems are good places to start.
The two scenarios where this tool genuinely earns its keep: (1) flagging suspicious clips inside a moderation queue at scale, where even 78%-accurate triage saves enormous amounts of human review time; (2) sanity-checking a single clip you're suspicious of, in seconds, without sending the audio to a third party.
Why does this matter in 2026?
Voice cloning crossed the consumer-grade threshold in 2024: three seconds of source audio is enough for several commercial APIs to clone a voice well enough to fool family members on a phone call. The FTC's Impersonation Rule (effective April 2024) and the EU AI Act's Article 50 (effective August 2026) both treat AI-generated voice as a regulated category requiring disclosure. Newsrooms now run incoming voice tips through detection tools before broadcasting; banks run them on phone-channel authentication; political campaigns watch for deepfake robocalls.
The detection arms race is asymmetric (generation is improving faster than detection), but lightweight browser-side screening still catches the long tail of consumer-grade clones, which is most of what's actually in the wild.
FAQ
Why is the audio not uploaded? Because every byte stays local. The Web Audio API decodes the file into a Float32 PCM buffer in your browser, and our heuristics run as plain JavaScript on that buffer. Nothing goes to a server.
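A minimal sketch of that decode step, assuming a standard browser environment; `analyzeClip` is a hypothetical stand-in for the detector's heuristics, not part of any published API.

```ts
// Read the uploaded file, decode it with the Web Audio API, and hand the first
// channel's Float32 PCM samples to the analysis code. Everything stays local.
async function decodeToMono(
  file: File
): Promise<{ samples: Float32Array; sampleRate: number }> {
  const arrayBuffer = await file.arrayBuffer();
  const ctx = new AudioContext();
  const audioBuffer = await ctx.decodeAudioData(arrayBuffer);
  const samples = audioBuffer.getChannelData(0); // first channel as Float32 PCM
  await ctx.close();
  return { samples, sampleRate: audioBuffer.sampleRate };
}

// Usage (hypothetical entry point, no network request involved):
// const { samples, sampleRate } = await decodeToMono(fileInput.files![0]);
// const result = analyzeClip(samples, sampleRate);
```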
What's the smallest clip you can analyze? Three seconds. Below that, prosodic signals are too noisy. Five to fifteen seconds is the sweet spot.
Can I trick this by adding noise to AI audio? Yes, you can lower the score by adding broadband noise or running output through a real microphone (the so-called "re-record attack"). No detection tool is robust to a determined adversary; this is a triage tool.
Does it support all TTS systems? The signals are general-purpose: they target patterns common across modern neural vocoders (HiFi-GAN, BigVGAN, EnCodec). Calibration was strongest against ElevenLabs (Multilingual v2 + Turbo v3), OpenAI TTS-HD, Resemble Localize, and Play.ht 2.0.
Why a single clip and not voice biometrics? Biometric identity verification (does this match Alfredo's known voice?) is a different problem and requires enrolment data. This tool is a generic real-vs-synthetic classifier with no enrolment.
Roadmap
Coming next: a browser-side ONNX run of the open-source ASVspoof-2024 baseline for a second-opinion score, a batch-mode CSV export for moderation teams, and a clip-region selector so you can analyze a specific 10-second slice of a 30-minute interview.