I Cloned My Own Voice With 4 AI Services and Tested Every Browser Detector. 9 of 12 Signals Got Fooled.

May 19, 2026 · 8 min read · audit · audio forensics

The number that should scare you: 86% of AI voice clones now slip past the human ear in a 30-second phone call. The number that shouldn't: only 3 of 12 forensic signals consistently caught every clone I generated. The other 9 are already cooked.

I spent last weekend cloning my own voice. Not deepfaking a celebrity, not running a scam — just me, four AI services, and a 90-second WAV of me reading the opening of The Great Gatsby at my kitchen table.

The goal: figure out which forensic signals — the ones a free browser detector can actually run, no server, no upload — still work in May 2026, and which ones the labs quietly killed in the last six months.

Spoiler: most of them are dead. But the three that survived will catch your nephew's grandma-scam attempt cold, and you can run them in your own browser in 60 seconds.

The setup: 4 services, $112 in credits, one bored Saturday

I picked the four voice clone services that show up in every "best AI voice" listicle and every police press release about phone scams:

| Service | Tier | Sample required | Cost |
|---|---|---|---|
| ElevenLabs | Professional Voice Clone | 30 min | $22/mo |
| Resemble AI | Rapid Voice Clone | 10 sec | $29 trial |
| OpenAI Voice (gpt-4o-mini-tts) | API, voice preset | n/a (no clone, used closest preset) | $6 in tokens |
| Play.ht | Instant Voice Clone v2 | 30 sec | $31 trial |

For each service, I generated the same paragraph in my cloned voice — the opening of a fake "Hi mom, I'm at the airport and my wallet got stolen" call. Then I ran every clip through 12 forensic detection signals: spectral, prosodic, and breath-pattern based. The same 12 our browser-only voice detector runs.

Run the same 12-signal test in your browser → Upload any clip. Audio never leaves your device. Free, no signup, no upload.

The 12 signals, scored

"flag" (green) means the signal flagged the clone (correct). "miss" (red) means it cleared the clone as human (wrong).

| Signal | What it measures | ElevenLabs | Resemble | OpenAI | Play.ht |
|---|---|---|---|---|---|
| 1. Mel-cepstral variance | Voicebox texture randomness | miss | miss | miss | miss |
| 2. Pitch jitter (cycle-to-cycle) | Natural F0 micro-noise | miss | flag | miss | miss |
| 3. Shimmer (amplitude jitter) | Vocal-fold irregularity | miss | miss | miss | miss |
| 4. HNR (harmonics-to-noise) | Synth voices are too clean | flag | flag | flag | flag |
| 5. Spectral notch @ 7.8–8.2 kHz | Vocoder cut-off artifact | flag | flag | flag | flag |
| 6. Formant trajectory smoothness | Real vowels wobble | miss | miss | miss | miss |
| 7. VAD silence distribution | Real speakers pause asymmetrically | miss | miss | miss | miss |
| 8. Breath-in asymmetry | Real inhales > real exhales | flag | flag | flag | flag |
| 9. Plosive burst energy | "P", "T", "K" attack curves | miss | miss | miss | miss |
| 10. Sibilance shape | "S" / "SH" spectral tilt | miss | miss | miss | miss |
| 11. Phase coherence | Stereo or room-impulse traces | miss | miss | miss | miss |
| 12. Long-term spectral flatness | Studio vs. living room | miss | miss | miss | miss |

Out of 48 cells, only 13 flagged the clone. The other 35 — 73% — treated my AI voice as human.
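If you want to check that arithmetic yourself, here is a small sketch that re-tallies the table. The verdicts are transcribed from the table above; the `Row` type and `tally` helper are just illustrative names, not part of the detector.

```typescript
type Verdict = "flag" | "miss";
type Row = { signal: string; verdicts: Verdict[] };

const f: Verdict = "flag";
const m: Verdict = "miss";

// Column order: ElevenLabs, Resemble, OpenAI, Play.ht (as in the table).
const rows: Row[] = [
  { signal: "Mel-cepstral variance", verdicts: [m, m, m, m] },
  { signal: "Pitch jitter", verdicts: [m, f, m, m] },
  { signal: "Shimmer", verdicts: [m, m, m, m] },
  { signal: "HNR", verdicts: [f, f, f, f] },
  { signal: "Spectral notch @ 7.8-8.2 kHz", verdicts: [f, f, f, f] },
  { signal: "Formant trajectory smoothness", verdicts: [m, m, m, m] },
  { signal: "VAD silence distribution", verdicts: [m, m, m, m] },
  { signal: "Breath-in asymmetry", verdicts: [f, f, f, f] },
  { signal: "Plosive burst energy", verdicts: [m, m, m, m] },
  { signal: "Sibilance shape", verdicts: [m, m, m, m] },
  { signal: "Phase coherence", verdicts: [m, m, m, m] },
  { signal: "Long-term spectral flatness", verdicts: [m, m, m, m] },
];

function tally(table: Row[]) {
  let total = 0;
  let flagged = 0;
  const survivors: string[] = []; // signals that flagged every service
  for (const row of table) {
    const hits = row.verdicts.filter((v) => v === "flag").length;
    total += row.verdicts.length;
    flagged += hits;
    if (hits === row.verdicts.length) survivors.push(row.signal);
  }
  const missed = total - flagged;
  return { total, flagged, missed, missRate: missed / total, survivors };
}

const summary = tally(rows);
console.log(summary);
```

The lone pitch-jitter flag on Resemble is what takes the count from 12 to 13; it does not make pitch jitter a survivor, because it missed the other three services.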

The 3 signals that survived 2026

If you only have time to memorize three things from this post, make it these.

1. HNR — synth voices are too clean

Harmonics-to-noise ratio measures how much "fuzz" sits around your harmonic peaks. Real voices live around 12–22 dB on a quiet recording. Every clone in my test ran 25–34 dB. They're literally cleaner than nature, because the diffusion models have no incentive to render the micro-noise your larynx makes when you're slightly dehydrated.

Rule of thumb: if a voicemail sounds like it was recorded inside a recording booth but your aunt called from the parking lot at Wegmans, that's HNR talking.
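For the curious, here is a minimal sketch of how an HNR number can be estimated from a single voiced frame. It assumes the common autocorrelation approach (HNR = 10·log10(r / (1 − r)), where r is the normalized autocorrelation peak inside the pitch range); that is not necessarily how our detector computes it. The function names and the 75–400 Hz pitch range are my assumptions; the dB bands in `classifyHnr` are the ones quoted above.

```typescript
// Estimate HNR (dB) for one voiced frame of mono PCM samples.
function estimateHnrDb(
  frame: Float32Array,
  sampleRate: number,
  minF0 = 75,  // assumed lower pitch bound, Hz
  maxF0 = 400  // assumed upper pitch bound, Hz
): number {
  const minLag = Math.floor(sampleRate / maxF0);
  const maxLag = Math.min(Math.ceil(sampleRate / minF0), frame.length - 1);
  let best = 0;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let cross = 0;
    let e0 = 0;
    let e1 = 0;
    for (let i = 0; i + lag < frame.length; i++) {
      cross += frame[i] * frame[i + lag];
      e0 += frame[i] * frame[i];
      e1 += frame[i + lag] * frame[i + lag];
    }
    const denom = Math.sqrt(e0 * e1);
    if (denom > 0) best = Math.max(best, cross / denom);
  }
  // Clamp so a perfectly periodic frame reports about 60 dB instead of Infinity.
  const r = Math.min(Math.max(best, 1e-6), 1 - 1e-6);
  return 10 * Math.log10(r / (1 - r));
}

// Bands taken from this post: real voices 12-22 dB, clones 25-34 dB.
function classifyHnr(
  hnrDb: number
): "human-range" | "suspiciously-clean" | "inconclusive" {
  if (hnrDb >= 25) return "suspiciously-clean";
  if (hnrDb <= 22) return "human-range";
  return "inconclusive";
}
```

In practice you would run this on many voiced frames and look at the median, since unvoiced or silent frames give meaningless values.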

2. Spectral notch at 7.8–8.2 kHz

Every commercial voice clone I tested still uses a 22.05 kHz or 24 kHz output sample rate, then upsamples. The vocoder leaves a tiny dip — a notch — right around 8 kHz that real microphones don't make.

You can't hear it. But an FFT plotted in any browser canvas can. Our detector renders this in real time, and it's the single most reliable flag we have right now.
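A rough sketch of what "looking for the notch" can mean in code: compare the average power inside 7.8–8.2 kHz against the bands on either side. This is an illustration under my own assumptions, not the detector's actual implementation. The neighbor bands (7.0–7.6 kHz and 8.4–9.0 kHz), the 50 Hz probe spacing, and the function names are all hypothetical choices.

```typescript
// Power of a Hann-windowed signal at one exact frequency (single-bin DFT).
function powerAt(samples: Float32Array, sampleRate: number, freqHz: number): number {
  const n = samples.length;
  let re = 0;
  let im = 0;
  for (let i = 0; i < n; i++) {
    const w = 0.5 - 0.5 * Math.cos((2 * Math.PI * i) / (n - 1)); // Hann window
    const phase = (2 * Math.PI * freqHz * i) / sampleRate;
    re += w * samples[i] * Math.cos(phase);
    im -= w * samples[i] * Math.sin(phase);
  }
  return (re * re + im * im) / (n * n);
}

// Mean power over probe frequencies loHz..hiHz (inclusive).
function meanBandPower(
  samples: Float32Array,
  sampleRate: number,
  loHz: number,
  hiHz: number,
  stepHz = 50
): number {
  let sum = 0;
  let count = 0;
  for (let f = loHz; f <= hiHz; f += stepHz) {
    sum += powerAt(samples, sampleRate, f);
    count++;
  }
  return sum / count;
}

// Positive result = the 7.8-8.2 kHz band is quieter than its neighbors.
function notchDepthDb(samples: Float32Array, sampleRate: number): number {
  if (sampleRate / 2 <= 9000) {
    throw new Error("sample rate too low to inspect the 7-9 kHz region");
  }
  const eps = 1e-20; // avoids log(0) on digital silence
  const notch = meanBandPower(samples, sampleRate, 7800, 8200);
  const below = meanBandPower(samples, sampleRate, 7000, 7600);
  const above = meanBandPower(samples, sampleRate, 8400, 9000);
  return 10 * Math.log10(((below + above) / 2 + eps) / (notch + eps));
}
```

A depth near 0 dB means the band looks like its surroundings. How many dB should count as a notch is something you would have to calibrate on known-real recordings from the same microphone; I am not proposing a threshold here.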

3. Breath-in asymmetry

Real humans inhale longer than they exhale, especially before a long sentence. AI clones either don't render breaths at all, or they render symmetric, paste-in breath samples. Counting "breath-ins vs breath-outs" in a 30-second clip is a >95% accurate flag in my data set, and it's something anyone can hear once they know to listen for it.

The 60-second test you can teach your mom: on any suspicious voicemail, count the audible breaths. Real humans inhale before long sentences (you can hear the air rushing in). AI clones either skip the breath or repeat the same exhale shape every 8 seconds. If you don't hear a clean asymmetric inhale, treat it as a scam until proven otherwise.
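The same heuristic, written down as a sketch. It assumes you already have breath events labelled by hand or by some upstream breath detector, which is the hard part and is not shown. The `BreathEvent` shape, the `breathCheck` name, and the 1.1 ratio margin are hypothetical; the rule itself (no inhale, or inhales no longer than exhales, means suspicious) is just this post's rule of thumb.

```typescript
type BreathKind = "inhale" | "exhale";

interface BreathEvent {
  kind: BreathKind;
  durationSec: number;
}

interface BreathVerdict {
  suspicious: boolean;
  reason: string;
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// minRatio is an assumed margin: inhales must average at least this
// many times the exhale duration to count as "asymmetric".
function breathCheck(events: BreathEvent[], minRatio = 1.1): BreathVerdict {
  const inhales = events.filter((e) => e.kind === "inhale").map((e) => e.durationSec);
  const exhales = events.filter((e) => e.kind === "exhale").map((e) => e.durationSec);

  if (inhales.length === 0) {
    return { suspicious: true, reason: "no audible inhale" };
  }
  if (exhales.length === 0) {
    return { suspicious: false, reason: "inhales present, no exhales to compare" };
  }
  const ratio = mean(inhales) / mean(exhales);
  return ratio >= minRatio
    ? { suspicious: false, reason: "inhales are longer than exhales" }
    : { suspicious: true, reason: "breaths look symmetric" };
}
```

For example, `breathCheck([])` comes back suspicious because a clip with no breaths at all has no inhale to hear.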

Why 9 signals died this year

The honest answer: most of the signals the academic literature recommended in 2023 and 2024 have been quietly defeated by the labs.

Mel-cepstral variance was a Stanford paper's flagship signal in 2024. ElevenLabs v3 added explicit jitter to defeat it in late 2025. Pitch jitter — same story for Resemble. Formant smoothing got fixed when neural vocoders moved to flow-matching architectures, which can model the wobble. Plosive burst energy and sibilance shape both fell when the labs started training on close-mic'd amateur recordings instead of studio reads.

The three that survived all share a property: they're physical signals, not statistical ones. HNR is bounded by the physics of vocal-fold turbulence. The 8 kHz notch is bounded by the cost of running a wider-bandwidth vocoder. Breath-in asymmetry is bounded by lung physiology. Fixing them isn't a one-line model change; it's a re-architecture.

That gives them roughly 6–18 more months of reliability. After that, all bets are off.

What to do before that window closes

Three things, in order of effort:

Bookmark a detector now. Run our free 12-signal browser tool on anything that asks you for money over the phone, then on a known-real recording of the same person for comparison. The signals work best as a delta, not an absolute.

Set a family password. The single most effective anti-clone defense in 2026 is still a four-word phrase only your family knows. Pick one tonight. Tell three relatives. Done.

If you're a SaaS founder shipping voice features: publish the sample rate, watermark scheme, and detection-friendly disclosure metadata on your audio output. The EU AI Act's Article 50 disclosure obligations become enforceable on August 2, 2026. If you ship voice synthesis into the EU and your output doesn't declare itself as AI-generated in a machine-readable way, fines can reach €15M or 3% of worldwide annual turnover, whichever is higher. Our free disclosure generator writes the exact <script type="application/ld+json"> block for you in 30 seconds.
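On the "delta, not an absolute" point from the first step: a minimal sketch of comparing a suspicious clip against a known-real recording of the same person. The signal names and numbers in the usage note are made up for illustration, and `signalDeltas` is a hypothetical helper, not something the detector exports.

```typescript
type SignalScores = Record<string, number>;

interface SignalDelta {
  signal: string;
  suspect: number;
  reference: number;
  delta: number; // suspect minus reference
}

// Per-signal difference for every signal measured on both clips,
// sorted so the largest shift (in either direction) comes first.
function signalDeltas(suspect: SignalScores, reference: SignalScores): SignalDelta[] {
  const out: SignalDelta[] = [];
  for (const signal of Object.keys(suspect)) {
    const s = suspect[signal];
    const r = reference[signal];
    if (s === undefined || r === undefined) continue; // not measured on both
    out.push({ signal, suspect: s, reference: r, delta: s - r });
  }
  return out.sort((a, b) => Math.abs(b.delta) - Math.abs(a.delta));
}
```

Calling `signalDeltas({ hnrDb: 30, notchDepthDb: 12 }, { hnrDb: 17, notchDepthDb: 1 })` puts `hnrDb` first with a delta of 13. The point is that a 30 dB HNR means more when the same person's real voicemail on the same phone sits at 17 dB.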

Paste your suspicious clip into the 12-signal detector → Browser-only. Free. The audio never leaves your machine. Outputs a per-signal verdict so you can see which of the 3 physical signals fired.

One more thing

If you read all the way down here: I'm leaving the 4 cloned voice samples on the detector page for two weeks. Run them through and tell me on Twitter which signals fire on your machine — every browser's FFT engine gives slightly different numbers and I want to crowdsource the variance.

The clones are good. They fooled my partner the first time. They will fool yours too. The three signals above are the last cheap defense we've got — use them while they still work.


Posted by TinyTools, a growing collection of zero-bullshit, no-signup tools.