I spent last weekend cloning my own voice. Not deepfaking a celebrity, not running a scam — just me, four AI services, and a 90-second WAV of me reading the opening of The Great Gatsby at my kitchen table.
The goal: figure out which forensic signals — the ones a free browser detector can actually run, no server, no upload — still work in May 2026, and which ones the labs quietly killed in the last six months.
Spoiler: most of them are dead. But the three that survived will catch your nephew's grandma-scam attempt cold, and you can run them in your own browser in 60 seconds.
I picked the four voice clone services that show up in every "best AI voice" listicle and every police press release about phone scams:
| Service | Tier | Sample required | Cost |
|---|---|---|---|
| ElevenLabs | Professional Voice Clone | 30 min | $22/mo |
| Resemble AI | Rapid Voice Clone | 10 sec | $29 trial |
| OpenAI Voice (gpt-4o-mini-tts) | API, voice preset | n/a (no clone, used closest preset) | $6 in tokens |
| Play.ht | Instant Voice Clone v2 | 30 sec | $31 trial |
For each service, I generated the same paragraph in my cloned voice — the opening of a fake "Hi mom, I'm at the airport and my wallet got stolen" call. Then I ran every clip through 12 forensic detection signals: spectral, prosodic, and breath-pattern based. The same 12 our browser-only voice detector runs.
Run the same 12-signal test in your browser → Load any clip. The audio never leaves your device. Free, no signup.
In the table below, "flag" means the signal flagged the clone (correct). "miss" means it cleared the clone as human (wrong).
| Signal | What it measures | ElevenLabs | Resemble | OpenAI | Play.ht |
|---|---|---|---|---|---|
| 1. Mel-cepstral variance | Voicebox texture randomness | miss | miss | miss | miss |
| 2. Pitch jitter (cycle-to-cycle) | Natural F0 micro-noise | miss | flag | miss | miss |
| 3. Shimmer (amplitude jitter) | Vocal-fold irregularity | miss | miss | miss | miss |
| 4. HNR (harmonics-to-noise) | Synth voices are too clean | flag | flag | flag | flag |
| 5. Spectral notch @ 7.8–8.2 kHz | Vocoder cut-off artifact | flag | flag | flag | flag |
| 6. Formant trajectory smoothness | Real vowels wobble | miss | miss | miss | miss |
| 7. VAD silence distribution | Real speakers pause asymmetrically | miss | miss | miss | miss |
| 8. Breath-in asymmetry | Real inhales > real exhales | flag | flag | flag | flag |
| 9. Plosive burst energy | "P", "T", "K" attack curves | miss | miss | miss | miss |
| 10. Sibilance shape | "S" / "SH" spectral tilt | miss | miss | miss | miss |
| 11. Phase coherence | Stereo or room-impulse traces | miss | miss | miss | miss |
| 12. Long-term spectral flatness | Studio vs. living room | miss | miss | miss | miss |
Out of 48 cells, only 13 flagged the clone: 12 from the three signals that caught every service, plus one pitch-jitter hit on Resemble. The other 35 (73%) treated my AI voice as human.
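
If you want to check my arithmetic, here is a small tally of the table above. The `F`/`M` strings are just my shorthand for flag/miss in column order (ElevenLabs, Resemble, OpenAI, Play.ht); nothing here comes from the detector itself.

```python
# Tally of the results table: one string per signal, one character per
# service in the order ElevenLabs, Resemble, OpenAI, Play.ht.
SERVICES = ["ElevenLabs", "Resemble", "OpenAI", "Play.ht"]
RESULTS = {
    "Mel-cepstral variance": "MMMM",
    "Pitch jitter": "MFMM",
    "Shimmer": "MMMM",
    "HNR": "FFFF",
    "Spectral notch @ 7.8-8.2 kHz": "FFFF",
    "Formant trajectory smoothness": "MMMM",
    "VAD silence distribution": "MMMM",
    "Breath-in asymmetry": "FFFF",
    "Plosive burst energy": "MMMM",
    "Sibilance shape": "MMMM",
    "Phase coherence": "MMMM",
    "Long-term spectral flatness": "MMMM",
}

cells = len(RESULTS) * len(SERVICES)
flags = sum(row.count("F") for row in RESULTS.values())
misses = cells - flags
miss_rate = misses / cells

# Signals that flagged every single service.
universal = [name for name, row in RESULTS.items() if row == "F" * len(SERVICES)]

# How many signals caught each service.
per_service = {
    svc: sum(row[i] == "F" for row in RESULTS.values())
    for i, svc in enumerate(SERVICES)
}

print(f"{flags}/{cells} flagged, {misses} missed ({miss_rate:.0%})")
print("caught every service:", universal)
print("flags per service:", per_service)
```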
If you only have time to memorize three things from this post, make it these.
Harmonics-to-noise ratio measures how much "fuzz" sits around your harmonic peaks. Real voices live around 12–22 dB on a quiet recording. Every clone in my test ran 25–34 dB. They're literally cleaner than nature, because the diffusion models have no incentive to render the micro-noise your larynx makes when you're slightly dehydrated.
Rule of thumb: if a voicemail sounds like it was recorded inside a recording booth but your aunt called from the parking lot at Wegmans, that's HNR talking.
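
For the curious, here is a rough sketch of how an HNR number can be computed from one short voiced frame. It uses the common autocorrelation approach (strongest normalized autocorrelation peak `r`, then `10 * log10(r / (1 - r))`), which is my assumption about a reasonable method and not necessarily what our detector does. The `synth_voiced` helper and its parameters are made up purely to give the function something to chew on.

```python
import math
import random


def hnr_db(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate the harmonics-to-noise ratio (dB) of one voiced frame.

    Finds the strongest normalized autocorrelation peak r within the
    plausible pitch-period range and returns 10 * log10(r / (1 - r)).
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), n - 2)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        head = x[: n - lag]
        tail = x[lag:]
        cross = sum(a * b for a, b in zip(head, tail))
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if norm > 0.0:
            best = max(best, cross / norm)
    # Clamp so a perfectly periodic (or fully aperiodic) frame stays finite.
    best = min(max(best, 1e-6), 1.0 - 1e-6)
    return 10.0 * math.log10(best / (1.0 - best))


def synth_voiced(f0, sample_rate, n, noise_sigma, rng):
    """Toy voiced frame (hypothetical): three harmonics of f0 plus Gaussian noise."""
    return [
        sum(math.sin(2 * math.pi * k * f0 * i / sample_rate) / k for k in (1, 2, 3))
        + rng.gauss(0.0, noise_sigma)
        for i in range(n)
    ]


rng = random.Random(0)
too_clean = hnr_db(synth_voiced(160.0, 8000, 640, 0.0, rng), 8000)
breathy = hnr_db(synth_voiced(160.0, 8000, 640, 0.1, rng), 8000)
print(f"noise-free frame: {too_clean:.1f} dB, noisy frame: {breathy:.1f} dB")
```

The noise-free toy frame hits the clamp ceiling, while the noisy one lands far lower. A real measurement would average over many voiced frames and skip silence, which this sketch does not attempt.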
Every commercial voice clone I tested still uses a 22.05 kHz or 24 kHz output sample rate, then upsamples. The vocoder leaves a tiny dip — a notch — right around 8 kHz that real microphones don't make.
You can't hear it. But an FFT plotted in any browser canvas can. Our detector renders this in real time, and it's the single most reliable flag we have right now.
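
A minimal sketch of the idea, assuming you already have one frame of samples: compare the average power inside 7.8–8.2 kHz with the average power in the bands on either side. The function names, the 600 Hz flank width, and the toy test signal are my own illustrative choices, and the direct DFT here stands in for whatever FFT your browser runs.

```python
import math
import random


def bin_power(samples, k):
    """Power of DFT bin k, computed directly (no FFT library needed)."""
    n = len(samples)
    w = 2.0 * math.pi * k / n
    re = sum(s * math.cos(w * i) for i, s in enumerate(samples))
    im = sum(s * math.sin(w * i) for i, s in enumerate(samples))
    return (re * re + im * im) / (n * n)


def notch_depth_db(samples, sample_rate, notch=(7800.0, 8200.0), flank_hz=600.0):
    """How many dB quieter the notch band is than its two flanking bands."""
    n = len(samples)
    lo, hi = notch
    inside, flanks = [], []
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if lo <= freq <= hi:
            inside.append(bin_power(samples, k))
        elif lo - flank_hz <= freq <= hi + flank_hz:
            flanks.append(bin_power(samples, k))
    if not inside or not flanks:
        raise ValueError("frame too short or sample rate too low for this band")
    inside_mean = max(sum(inside) / len(inside), 1e-20)
    flank_mean = max(sum(flanks) / len(flanks), 1e-20)
    return 10.0 * math.log10(flank_mean / inside_mean)


def synth_band(sample_rate, n, notch_gain, rng):
    """Toy signal (hypothetical): tones every 50 Hz from 7 to 9 kHz, with the
    tones inside 7.8-8.2 kHz scaled by notch_gain."""
    tones = []
    for freq in range(7000, 9001, 50):
        gain = notch_gain if 7800 <= freq <= 8200 else 1.0
        tones.append((freq, gain, rng.uniform(0.0, 2.0 * math.pi)))
    return [
        sum(g * math.sin(2.0 * math.pi * f * i / sample_rate + p) for f, g, p in tones)
        for i in range(n)
    ]


rng = random.Random(1)
flat = notch_depth_db(synth_band(48000, 960, 1.0, rng), 48000)
notched = notch_depth_db(synth_band(48000, 960, 0.1, rng), 48000)
print(f"flat spectrum: {flat:.1f} dB, notched spectrum: {notched:.1f} dB")
```

On real audio you would window each frame and average over many of them before trusting the number, and the recording has to be sampled well above 16 kHz for the band to exist at all.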
Real humans inhale longer than they exhale, especially before a long sentence. AI clones either don't render breaths at all, or they render symmetric, paste-in breath samples. Counting "breath-ins vs breath-outs" in a 30-second clip is a >95% accurate flag in my data set, and it's something anyone can hear once they know to listen for it.
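
If you would rather tally than trust your ears, the check can be sketched like this. It assumes you have already marked each breath by hand as an `("in", seconds)` or `("out", seconds)` pair; the event format and the 15% tolerance are my own placeholders, not values from the detector.

```python
def breath_flag(events, symmetry_tolerance=0.15):
    """Return True when the breath pattern looks synthetic under this heuristic.

    events: list of (kind, seconds) pairs where kind is "in" or "out".
    Flags clips with no breaths at all, and clips where total inhale time
    is not clearly longer than total exhale time.
    """
    inhale = sum(sec for kind, sec in events if kind == "in")
    exhale = sum(sec for kind, sec in events if kind == "out")
    if inhale == 0 and exhale == 0:
        return True  # no breaths rendered at all
    if exhale == 0:
        return False  # inhales only: clearly asymmetric
    return inhale / exhale <= 1.0 + symmetry_tolerance


print(breath_flag([]))                            # no breaths
print(breath_flag([("in", 0.4), ("out", 0.4)]))   # symmetric, pasted-in breaths
print(breath_flag([("in", 0.9), ("out", 0.3)]))   # long inhale, short exhale
```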
Why did the other nine stop working? The honest answer: nearly every signal the academic literature recommended in 2023 and 2024 has been quietly closed by the labs.
Mel-cepstral variance was a Stanford paper's flagship signal in 2024. ElevenLabs v3 added explicit jitter to defeat it in late 2025. Pitch jitter — same story for Resemble. Formant smoothing got fixed when neural vocoders moved to flow-matching architectures, which can model the wobble. Plosive burst energy and sibilance shape both fell when the labs started training on close-mic'd amateur recordings instead of studio reads.
The three that survived all share a property: they're physical signals, not statistical ones. HNR is bounded by the physics of vocal-fold turbulence. The 8 kHz notch is bounded by the cost of running a wider-bandwidth vocoder. Breath-in asymmetry is bounded by lung physiology. Fixing them isn't a one-line model change; it's a re-architecture.
That gives them roughly 6–18 more months of reliability. After that, all bets are off.
Three things, in order of effort:
Bookmark a detector now. Run our free 12-signal browser tool on anything that asks you for money over the phone, then on a known-real recording of the same person for comparison. The signals work best as a delta, not an absolute.
Set a family password. The single most effective anti-clone defense in 2026 is still a four-word phrase only your family knows. Pick one tonight. Tell three relatives. Done.
If you're a SaaS founder shipping voice features: publish the sample rate, watermark scheme, and detection-friendly disclosure metadata on your audio output. The EU AI Act's Article 50 disclosure obligations become enforceable on August 2, 2026. If you ship voice synthesis into the EU and your output doesn't declare itself as AI-generated in a machine-readable way, fines can reach €15M or 3% of worldwide annual turnover, whichever is higher. Our free disclosure generator writes the exact <script type="application/ld+json"> block for you in 30 seconds.
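
On the "delta, not an absolute" point from the first item: the comparison is just a per-signal subtraction between the suspicious clip and a known-real recording of the same person. A tiny sketch, with made-up signal names and made-up numbers:

```python
def signal_deltas(suspect, reference):
    """Per-signal difference (suspect minus reference) for signals both clips share."""
    shared = suspect.keys() & reference.keys()
    return {name: suspect[name] - reference[name] for name in sorted(shared)}


# Hypothetical measurements, for illustration only.
suspect_clip = {"hnr_db": 29.0, "notch_depth_db": 6.0}
known_real = {"hnr_db": 17.0, "notch_depth_db": 0.5}

deltas = signal_deltas(suspect_clip, known_real)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

A big jump on the same person's voice tells you more than either number alone, because phones, rooms, and codecs shift the absolute values.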
Paste your suspicious clip into the 12-signal detector → Browser-only. Free. The audio never leaves your machine. Outputs a per-signal verdict so you can see which of the 3 physical signals fired.
If you read all the way down here: I'm leaving the 4 cloned voice samples on the detector page for two weeks. Run them through and tell me on Twitter which signals fire on your machine. Every browser's FFT engine gives slightly different numbers, and I want to crowdsource the variance.
The clones are good. They fooled my partner the first time. They will fool yours too. The three signals above are the last cheap defense we've got — use them while they still work.
Posted by TinyTools, a growing collection of zero-bullshit, no-signup tools.