ok so for my machine learning class project i decided to test some of the ai voice detection tools that are available right now. figured i'd share results here since there's not a lot of real-world testing data out there
setup: 20 audio clips total. 10 real human voice recordings (a mix of podcast clips, voice memos, and lecture recordings) and 10 generated using elevenlabs, play.ht, and bark
results by tool:
- tool A: 14/20 correct (70%). flagged 2 real recordings as ai, missed 4 ai clips
- tool B: 12/20 correct (60%). on the elevenlabs clips specifically it was actually worse than a coin flip
- tool C: 16/20 correct (80%). best of the three but still missed some obvious ones
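in case anyone wants to reproduce the scoring: each clip gets a ground-truth label plus the tool's verdict, and accuracy / false positives / false negatives fall out of a simple count. a minimal sketch, where the hardcoded pairs just mirror tool A's line above (2 real recordings flagged as ai, 4 ai clips missed), not a new dataset:

```python
def score(pairs):
    """pairs: list of (truth, verdict) tuples, each "real" or "ai"."""
    correct = sum(t == v for t, v in pairs)
    false_pos = sum(t == "real" and v == "ai" for t, v in pairs)   # real flagged as ai
    false_neg = sum(t == "ai" and v == "real" for t, v in pairs)   # ai clips missed
    return correct / len(pairs), false_pos, false_neg

# tool A's results reconstructed from the summary above
tool_a = [("real", "real")] * 8 + [("real", "ai")] * 2 \
       + [("ai", "ai")] * 6 + [("ai", "real")] * 4

acc, fp, fn = score(tool_a)
print(acc, fp, fn)  # 0.7 2 4
```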
the biggest pattern i noticed: the tools were decent at catching older tts voices (the robotic-sounding ones) but really struggled with emotional, expressive ai voices. the elevenlabs voice clone with emotion was the hardest for every tool
also, compression killed accuracy. when i ran the same clips through whatsapp compression first, every tool performed worse, which matters because that's how most voice clips are actually shared in the real world
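for anyone wanting to replicate the compression pass without a phone: whatsapp voice notes use opus at a low bitrate, so transcoding with ffmpeg is a rough stand-in. the exact bitrate and sample rate below are my assumptions, not whatsapp's actual pipeline:

```python
def voice_note_cmd(src, dst, bitrate="16k"):
    """build an ffmpeg command approximating messaging-app voice compression:
    mono, 16 kHz, low-bitrate opus. settings are assumptions, not whatsapp's."""
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1", "-ar", "16000",          # downmix to mono, 16 kHz
            "-c:a", "libopus", "-b:a", bitrate,  # opus at a voice-note-ish bitrate
            dst]

# run it with, e.g.:
#   subprocess.run(voice_note_cmd("clip.wav", "clip_compressed.ogg"), check=True)
```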
This is really useful data, thanks for sharing. the compression finding is especially important because it mirrors what we see with image detection: real-world conditions (compression, noise, format conversion) destroy the subtle signals that detection tools rely on
80% from the best tool is better than i expected, honestly. but for any kind of enforcement or legal use that's nowhere near reliable enough
Excellent methodology. The emotional-voice finding is consistent with the literature: tools trained on neutral TTS struggle with expressive synthesis. If you could test at different bitrates, I suspect there's a quality threshold below which detection becomes random.
this is why the detection approach to voice authenticity is probably doomed long term. if 80% is the ceiling with clean audio, and it drops from there under real-world conditions, you can't build reliable systems on top of that
we need provenance-based solutions for audio just like we need them for images and video: sign it at creation, verify the chain. detection as an afterthought will always lag behind generation
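to make "sign it at creation, verify the chain" concrete, here's a toy sketch using an HMAC over the raw audio bytes. real provenance schemes use public-key signatures plus an embedded manifest rather than a shared secret, so treat the key and byte strings here as purely illustrative:

```python
import hashlib
import hmac

# toy model: a capture device holds DEVICE_KEY and tags audio at record time;
# a verifier with the same key can check the clip wasn't swapped or edited.
DEVICE_KEY = b"example-device-secret"  # hypothetical key, for illustration only

def sign_clip(audio_bytes: bytes) -> str:
    return hmac.new(DEVICE_KEY, audio_bytes, hashlib.sha256).hexdigest()

def verify_clip(audio_bytes: bytes, tag: str) -> bool:
    return hmac.compare_digest(sign_clip(audio_bytes), tag)

tag = sign_clip(b"fake pcm samples")
print(verify_clip(b"fake pcm samples", tag))      # True
print(verify_clip(b"tampered pcm samples", tag))  # False
```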
@Marc_Delrieu great suggestion, i actually have the data for different bitrates but haven't analyzed it yet. my initial impression is that 128 kbps mp3 is roughly where accuracy starts tanking, but i'll do the proper analysis
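the analysis will basically be a sweep like this, looking for the highest bitrate where accuracy falls under some threshold. the accuracy values in the dict are made-up placeholders, NOT measured results:

```python
# placeholder numbers only; the real values come from re-scoring each tool per bitrate
per_bitrate = {
    "320k": 0.80,
    "192k": 0.80,
    "128k": 0.65,
    "64k":  0.55,
}

def first_bitrate_below(per_bitrate, threshold=0.6):
    """return the highest bitrate where accuracy drops under `threshold`
    (0.5 would be chance on a balanced real/ai set)."""
    for br, acc in per_bitrate.items():  # dict keeps insertion order, high -> low
        if acc < threshold:
            return br
    return None

print(first_bitrate_below(per_bitrate))  # "64k" with these placeholder numbers
```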
@HugoNomad yeah, the provenance approach makes more sense, but how do you implement it for phone calls or live audio? it's not like a photo where you can embed metadata at capture time