Tested 4 voice detectors on the same 30-sec clip - here's what happened

Ran the same 30-second cloned audio sample through 4 different AI voice detectors. Results were 12%, 47%, 89%, and 73%. Same clip. Anyone else doing comparison testing and seeing this kind of spread?

Ran a similar test last month with 6 detectors, similar spread. The way each tool handles thresholding is wildly different. The underlying probability might be similar, but the displayed scores aren't directly comparable.
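Rough illustration of what I mean, in Python. Everything here is hypothetical: the three display functions, the temperature, and the 0.8 threshold are made up for the example and are not how any specific tool works. The point is just that one underlying probability can come out as very different numbers depending on how a tool maps it to a displayed score.

```python
import math

def logit(p):
    # Log-odds of a probability strictly between 0 and 1.
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def display_raw(p):
    # Hypothetical tool A: shows the model probability as-is.
    return round(100 * p)

def display_sharpened(p, temperature=0.3):
    # Hypothetical tool B: rescales the log-odds, which pushes
    # scores away from 50 and toward 0 or 100.
    return round(100 * sigmoid(logit(p) / temperature))

def display_relative_to_threshold(p, threshold=0.8):
    # Hypothetical tool C: rescales so that 50 means "exactly at
    # the tool's internal decision threshold".
    if p >= threshold:
        return round(50 + 50 * (p - threshold) / (1 - threshold))
    return round(50 * p / threshold)

# Same assumed underlying probability, three different displayed scores.
p = 0.6
scores = {
    "raw": display_raw(p),
    "sharpened": display_sharpened(p),
    "relative": display_relative_to_threshold(p),
}
print(scores)
```

With the same input, the "relative" score lands below the raw one and the "sharpened" score lands above it, so comparing the displayed numbers across tools tells you very little.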

The field needs a shared evaluation benchmark for voice. There are a couple of academic ones starting to circulate but nothing industry-standard yet. Until then, any single tool's score is essentially uninterpretable without calibration against your own ground truth set.

Yeah, I'm going to build a small ground truth set internally and calibrate. Better than trusting raw scores.
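Minimal sketch of the calibration step, in case it's useful. Assumptions: plain Python, displayed scores on a 0-100 scale, and balanced accuracy as the thing to maximize (that's my choice, swap in whatever metric fits your audit). The detector names and all the scores below are placeholder data I made up, not results from any real tool.

```python
def best_threshold(scores, labels):
    """Pick the cutoff that maximizes balanced accuracy on a labeled set.

    scores: displayed scores from one detector (0-100)
    labels: True if the clip is known to be cloned, False if genuine
    Returns (threshold, balanced_accuracy); a clip is flagged when
    its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both cloned and genuine clips")

    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < t)
        bal_acc = 0.5 * (tp / n_pos + tn / n_neg)
        if bal_acc > best_acc:
            best_t, best_acc = t, bal_acc
    return best_t, best_acc

# Placeholder ground truth: first 4 clips cloned, last 4 genuine.
labels = [True, True, True, True, False, False, False, False]

# Made-up scores for two hypothetical detectors on the same 8 clips.
detector_a = [15, 22, 12, 30, 3, 5, 8, 10]
detector_b = [85, 92, 89, 70, 40, 55, 60, 65]

t_a, acc_a = best_threshold(detector_a, labels)
t_b, acc_b = best_threshold(detector_b, labels)
print("detector_a cutoff:", t_a, "balanced accuracy:", acc_a)
print("detector_b cutoff:", t_b, "balanced accuracy:", acc_b)
```

On this toy data the two detectors end up with very different cutoffs, so a low score on one and a high score on the other can both mean "flagged". With a real set you'd want far more than 8 clips, plus a held-out split so the cutoff isn't tuned on the same clips you evaluate on.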

Interesting. Btw, which 4 did you test? Would be helpful for the rest of us doing similar audits.