ran an experiment today. took one 800-word article (human-written, about remote work trends) and ran it through five different AI content detectors. the results:
- tool A: 12% AI probability
- tool B: 67% AI probability
- tool C: 89% AI probability
- tool D: 34% AI probability
- tool E: “mostly human”
same text. five different answers. this is supposed to be the technology universities and publishers are relying on?
the article was written by my colleague. i watched him write it. zero AI involvement. the fact that one tool flagged it at 89% while another said 12% tells you everything you need to know about the state of detection.
what is actually the best AI content detector if they cannot even agree with each other?
This is a well-documented phenomenon in detection research. Different tools use different model architectures, different training data, and, critically, different calibration thresholds. A “probability score” from one tool is not comparable to a probability score from another because they are not measuring the same thing.
The variance you observed is not unusual. Liang et al. (2023) tested major detectors against the same corpus and found similar disagreement rates. The average inter-tool agreement was below 70%, which is remarkably poor for tools being used in high-stakes decisions.
If I were evaluating detectors, I would prioritize consistency across repeated runs and transparency about confidence intervals over raw accuracy claims. A tool that gives you 40% with a stated margin of plus or minus 15 percentage points is more honest and more useful than one that gives you 89% with no uncertainty estimate.
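One way to check consistency across repeated runs is to submit the same text to one tool several times and look at how much the score moves. The sketch below is only an illustration: the function name `summarize_runs` and the five scores are made up, and it assumes the tool returns a numeric probability between 0 and 1.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Summarize repeated scores for one text from one tool.

    Returns the mean, the sample standard deviation, and the
    max-minus-min spread of the scores.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),
        "spread": max(scores) - min(scores),
    }


# Hypothetical scores: the same article submitted five times to one tool.
runs = [0.38, 0.42, 0.40, 0.41, 0.39]
summary = summarize_runs(runs)
print(summary)
```

A tool whose spread across identical inputs is large cannot support a precise-looking single number, whatever its advertised accuracy.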
This kind of experiment is exactly what more people should be doing. The detection industry has benefited from a lack of independent, comparative testing. Vendors publish their own benchmarks against cherry-picked test sets, and consumers accept the marketing claims.
I have been recommending a triangulation approach to my clients: use multiple tools and only flag content when a majority agree at high confidence. If three out of four tools say a piece is AI-generated with confidence above 80%, that is more actionable than any single tool’s output.
But even that approach has limits. Your experiment shows that triangulation would have produced a mixed verdict on confirmed human-written text. The fundamental technology is not reliable enough for definitive claims.
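As a minimal sketch of the majority rule described above, applied to the four numeric scores from the original post: the function name `triangulate` and the dictionary layout are assumptions for illustration, and tool E is left out because it reported a label ("mostly human") and no number.

```python
def triangulate(scores, confidence=0.80):
    """Flag a text only if a strict majority of tools exceed `confidence`.

    `scores` maps a tool name to its reported AI probability (0 to 1).
    """
    if not scores:
        raise ValueError("need at least one score")
    votes = sum(1 for s in scores.values() if s > confidence)
    return {
        "votes": votes,
        "tools": len(scores),
        "flag": votes > len(scores) / 2,
    }


# Numeric scores from the experiment at the top of the thread.
experiment = {"A": 0.12, "B": 0.67, "C": 0.89, "D": 0.34}
result = triangulate(experiment)
print(result)  # {'votes': 1, 'tools': 4, 'flag': False}
```

Only one of the four tools clears the 80% bar, so the rule does not flag the article. The rule gets the right answer here, but the underlying scores still range from 12% to 89% on the same human-written text.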
this is exactly why students are panicking. if a professor picks the tool that says 89% instead of the one that says 12%, a student gets accused of something they didn't do. and most students don't know enough to challenge the result
we need like a Consumer Reports for AI detectors. independent testing, public results, standardized methodology. not just marketing claims from the companies selling the tools. everyone asks what is actually the best AI detector and there's literally no honest answer because nobody is testing them independently
From a business perspective this is a trust problem for the entire detection category. If I am paying for a tool and it cannot agree with other tools on basic classification, why should I trust it with decisions that affect my business?
I run an e-commerce operation and content quality matters for our brand. We briefly considered adding AI detection to our content review process but abandoned it after running tests exactly like yours. The inconsistency was not acceptable for production use.
We went back to human editorial review. More expensive, slower, but at least the results are defensible.