Testing AI detectors specifically on academic writing - surprising results

for my research methods course i ran an informal experiment testing four major AI detectors on academic writing samples. the results were more concerning than i expected.

i collected 30 writing samples: 10 from published journal articles (definitely human-written, all from the pre-ChatGPT era), 10 from ChatGPT asked to write in an academic style, and 10 from grad students including myself.

the published journal articles were flagged as “AI-generated” or “mixed” by at least two detectors in 6 out of 10 cases. let that sink in. peer-reviewed, published research from 2019 or earlier was flagged as AI by tools that universities are using to evaluate student work.
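for anyone who wants to replicate the tally, here is a minimal sketch of how an "at least two detectors" rule could be scored. the detector names and verdicts are made-up placeholders, not the actual results, and the label set is only an assumption based on the "AI-generated" / "mixed" wording above.

```python
from typing import Dict, List

# Labels that count as a flag (assumed from the "AI-generated" / "mixed" wording).
FLAG_LABELS = {"ai-generated", "mixed"}


def is_flagged(verdicts: Dict[str, str], min_detectors: int = 2) -> bool:
    """Return True if at least `min_detectors` detectors gave a flagging label."""
    hits = sum(1 for label in verdicts.values() if label.lower() in FLAG_LABELS)
    return hits >= min_detectors


def flag_rate(samples: List[Dict[str, str]], min_detectors: int = 2) -> float:
    """Share of samples flagged under the rule above."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(is_flagged(s, min_detectors) for s in samples) / len(samples)


# Placeholder verdicts for three human-written samples (NOT real results).
# detector_a..detector_d are hypothetical names for the four tools.
toy_human_samples = [
    {"detector_a": "human", "detector_b": "mixed",
     "detector_c": "AI-generated", "detector_d": "human"},
    {"detector_a": "human", "detector_b": "human",
     "detector_c": "mixed", "detector_d": "human"},
    {"detector_a": "AI-generated", "detector_b": "AI-generated",
     "detector_c": "mixed", "detector_d": "human"},
]

# On human-written text, every flag is a false positive.
print(f"false positive rate on toy data: {flag_rate(toy_human_samples):.2f}")
```

raising `min_detectors` makes the rule stricter, so it is worth reporting the rate at more than one threshold.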

the problem is clear: formal academic writing shares structural features with AI output. precise vocabulary, disciplined paragraph structure, citation-heavy argumentation. these are exactly the features detectors associate with AI.

Your findings are consistent with the published research on this topic, particularly the work by Liang et al., who demonstrated elevated false positive rates on formal academic and non-native English writing.

The root cause is what I call the formality-detection paradox. Detectors are trained on casual AI outputs (the kind most commonly available in training datasets). Formal human writing, which follows strict conventions and disciplined structure, shares statistical properties with AI outputs that also follow predictable patterns.

The implication for academic integrity is serious. If the very writing style that academia demands triggers false positives, then the tools are systematically biased against students who write well in the expected academic register.

I hope you publish this work. Even informal studies contribute valuable evidence to the discussion.

This mirrors an exercise I ran with colleagues last semester. We submitted our own published papers to GPTZero and Originality.ai. Three out of eight papers were flagged at moderate to high AI probability.

The experience was instructive because it gave faculty a visceral understanding of what students face. If established researchers with published track records get flagged, expecting students to accept detection results without anxiety is unreasonable.

Since that exercise, my department has adopted a policy that no academic consequence can result from a detection score alone. The score can prompt a conversation, but that conversation must focus on evidence of the learning process, not the statistical output of a classification model.

this is exactly what i was afraid of. i'm a student who writes in a pretty formal style because i actually care about my papers. now i have to worry that writing well is suspicious?

seriously though, thank you for sharing this. i'm going to save this thread and show it to anyone who tells me detection tools are reliable. 6 out of 10 published papers flagged. that's a 60% false positive rate on confirmed human writing. how is anyone still trusting these tools for grading decisions?
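one caveat on the arithmetic: 10 samples is a small set, so the 60% figure comes with a wide margin. a rough sketch of a Wilson score interval for 6 flagged out of 10, assuming the samples are independent and using z = 1.96 for roughly 95% coverage (the function name is just illustrative):

```python
import math
from typing import Tuple


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (flagged / total)."""
    if total <= 0 or not 0 <= flagged <= total:
        raise ValueError("need 0 <= flagged <= total and total > 0")
    p = flagged / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(flagged=6, total=10)
print(f"observed rate: {6 / 10:.0%}, approx. 95% interval: {low:.0%} to {high:.0%}")
```

for 6 out of 10 this works out to roughly 31% to 83%, so the point estimate is 60% but the plausible range is wide with this few samples.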

This has implications beyond academia. In B2B content marketing, thought leadership articles are written in a formal, structured style very similar to academic writing. My clients have had LinkedIn articles and white papers flagged by AI detectors, which caused confusion with their audiences.

The broader lesson is that AI detectors are not measuring what they claim to measure. They are measuring text formality and structural predictability, which correlate with AI output but also with high-quality professional writing. Until detectors can distinguish between the two, they are a blunt instrument being used for precision work.