Article

LREC 2026: An In-Depth Evaluation of Machine-Generated Text Detection

Kevin Stowe

Research Scientist • May 7, 2026

CONTRIBUTORS

Kailash Patil, Director of Research, Pindrop

3 minutes read time

From May 11-16, 2026, Pindrop researchers are attending the Language Resources and Evaluation Conference (LREC) 2026, an event that brings together professionals and scholars in natural language processing, computational linguistics, and speech and multimodality to discuss advancements and research in these fields.

Spotlights and blindspots: How we evaluated machine-generated text detection

This year, we submitted a paper titled “Spotlights and Blindspots: Evaluating Machine-Generated Text Detection.”

As AI writing tools have exploded in popularity, so have the tools designed to detect machine-generated text. But here’s the problem: it’s difficult to know how well detection models actually work, because testing is inconsistent and clear measurement frameworks are lacking.

That’s why we ran a broad evaluation, testing 15 detection models from six systems and seven trained models across ten different datasets. We aimed to figure out how well these tools actually perform and what contributes to better performance. Keep reading to learn what we found.

No single model was the clear winner

When we compared models across our evaluation criteria, no single model significantly outperformed the others across the board.

An F1 score is a machine learning metric that combines a model’s precision and recall (specifically, their harmonic mean) into a single number between 0 and 1. The F1 scores for the detection models we tested ranged from roughly 0 to 0.982, depending on the dataset. That spread shows that the evaluation criteria and data you choose can significantly mask or inflate a model’s measured performance.
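For reference, here is a minimal sketch of how F1 relates to precision and recall, using scikit-learn and a handful of toy labels invented purely for illustration (they are not from our evaluation):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 1 = machine-generated, 0 = human-written
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # share of flagged texts that truly are machine-generated
recall = recall_score(y_true, y_pred)        # share of machine-generated texts that were caught

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```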

Models struggled with human-written texts

Most detection models had error rates above 15% when tested against novel, human-written text.

Models fell into two buckets:

  • Restrictive: With high recall, these models catch most machine-generated text, but they often mislabel human-written text as machine-generated.
  • Permissive: With low recall, these models rarely misclassify human-written text as machine-generated, but they also miss much of the machine-generated text.

This suggests that, in the settings we tested, detection models could not reliably tell the difference between human- and AI-written text, which raises concerns about relying on these models in real-world applications.
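To make that tradeoff concrete, here is a toy sketch (the scores, distributions, and thresholds are invented for illustration, not taken from our experiments) showing how a single detector slides from restrictive to permissive as its decision threshold moves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated detector scores: higher = "more likely machine-generated".
# The human and machine score distributions overlap here, loosely
# mirroring the difficulty described above.
human_scores = rng.normal(0.35, 0.15, 1000)
machine_scores = rng.normal(0.65, 0.15, 1000)

for threshold in (0.4, 0.5, 0.6):
    flagged_humans = np.mean(human_scores >= threshold)   # human text labeled as AI
    missed_machine = np.mean(machine_scores < threshold)  # machine text not caught
    print(f"threshold={threshold:.1f}  "
          f"human text flagged: {flagged_humans:.1%}  "
          f"machine text missed: {missed_machine:.1%}")
```

A low threshold behaves like a restrictive model (few machine-generated texts missed, many humans flagged), while a high threshold behaves like a permissive one.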

For example, if a detection model falsely identifies human-written text as AI-generated in an education setting, it could have serious academic ramifications for the student who actually wrote the text.

Em-dashes might be a tell, but features like that don’t reliably inform detection

We analyzed features that might explain detection errors: length (word count), punctuation rate, repetition rate, and perplexity. Each showed some correlation with error rates, but the correlations were inconsistent at best: no single feature reliably predicted higher error rates across all of the models we tested.
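As a rough illustration of what those surface features look like in code, here is a minimal sketch (the exact feature definitions in our paper may differ, and the example sentence is invented); perplexity is omitted because it additionally requires scoring the text with a language model:

```python
import string
from collections import Counter

def surface_features(text: str) -> dict:
    """Length, punctuation rate, and repetition rate for one text."""
    tokens = text.split()
    counts = Counter(t.lower().strip(string.punctuation) for t in tokens)
    repeated_tokens = sum(c for c in counts.values() if c > 1)
    punct_chars = sum(ch in string.punctuation for ch in text)
    return {
        "length_words": len(tokens),
        "punctuation_pct": 100 * punct_chars / max(len(text), 1),
        "repetition_pct": 100 * repeated_tokens / max(len(tokens), 1),
    }

print(surface_features("The model predicts the label, and the label is checked."))
```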

Key takeaway: We need clear, well-motivated metrics to know if a detection model is performing well

The need for accurate machine-generated text detection will only continue to rise as AI writing tools are adopted. For that reason, and based on the research above, we advise practitioners to define clear, well-motivated metrics and datasets so that models can be tested in a valid, useful, and reproducible way.

Catch up on our past AI text detection research.
