To understand how well a detector of AI-generated text works, we need to measure its accuracy on data whose source, human or AI, is known. At AI Aware, we have curated a large dataset and use a sample of it to benchmark AI detectors, so that we understand our own performance and how it compares with that of competitors.
Our dataset is based on the well-known M4 dataset (https://github.com/mbzuai-nlp/M4), which we have extended. We include text from more generators, including the recent GPT-4o (and its mini variant), Google’s Gemini, Anthropic’s Claude, Grok and Llama. We also added more text genres, including fiction and non-fiction books, medical texts, poetry, and even recipes, and we have approximately 50% human and 50% AI text in every genre.
We put this text through the AI Aware detector and several competitors, including Originality.ai, goWinston.ai, GPT Zero, Zero GPT, Ryne.ai, and Smodin.io. The main metric for AI detection is accuracy, i.e. the share of all texts for which the system gives the correct answer. We found that AI Aware is correct 98% of the time, which is the best result on our dataset.
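To make the definition of accuracy concrete, here is a minimal Python sketch. The function name and the toy labels are hypothetical and chosen only for illustration; they are not taken from our benchmark or from any detector's API.

```python
def accuracy(true_labels, predicted_labels):
    """Share of texts whose predicted label matches the known source."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one labelled text is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical toy data: the known source of five texts and a detector's verdicts.
true_labels = ["ai", "human", "ai", "human", "ai"]
predicted_labels = ["ai", "human", "human", "human", "ai"]

# Four of the five verdicts are correct.
toy_accuracy = accuracy(true_labels, predicted_labels)
print(f"accuracy: {toy_accuracy:.0%}")
```

Because the dataset is roughly balanced between human and AI text, accuracy is a reasonable headline number here; on a heavily skewed dataset it could look good even for a detector that always gives the same answer.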

Another important metric is precision, i.e. how many of the items detected as AI are actually AI. We have tuned our model to minimise the number of false detections (false alarms) and achieve a precision of over 99%. This means that less than 1% of all texts flagged as AI are actually human-written. While the precision of other detectors is not much lower, the difference between 99% and 98% means twice as many false alarms among the texts flagged as AI.
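The arithmetic behind the "doubling" claim can be sketched in a few lines of Python. The count of 1,000 flagged texts is an assumed round number for illustration, not a figure from our benchmark, and the function names are hypothetical.

```python
def precision(true_positives, false_positives):
    """Share of texts flagged as AI that really are AI."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no texts were flagged as AI")
    return true_positives / flagged


def expected_false_alarms(precision_value, flagged_count):
    """Number of human-written texts among those flagged as AI."""
    return round((1 - precision_value) * flagged_count)


# Assumed example: each detector flags 1,000 texts as AI.
flagged_count = 1000
alarms_at_99 = expected_false_alarms(0.99, flagged_count)
alarms_at_98 = expected_false_alarms(0.98, flagged_count)

print(f"false alarms at 99% precision: {alarms_at_99}")
print(f"false alarms at 98% precision: {alarms_at_98}")
```

A one-point drop in precision looks small, but it moves the share of wrongly flagged human texts from 1% to 2%, so the number of people falsely accused doubles.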

The other detectors that are similarly precise have lower overall accuracy because they miss more AI-generated text. We can see this in the recall (how many of the actually AI-generated items are detected), where the other high-precision models fall behind.
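The interplay between precision, recall and accuracy can be illustrated with a small Python sketch. The confusion counts below are invented for two hypothetical detectors on a balanced set of 500 AI and 500 human texts; they are not measurements of any real product.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    true_positives: int   # AI texts flagged as AI
    false_positives: int  # human texts flagged as AI (false alarms)
    false_negatives: int  # AI texts that were missed
    true_negatives: int   # human texts correctly passed as human

    @property
    def precision(self):
        return self.true_positives / (self.true_positives + self.false_positives)

    @property
    def recall(self):
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def accuracy(self):
        total = (self.true_positives + self.false_positives
                 + self.false_negatives + self.true_negatives)
        return (self.true_positives + self.true_negatives) / total


# Hypothetical counts: both detectors raise very few false alarms,
# but detector_b misses one in five AI-generated texts.
detector_a = ConfusionCounts(true_positives=490, false_positives=5,
                             false_negatives=10, true_negatives=495)
detector_b = ConfusionCounts(true_positives=400, false_positives=4,
                             false_negatives=100, true_negatives=496)

for name, d in [("A", detector_a), ("B", detector_b)]:
    print(f"detector {name}: precision={d.precision:.3f} "
          f"recall={d.recall:.3f} accuracy={d.accuracy:.3f}")
```

In this toy example both detectors have a precision of about 99%, yet the second one's recall of 80% pulls its accuracy down to roughly 90%, which is the pattern described above.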

While results may vary on different types of data, AI Aware shows leading performance in terms of precision and recall, making it a reliable system for applications in diverse domains out of the box, with the option to fine-tune it for different specialities.