We live in the age of artificial intelligence (AI), which makes our lives easier and, at times, more spectacular. Developers have created a huge number of AI tools for automating text writing and for generating pictures, presentations, video avatars, computer code and much more. Even web search has faced competition, as it is often more convenient to look up information in DeepSeek, Perplexity or OpenAI's ChatGPT.
The ability to generate answers in seconds has scaled to the point where sophisticated AI systems now produce text that is almost impossible to distinguish from human writing with the naked eye.
On the one hand, this genuinely improves efficiency and helps with problem solving, especially with personal AI agents, when you don't want to waste time on routine daily tasks and can delegate them to AI. On the other hand, while one group of people creates and uses these tools sensibly, others grow lazy, stop thinking for themselves and delegate even basic work to AI, which leads to skill degradation and seriously affects business. The quality of goods or services and a company's reputation can suffer from inappropriate use of AI tools.
Widespread use of AI in education, in filling out applications and in business tasks that require critical thinking calls people's actual skills, knowledge and abilities into question. Even in public discourse we need to be able to distinguish fact from fiction.
This is what justified the emergence of AI detectors. Some of them claim up to 99% accuracy, but in practice these tools frequently fall short and sometimes fail to recognize ordinary GPT output. Even OpenAI, a leader in AI development, launched its own AI Classifier in 2023 and later officially retired it because of its low accuracy: it correctly identified only 26% of AI-written text. The company admitted that the tool was unreliable and produced too many errors. This gap between marketing claims and actual performance is a primary reason for skepticism.
Although several tools now work reasonably well, a crucial question remains: can we fully trust programs that claim to identify AI-generated text? Because there are so many doubts on this issue, we conducted our own research to clear it up. This article evaluates the accuracy and reliability of AI detection tools in identifying AI-generated text.
What are AI detectors and how do they work?
AI detectors are tools designed to determine whether content was generated by AI or written by a human. They work like a sophisticated plagiarism checker, but instead of checking whether text was copied from another source, they check whether it was written by a machine.
There are several approaches to detecting AI, depending on the type of content. For writing, detectors rely on algorithms trained on vast amounts of both human-written and AI-generated text. They analyze various characteristics of the content to identify patterns, styles or linguistic nuances typically associated with either human or machine authorship. When the form of the text itself offers no clues, they look for metadata traces that some AI tools embed in their output. For other media, AI detectors evaluate visual or auditory cues such as pixel patterns, speech intonation or frame inconsistencies. They can also analyze code for recognizable AI-generated patterns.
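To make the statistical approach concrete, here is a minimal, purely illustrative sketch in Python: a classifier trained on a handful of labeled human and AI samples that outputs a probability for new text. The training examples are invented, and this is not the actual method used by any of the commercial detectors discussed below.

```python
# Toy text classifier in the spirit of statistical AI detectors:
# train on labeled human/AI samples, then output a probability for new text.
# Illustrative only; real detectors use far larger data and richer models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = AI-generated, 0 = human-written.
texts = [
    "In conclusion, it is important to note that collaboration drives success.",
    "honestly i just winged the report and hoped nobody would read page 3",
    "Artificial intelligence offers numerous benefits across various industries.",
    "my cat knocked the router off the shelf mid-call, classic tuesday",
]
labels = [1, 0, 1, 0]

# Character n-grams capture stylistic regularities that word counts can miss.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

sample = "It is worth emphasizing that these findings have several implications."
prob_ai = detector.predict_proba([sample])[0][1]
print(f"Estimated probability the sample is AI-generated: {prob_ai:.2f}")
```

The key point the sketch illustrates is that such a detector never returns a yes/no verdict, only a probability, which is exactly why the next paragraph warns against treating any detector as 100% certain.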
Even though some popular AI detectors are mostly accurate, you still can't be 100% sure: they are not perfect and can only provide probabilities. So we tested GPTZero, Originality, It's AI, Winston, ZeroGPT, Desklib and other models to find the most reliable commercial AI detector available today.
How were they tested?
We used benchmarks (RAID, CUDRT, GriD) to estimate the performance of the detectors and compare the results. To picture it, imagine benchmarks as extremely challenging, rigorous and diverse exams and AI detectors as students: the better a student performs on these exams, the more trust they earn, and accordingly, the higher the benchmark score, the more confidence you can have in the detector.
RAID — the biggest and most robust benchmark in AI detection to date, which shows whether a detector can spot generated text even when it has been deliberately disguised
Simply put, it is a sophisticated test of how well a program can tell human text from computer text, even when the computer uses various tricks and gimmicks to pretend to be human. It aims to evaluate the robustness of detectors, revealing vulnerabilities in current models and encouraging further research in AI-generated text detection. It is the only benchmark whose dataset includes more than 10 million samples, covering the largest number of scenarios, AI model types, domains and attack types.
As a result, each detector is tested not on a handful of examples but on a very wide range of data. During the test we generated 2,000 continuations for every combination of domain, model, decoding strategy, repetition penalty and adversarial attack, which gave us approximately 6.2 million generations for testing each detector.
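For readers curious how such a large-scale test is actually scored, here is a rough sketch of the common approach: calibrate each detector to a fixed false positive rate on human-written text, then measure how many AI-generated texts it catches at that threshold. The scores below are simulated and the function is our own simplified stand-in, not the real RAID evaluation pipeline.

```python
# Sketch of a fixed-false-positive-rate evaluation, the style of scoring
# commonly used for detection benchmarks. Scores here are simulated.
import numpy as np

def accuracy_at_fpr(human_scores, ai_scores, target_fpr=0.05):
    """Pick a threshold so that `target_fpr` of human texts are (wrongly)
    flagged as AI, then return the share of AI texts caught at that threshold."""
    human_scores = np.asarray(human_scores)
    threshold = np.quantile(human_scores, 1.0 - target_fpr)
    ai_scores = np.asarray(ai_scores)
    return float((ai_scores >= threshold).mean()), float(threshold)

# Hypothetical detector scores (higher = "more likely AI").
rng = np.random.default_rng(0)
human_scores = rng.normal(0.2, 0.10, size=10_000)  # human-written texts
ai_scores = rng.normal(0.8, 0.15, size=10_000)      # AI-generated texts

acc, thr = accuracy_at_fpr(human_scores, ai_scores, target_fpr=0.05)
print(f"Threshold: {thr:.3f}  Accuracy at 5% FPR: {acc:.3f}")
```

Fixing the false positive rate first matters because it keeps detectors honest: a tool cannot boost its headline accuracy simply by flagging almost everything as AI.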
CUDRT — a benchmark focused on AI modifications of existing text
The CUDRT benchmark shows whether a detector can recognize AI not only when it writes text from scratch, but also when it merely modifies existing text: augments, rewrites, translates or abbreviates it. In real life, AI is often used for exactly these purposes.
GriD — a benchmark that shows how well AI detectors distinguish a generated response from one written by a person in the context of online discussions
The GriD benchmark is as close to our daily life as it gets. It uses real questions and answers from Reddit: for each question there are two responses, one written by a human and the other generated by AI, and the detector's task is to correctly tell which is which.
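This pairwise setup is easy to express in code. Below is a hypothetical sketch: for each question the detector scores both answers, and the pair counts as correct when the AI answer receives the higher score. The `detector_score` argument and the `toy_score` function are placeholders for whatever real detector you would plug in, and the example pair is invented.

```python
# Sketch of a pairwise evaluation in the spirit of GriD: for each question,
# the detector should give the AI answer a higher "AI probability" than the
# human answer. `detector_score` is a placeholder for a real detector.
from typing import Callable

def pairwise_accuracy(
    pairs: list[tuple[str, str]],               # (human_answer, ai_answer)
    detector_score: Callable[[str], float],     # higher = more likely AI
) -> float:
    correct = sum(
        detector_score(ai_ans) > detector_score(human_ans)
        for human_ans, ai_ans in pairs
    )
    return correct / len(pairs)

# Toy stand-in detector: pretends longer, more formal answers look AI-written.
def toy_score(text: str) -> float:
    formal_markers = ("furthermore", "in conclusion", "it is important")
    return 0.1 * sum(m in text.lower() for m in formal_markers) + 0.001 * len(text)

pairs = [
    ("lol just restart the router, works every time",
     "Furthermore, it is important to power-cycle your router before contacting support."),
]
print(f"Pairwise accuracy: {pairwise_accuracy(pairs, toy_score):.2f}")
```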
What about the results?
On the RAID benchmark, the highest accuracy was achieved by It's AI with 95.8%, meaning it was excellent at spotting AI-written text even when the AI tried to hide. The second-best solution, GPTZero, scored 94.1%. It's AI was also ranked first in 5 out of 12 categories, becoming the new SOTA (state of the art) on the RAID benchmark, i.e. the model setting the current standard for detection performance on this task.
It's AI also leads the CUDRT benchmark, outperforming MPU, RoBERTa and XLNet: it won in 6 out of 8 categories of changes and took second place in one more. And even though It's AI was not specifically trained on Reddit threads (as other detectors may have been), it scored 97.3%, the best result on the GriD benchmark and better than the top detector listed in the original GriD table, which reached only 93%.
For more details, you can study the full technical report here.
Conclusion
All in all, you can trust AI detectors if you choose a good one. No detector is perfect, and AI continues to evolve, but strong results across diverse benchmarks are a good indicator of robustness.
Based on our results, we encourage you to use It's AI, as it proved to be the best choice in all three benchmarks. That makes it a very reliable tool for figuring out whether text was AI-generated or written by a human, no matter whether the text was fully generated, only modified, or just an answer to a question.


