How good are AI detectors at content classification?
(13 min. read)
Most AI content generation tools are based on GPT language models provided by OpenAI, so the patterns they employ when creating content are somewhat similar. We can now create, improve, or edit texts not only directly in the ChatGPT application but also in notion.so documents or via the Grammarly plugin.
Today, an increasing amount of content published online is generated by or with the assistance of artificial intelligence. As a result, new ways of verifying the origin of content are emerging: AI detectors. We decided to investigate their reliability and the degree of trust we could place in them.
How do AI detectors work?
AI content detection tools are essentially based on new language models trained to differentiate between texts written by humans and those created by artificial intelligence. They base their probability assessments on the perplexity (a measure of how unpredictable the text is) and burstiness (a measure of how much that unpredictability varies across the text) of the content. Humans tend to write with greater randomness – for instance, we often intersperse longer sentences with shorter, less complex ones. AI content detectors learn to identify these characteristics through training on millions of sample texts, previously categorized as human-written or AI-created.
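To make these two measures concrete, here is a minimal sketch of how they can be estimated in Python, using GPT-2 (via the Hugging Face transformers library) as a stand-in scoring model. Real detectors use their own, purpose-trained models, so treat this as an illustration of the signals rather than a working detector.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How 'surprised' the model is by the text; lower means more predictable."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # The model shifts labels internally; the loss is the mean
        # negative log-likelihood per token.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def burstiness(sentences: list[str]) -> float:
    """Standard deviation of per-sentence perplexity; human prose tends to vary more."""
    scores = [perplexity(s) for s in sentences]
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / len(scores))
```

A text with low perplexity and low burstiness looks machine-like to a detector; high values on both suggest a human author.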
While AI models like ChatGPT and other GPT-based tools can generate content in a variety of languages, most verification tools primarily support English. So how does one verify content in other languages?
One method is to verify texts that have been automatically translated by tools such as DeepL or Google Translate. It's crucial to remember, however, that machine translation imposes its own patterns on the text, which can influence the detector's evaluation. Nevertheless, some detectors are willing to verify content in other languages, despite their official support being limited to English.
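As a sketch of that translate-then-verify workflow, assuming the official deepl Python client (the auth key and source text are placeholders):

```python
import deepl

polish_article = "..."  # the Polish source text goes here

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key
result = translator.translate_text(polish_article, target_lang="EN-US")

# result.text can now be submitted to an English-only detector; remember
# that the translation step itself imprints machine patterns on the text.
print(result.text)
```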
Each detector also has its limitations regarding the length of content it can verify, depending on whether a free or paid version is used. Moreover, the tools report their verdicts in different ways: some give a percentage expressing how certain they are that the text came from a given source, while others use verbal assessments or a simple binary verdict (yes or no).
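Because of these mismatched scales, comparing tools means mapping every verdict onto a common axis. Below is a hypothetical sketch of such a mapping; the numeric values assigned to the verbal labels are our own rough assumption, not anything the tools publish.

```python
from dataclasses import dataclass

# Rough mapping from verbal labels to P(AI-generated); assumed values.
VERBAL_SCALE = {
    "very unlikely": 0.10, "unlikely": 0.30, "unclear": 0.50,
    "possibly": 0.70, "likely": 0.90,
}

@dataclass
class Verdict:
    detector: str
    p_ai: float  # normalized probability that the text is AI-generated

def normalize(detector: str, raw: object) -> Verdict:
    if isinstance(raw, bool):    # binary tools: yes/no verdicts
        return Verdict(detector, 1.0 if raw else 0.0)
    if isinstance(raw, str):     # verbal scales
        return Verdict(detector, VERBAL_SCALE[raw.lower()])
    return Verdict(detector, float(raw) / 100.0)  # percentage tools

print(normalize("binary-tool", True))        # Verdict(..., p_ai=1.0)
print(normalize("verbal-tool", "unlikely"))  # Verdict(..., p_ai=0.3)
print(normalize("percentage-tool", 88))      # Verdict(..., p_ai=0.88)
```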
Tools we verified
From the range of solutions available on the market, we've put the four most popular tools under the microscope and tested them using a sample set of 20 texts. We selected 10 texts written by humans and 10 generated by artificial intelligence to evaluate how these tools would classify them.
In the batch of traditionally written texts, we included articles written entirely by copywriters (some predating 2021, which rules out AI involvement in their writing), excerpts from literature (including children's books), news articles, and instructional guides. The AI-generated content included fragments of AI-created literature, such as The Inner Life of an AI: A Memoir by ChatGPT and Bob the Robot, as well as articles produced using artificial intelligence (both those already published online and those freshly created with GPT-3.5 and GPT-4). We also incorporated Polish texts, along with their DeepL and Google Translate translations, into the sample. All tests were conducted on the same text fragments, and given the tools' varying classification methods, we examined the results on a tool-by-tool basis.
Below, we present a compilation of the texts written by humans, followed by those generated by artificial intelligence.
AI Text Classifier by OpenAI
The AI Text Classifier by OpenAI, the creators of the GPT models, is the most popular tool for detecting content created by artificial intelligence. It expresses its findings as the probability that the analyzed text was created by AI, using a range of categories: 'very unlikely', 'unlikely', 'unclear', 'possibly', and 'likely AI-generated'. As you can see, this scale offers a fair degree of nuance. OpenAI itself notes in the tool's description that the results may not always be accurate and that the detector can misclassify both AI-created and human-written content. It's also worth noting that the dataset used to train the AI Text Classifier didn't include student work, so it's not recommended for verifying such content.
Out of 10 human-written texts, the AI Text Classifier correctly classified 9 as 'very unlikely' or 'unlikely' to have been created by artificial intelligence. It struggled more with texts generated by AI, however, frequently classifying them as 'unclear', 'unlikely', or 'possibly' AI-generated.
Originality.AI
Another widely used tool for verifying content created with artificial intelligence is called Originality. Its creators claim that it has a 95.93% effectiveness rate. It's the only tool on our list that comes with a cost, charging $0.01 for every 100 words verified. The minimum package costs $20. Besides verifying the origin of content, Originality also screens it for plagiarism.
Originality uses percentages to reflect its certainty about how the content was created. A 66% Original score does not mean that 66% of the text was written by a human and 34% by AI, but rather that Originality is 66% certain the content was created by a human. The tool highlights in red the parts it believes were generated by AI and in green the parts it's confident are human work. Interestingly, texts ultimately categorized as human-written often have most, or at least half, of their content highlighted in red.
Originality sometimes struggles to definitively categorize human-written content. Out of all the fragments verified, only one was rated with 100% certainty as human-written. The results for the remaining texts ranged from 52% to 92% certainty that the content was human.
The tool did a slightly better job of verifying content generated by AI: in 7 out of 10 texts, its certainty that the content was AI-generated was 99% or 100%. Doubts arose with content generated by AI in Polish and then translated into English. Even though the article was generated by one of the older GPT models (and was published on a blog back in September 2022) and contained quite a few stylistic errors, Originality was 92% certain that the Polish version was human-written. For the translations, however, the balance tipped in favor of AI: Originality was 57% certain that the DeepL-translated content was AI-generated and 85% certain for the Google Translate version.
The text that gave Originality the most trouble was an article from the popular site Bankrate.com, where the content is AI-generated and verified by humans. This was the only case where the tool was 88% sure that the article was human-written, even though it was actually created with the help of AI. So, it seems the key to "fooling" Originality lies in careful editing of the text.
Copyleaks
The overall content rating in Copyleaks is binary; the possible results are "This is human text" and "AI Content Detected". A detailed verdict for a specific passage can only be viewed by hovering over the text; the tool then indicates the likelihood that the selected paragraph was written by a human or by an AI.
Out of 10 human-written texts, Copyleaks detected AI-generated content in two. For the content generated by AI, its verdicts split roughly evenly between human and AI, so the results are not reliable and could even be considered useless. It is also surprising how "sensitive" Copyleaks is to changes: in our examples, simply altering the input prompt or adding the chapter number and title was enough to completely change the evaluation.
Content at Scale – AI Detector
Content at Scale primarily serves as a tool for automatic content generation, with its verification function as an additional feature. The creators of the tool assert that texts generated by their system are undetectable by AI detectors. This raises the question: was the detector designed to confirm the effectiveness of the generator?
The company provides an example text generated by Content at Scale on their website. According to their own detector, this text has a 95% probability of being human-written, and the AI Text Classifier likewise judged it unlikely to be the work of AI. Originality and Copyleaks, however, were not so easily fooled: Originality rated the text as 100% AI-generated, while Copyleaks detected AI content. As one can see, different detectors offer varying perspectives.
Content at Scale's detector did quite well in detecting content written by a human. For 9 out of 10 of the verified fragments, the probability of them being human-written exceeded 90%.
However, it struggled more with AI-generated content; in 6 out of 10 cases, it judged texts to be a mix of human and AI writing. The rest were incorrectly classified as human-written.
Fooling the detectors
AI-generated content doesn't just appear out of nowhere. Behind the prompts that create it, there is always a human. So, the question arises: can we devise ways to fool AI content detectors and generate content that slips past their detection capabilities?
The internet is full of tricks for crafting prompts that evade detection. One of them is to explain the concepts of 'perplexity' and 'burstiness' to the chatbot, guiding the AI to take these factors into account when generating new text. To put this to the test, we tasked a chatbot powered by the GPT-4 model with drafting an article on the best climbing spots around Paris.
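A prompt along these lines might look as follows. This is illustrative only – not the exact wording we used – and it assumes the OpenAI Python client with an API key in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Perplexity measures how unpredictable a text is; burstiness measures "
    "how much that unpredictability varies between sentences. Write with "
    "high perplexity and high burstiness, mixing long, complex sentences "
    "with short ones. Now draft an article on the best climbing spots "
    "around Paris."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.choices[0].message.content)
```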
The results were interesting. We compared the original version of the text with a second attempt, in which we explained the assessment criteria to the chatbot and guided it to make the content seem more human-like.
The only tool that wasn't fooled was Originality. In both instances, however, we successfully duped the AI Text Classifier, which concluded that both versions were unlikely to be AI-generated. Interestingly, Content at Scale and Copyleaks changed their verdicts once GPT-4 crafted the text with the guidelines about perplexity and burstiness in mind.
What makes detectors unreliable?
Detectors are designed to measure how predictable a text is, which is what perplexity captures. The lower the predictability, the greater the chance that the text was written in the traditional manner: by a human. However, just as slight changes to a prompt entered into a chatbot like ChatGPT can yield a completely different result, a minor modification to the text being checked (one that may barely alter its meaning) can change the assessment these tools give.
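To see why, consider how a perplexity score reacts to a trivial edit. Reusing the perplexity() sketch from earlier (the sentence and heading below are invented examples, not text from our sample):

```python
body = "The old lighthouse keeper counted waves, then forgot why he had started."
print(perplexity(body))
# The same prose with a cosmetic heading produces a different score,
# because every extra token changes the average per-token surprise.
print(perplexity("Chapter 1: The Lighthouse\n" + body))
```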
Take, for instance, the children's book Bob the Robot, which was 80% generated by artificial intelligence; the publisher attributed the remaining 20% to human editing. The segment we verified was the book's first chapter, comprising 276 words. We verified essentially the same text twice, with only one minor change: we added "Chapter 1: Star City" before the content of the chapter. Consequently, Copyleaks completely changed its assessment of the book's origin. In the clean text of the chapter, the tool detected AI-generated content, but the same text with the chapter number and title was deemed human-written.
So, how do detectors respond to translated content? As an example, we used content written by a copywriter without the aid of artificial intelligence. The article was written in Polish. Even though some tools don't support this language, they still attempted to verify the content without flagging any errors. The results provided by originality.ai showed 77% certainty that the content was human-written when verifying the Polish text. The same text, translated by DeepL, gave the tool only 67% certainty that it was written by a human, and this certainty dropped to 36% when the text was translated by Google Translate.
Interestingly, Copyleaks, which officially supports Polish content, correctly classified all versions. The likelihood that most of the text was human-written (bearing in mind that the tool evaluates individual parts of the text separately) was 99.9% for the Polish version, 89.8% for the DeepL translation, and 90.2% for the Google Translate version. Although the differences are minor, it's intriguing that Google's translation was judged closer to a human writing style than DeepL's – the reverse of Originality's ranking.
How do AI content detectors differ from humans?
Researchers from the School of Engineering and Applied Science at the University of Pennsylvania investigated how humans discern AI-generated content. Are we able to spot the differences, and what factors do we consider in our judgment?
For various content types, we primarily focus on relevance, which typically exerts the strongest influence on our appraisal. In AI-generated content, we also notice errors stemming from a lack of common-sense logic – something AI tools cannot verify – as well as contradictory passages.
AI detectors primarily base their assessments on the predictability and randomness of the text, without considering many other factors highlighted by the study's participants. We put this to the test with an article created by artificial intelligence in September 2022, before ChatGPT became a household name. The article is riddled with stylistic errors, illogical sentences, and awkward phrasing on the one hand, but on the other, it is indeed 'original' and notably random.
The text was originally in Polish, so we repeated the process of automatic translation via DeepL and Google Translate, and for good measure, we also examined the original text (keep in mind that not all tools support Polish, yet they still attempt to evaluate it).
Here are the results:
As is evident, Originality.ai handled the translated text best, but even its results are not unequivocal. Bear in mind that automatic translation can introduce an 'unnatural' element to the content, which may partly account for an increased likelihood of the content being perceived as AI-generated.
Summary
The paradox of AI content detection tools is that we are skeptical enough about the quality of AI-generated texts to need another AI-based tool to detect them. The crucial questions to ask at this point are: Can we determine with absolute certainty how a piece of content was created? And if an article is well-written and contains no factual errors, does the way it was written really matter?
There's a widespread belief that content created by AI is of low quality. Some also argue that websites will be 'penalized' for content generated in this way. Yet, Bing and Google are moving towards the use of artificial intelligence in their search engines, suggesting that they see the benefits of such solutions.
None of the tools are infallible. In our trial, some did better at assessing human-written texts, while others excelled with those written by AI. For some content, the results varied significantly between tools. When verifying the selected excerpts, we knew how they were created, but had we conducted the test blindly, the results would simply have been unreliable. The biggest problem with these tools is the uncertainty over when they are correct and when they are mistaken: using them, we never truly know whether we can trust them in a particular case.