A reflection on the phenomenon of LLM Model Collapse leading to the decline in AI quality
In the rapidly developing world of artificial intelligence (AI), we are witnessing a phenomenon that seems to contradict the idea of continuous and unstoppable progress. In recent months, several articles and discussions have raised doubts about the stability of AI models’ performance, which, apparently, are becoming worse in their responses over time, even without being replaced or updated, leading to the paradox that artificial intelligence is getting dumber. But what does this phenomenon really imply, and what are the possible explanations? In this article, I try to explore the possible causes of this decline, the implications for the future of AI, and the proposed solutions to reverse this trend.
The phenomenon of LLM Model Collapse
Just as over-reliance on AI tools risks limiting our ability to innovate and explore new technologies — a topic I covered extensively in a previous article reflecting on the risk of technological stagnation associated with this choice — the phenomenon of ‘model collapse’ further highlights how crucial it is to preserve the quality of data and the authenticity of human interactions to ensure AI progress and technological development.
The concept of LLM model collapse refers to a potential degradation in AI models’ performance as they are trained on data generated by other AI, instead of genuine human data. This risk arises when models begin to “recycle” their own outputs, leading to a progressive loss of quality and diversity in the results. This phenomenon is comparable to a digital form of inbreeding, where the continued use of synthetic data compromises the model’s ability to produce accurate and diverse responses.
Large language models are often referred to as “black boxes”, underscoring the unknown nature of how these complex systems operate. There is no clear answer as to why large language models might behave inconsistently, but researchers say that these drifts can have a real impact.
The dependence on high-quality human data
Modern artificial intelligence systems are built using machine learning. Programmers set up the underlying mathematical structure, but the actual “intelligence” comes from training the system to mimic patterns in the data. The current generation of generative AI systems requires high-quality data in large quantities.
To source this data, major tech companies like OpenAI, Google, Meta, and Nvidia are constantly scouring the Internet, collecting terabytes of content to fuel their machines. However, since the advent of widely available generative AI systems and tools in 2022, people have increasingly been uploading and sharing content that is partially or entirely created by AI.
In 2023, researchers began to ask whether it was possible to rely solely on AI-generated data for training instead of human-generated data, but they soon discovered that without high-quality human data, AI systems trained on AI-generated data become increasingly dumber as each model learns from the previous one.
The risks of training on AI-generated data
The problem is exacerbated because AI-generated data is much cheaper and easier to obtain than human data, leading less forward-thinking companies to exploit it massively. However, over time, this practice could make models less and less effective, with outputs becoming increasingly inaccurate and repetitive, and with a reduction in the model’s quality and diversity of behavior as their knowledge base would end up being built on a series of “derived” and not original information.
The consequences could include the loss of cultural diversity and the reduction of human interaction online, negatively impacting the “digital commons” represented by the Internet. Some scholars suggest that to avoid model collapse, it is crucial to continue integrating new human data and promoting diversity in training sets, as well as limiting monopolies in the AI industry.
Essentially, what we must absolutely avoid is polluting the primary source of data, which represents the essential resource for developing quality AI systems, thus preventing the information used for AI training from being based on a cycle of “regurgitated training” (i.e., through repeated use of artificially generated data instead of authentic data) because such a practice would not only damage trust in the system but also its long-term effectiveness, compromising the potential for innovation that AI is capable of offering.
The difficulty of filtering AI-generated content
But then, to solve this problem, couldn’t major tech companies simply filter AI-generated content? Not quite. Tech companies already spend a lot of time and money cleaning and filtering the data they collect, with an industry insider recently sharing that sometimes as much as 90% of the data initially collected for training models is discarded.
These efforts could become more challenging as the need to specifically remove AI-generated content increases. But, more importantly, over time it will become increasingly difficult to distinguish AI content. This will make filtering and removing synthetic data a game of diminishing (financial) returns.
Although a wave of synthetic content may not pose an existential threat to the progress of artificial intelligence development, it could, however, threaten the digital public good of the Internet. For example, researchers found a 16% drop in activity on the coding website StackOverflow a year after the release of ChatGPT. This suggests that AI assistance may already be reducing person-to-person interactions in some online communities.
According to a recent study, as much as 57% of the content on the web is generated by AI or translated with the support of AI models currently in circulation. It is especially AI-translated content — often done hastily and not entirely accurately — that populates the web, representing “a significant portion of the total content in those languages.”
This excess of AI-generated content risks collapsing the models. Since chatbots like ChatGPT or Gemini are trained on data acquired through web scraping, if the quality of the content on the web deteriorates, the models’ performance will suffer as well. This creates a vicious cycle: websites fill up with low-quality content generated or translated by AI, and that same content becomes training material for the models, which end up learning incorrect information.
The overproduction of AI-based content is making it increasingly difficult to find content that is not clickbait full of ads. It is becoming impossible to reliably distinguish between human-generated and AI-generated content. One way to address this issue would be to watermark or label AI-generated content, as many have recently highlighted and as reflected in the recent interim legislation from the Australian government.
The exhaustion of human data
As AI-generated content becomes systematically homogeneous, we risk losing socio-cultural diversity, and some groups of people may even experience cultural erasure. We urgently need interdisciplinary research on the social and cultural challenges posed by AI systems.
According to some estimates, the pool of human-generated text data could be depleted as early as 2026. It is likely that this is why OpenAI and others are rushing to strengthen exclusive partnerships with industry giants like Shutterstock, Associated Press, NewsCorp and GEDI (including Repubblica). These companies own large proprietary collections of human data that are not readily available on the public Internet.
Human interactions and human data are important, and we should protect them. For our sake, and perhaps even for the sake of the potential risk of a future AI model collapse.
New web restrictions threaten AI model growth
Generative AI has evolved thanks to the vast sets of public data collected by crawlers scouring the web, a practice whose ethics are often questioned. On one hand, free access to public data has allowed models like those of OpenAI and Anthropic to improve quickly, but on the other hand, many are wondering whether it is right to use such data without the explicit consent of content creators. An increasing number of websites are using simple technologies such as the robots.txt file to prevent crawlers from accessing their content, primarily to protect their profits from unauthorized use by AI models. Even though many AI companies have been accused of disregarding the robots.txt file’s directives and thus scanning all pages of websites anyway, this is leading to a data supply crisis for AI companies, which rely on large amounts of public information to train their models. The report “Consent in Crisis: The Rapid Decline of the AI Data Commons” by the Data Provenance Initiative highlights how restrictions are growing, especially on high-quality sites like news and social media, reducing the availability of fresh and accurate content for AI models.
The case of Wordfreq: when AI pollutes linguistic data
The closure of the Wordfreq project, which aimed to monitor and analyze language usage across various platforms, is an illustrative case that clearly demonstrates the contamination of data by generative artificial intelligence.
While Google has made word frequency public through its Book Ngram Viewer service (which uses literature available on Google Books), Wordfreq relied on web scraping for its datasets, gathering information from platforms such as Twitter and Reddit. However, both social services have recently restricted access to their data: Twitter’s public APIs have been closed or made prohibitively expensive, while Reddit now sells its archives at high prices, affordable only to large companies like OpenAI. This change has severely limited the availability of reliable conversational linguistic data on which Wordfreq relied.
But the real problem, according to Robyn Speer, the creator of Wordfreq, was the data pollution caused by generative artificial intelligence, which has made the Internet saturated with low-quality texts generated by large language models (LLMs). These contents often lack genuine communicative intent and can mislead linguistic analysis. For example, Speer observed that LLMs like ChatGPT tend to overuse certain words — such as “delve” — which distorts their frequency in datasets, making it difficult to draw accurate conclusions about human language use after 2021.
The rise of LowBackgroundSteel as a reaction
In response to the problems caused by data pollution from generative artificial intelligence, projects like LowBackgroundSteel.ai have emerged. Founded in 2023, this website was created as a place to gather references to datasets not contaminated by AI, offering a valuable resource for researchers and developers in need of clean data for training models, and aiming to combat the “model collapse” phenomenon by providing access to authentic and reliable data.
LowBackgroundSteel aims to create a repository of datasets that have not been contaminated by AI-generated content. This is crucial to ensuring the quality and reliability of the data used in machine learning processes. Among the included datasets is, of course, Robyn Speer’s project, Wordfreq.
LowBackgroundSteel represents a direct reaction to the challenges posed by artificial intelligence in the field of data collection and usage. The project invites the scientific and technological community to contribute by sharing and using uncontaminated datasets, thus promoting sustainable and reliable AI development.
The very existence of a site like LowBackgroundSteel also highlights an increasing awareness of the need to protect human datasets from contamination by generative AI.
Asimov had predicted it
On the web, in relation to this phenomenon, Isaac Asimov is often cited, who in his novella “Profession,” had, in a sense, predicted the risks associated with learning exclusively based on artificial sources. In this story, the society of the future trains professionals through standardized programs, depriving individuals of the ability to develop creative or critical thinking skills. The result is a population that possesses high technical skills but is unable to innovate, finding solutions only within predefined frameworks.
This dystopian vision eerily reflects what could happen if artificial intelligence models, such as LLMs (Large Language Models), were trained only on data generated by other LLMs. Just as Asimov’s professionals become repetitive tools, without the ability to create new ideas, such a system could produce AI incapable of generating original content, progressively limiting innovation and the evolution of knowledge. The output of these models risks becoming a sterile repetition of existing information, without the freshness and diversity that comes from confronting real experiences or new human ideas.
In the end, the protagonist, initially excluded because deemed unsuitable for the conventional training system, demonstrates, however, that he possesses a rare and crucial quality for the future of society: the ability for original thought, providing new perspectives beyond the limits imposed by the system and innovating where others follow only rigid and repetitive patterns.
The lesson to be learned for artificial intelligence is that relying solely on artificially generated data could cause us to lose that very element of uniqueness and divergent thinking that is essential for innovation.
The decline in performance during the “Winter Break”: reality or myth?
Many users have noticed that ChatGPT seems to become “lazy” during the winter months, with slower and less detailed responses. This phenomenon, jokingly dubbed the “winter break hypothesis,” suggests that the model may “learn” from human behavior patterns, where people tend to slow down work around the December holidays.
An X account called Martian openly wondered if LLMs could simulate seasonal depression. Later, Mike Swoopskee tweeted : “What if it learned from its training data that people usually slow down in December and push off bigger projects to the new year, and that’s why it’s been lazier lately?”
On the other hand, while ChatGPT seems to exhibit this winter slowdown, jokes have circulated about Claude AI, developed by Anthropic, showing similar behavior in the summer. Since Claude was created by a European company, some have joked that the AI may mirror European summer habits, where productivity tends to decline during the hot months, following the typical summer vacation rhythm.
“These kinds of behaviors can pose a major obstacle to reliable implementations of source language models,” said CIO Dive James Zou, associate professor of biomedical data science at Stanford University. “If you have a large language model as part of your software or data science stack and the model suddenly becomes lazy or exhibits changes in formatting, behavior, and outputs, this could actually disrupt the rest of your pipeline.”
But why consider such strange hypotheses? Simple: because research has shown that large language models like GPT-4 respond to human-like encouragement, such as telling a bot to “take a deep breath” before solving a math problem or telling an LLM that it will receive a tip for completing the work, or if an AI model becomes lazy, using emotional stimuli such as telling the bot that your job may depend on its response, or simply that you have no fingers, seems to help extend and make the outputs more complete.
However, these results have not been conclusive because, while it seems a plausible explanation, the performance difference observed with the ‘winter break’ is still overall small and thus could simply be due to chance.
Prompt engineering and the role of kindness
One of the suggested solutions to address these issues is to improve Prompt Engineering techniques. Refining the ability to formulate questions and interactions with AI models could improve the quality of responses received.
An often overlooked element is the use of kindness in requests. Phrases like “please,” “thank you,” and a general tone of respect can influence the quality of the responses provided by the AI. This happens because the model is trained on huge amounts of textual data, including online discussions and real human interactions, where kindness often generates more positive responses. In a sense, the model mimics human social dynamics, recognizing and responding more accurately to requests made with kindness and care.
Additionally, asking the model to “interview you” with all the questions needed to provide a well-done output can further improve the quality of responses. By providing clear context and showing respect for the interaction process, it is more likely that the model will respond more appropriately and effectively.
Possible explanations for the performance decline
In addition to the aforementioned issues of “data saturation” resulting from the use of artificial data that creates a negative cycle, where each new iteration becomes less accurate (it is the equivalent of photocopying a photocopy: over time, the image loses quality), and the bizarre theory of the “winter break,” there are other theories to explain why AI models seem to deteriorate over time. Here are a few I’ve gathered:
- Economic reasons: One of the simplest explanations is related to costs. Initially, when a new model is launched, a large amount of computing resources is allocated to ensure optimal performance. However, over time, these resources may be reduced to contain costs, leading to a drop in performance. This hypothesis is supported by the fact that slower models, such as the O1 version of GPT, take much longer to generate a response, suggesting more economical management of computing resources.
- Predefined prompts and response moderation: Another theory concerns the use of predefined prompts to guide and moderate the model’s responses. These prompts, which are secretly provided before the user’s prompt, may limit the model’s ability to provide complete or accurate answers. For example, to avoid inappropriate responses, models may be programmed to avoid certain topics or response styles, thus reducing overall effectiveness.
- Reinforcement learning and continuous training: Another factor that could affect performance decline is reinforcement learning. During training, AI models are programmed to improve their responses based on the feedback they receive. However, if this process is interrupted or if the training data becomes less representative, the quality of the responses may decrease.
Originally published at Levysoft.