The contamination of training data by AI-generated content could undermine the reliability of future models. Experts warn that if clean, human-generated data is lost, artificial intelligence could end up controlled by a few powerful entities.

November 30, 2022, marks a pivotal moment in the history of artificial intelligence. It is the day OpenAI officially launched ChatGPT, setting off a new era of generative AI. Since that moment, everything has changed. Much like the first atomic bomb test on July 16, 1945, in the New Mexico desert — which had irreversible environmental consequences — the advent of ChatGPT has, according to many experts, permanently “contaminated” the data world.
This analogy is strong, but not arbitrary. After the Trinity nuclear test, radioactive particles spread through the atmosphere and, because atmospheric air is blown through molten metal during steelmaking, seeped into industrial materials as well. From that point on, no newly produced steel could be considered pure, and to build highly sensitive medical and scientific instruments, manufacturers had to rely on low-background steel: steel produced before 1945.
Now, something similar is happening in the world of artificial intelligence.
The risk of self-destruction for artificial intelligence
Every time a generative AI produces content, whether text, an image, or code, it leaves an artificial trace in the digital environment. These traces end up in other datasets, which are later used to train newer generations of models. Over time, this means the models are no longer learning from humans, but from other models. It's as if an ecosystem started feeding only on its own waste.
This phenomenon has a name: model collapse, also known as Model Autophagy Disorder (MAD). It is a technical term for a very real risk: that AI stops being reliable because successive generations of models train on increasingly distorted, inaccurate, or outright false information.
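The dynamic is easy to see in a toy simulation. The sketch below is purely illustrative, with made-up numbers and no real model involved: each "generation" is trained only on text sampled from the previous generation's output, and rare token types that drop out of one sample can never reappear in later ones.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data. A long-tailed vocabulary of 1,000 token types
# with Zipf-like frequencies: a few very common tokens, many rare ones.
vocab_size = 1_000
human_probs = 1.0 / np.arange(1, vocab_size + 1)
human_probs /= human_probs.sum()

sample_size = 10_000                      # how much "text" each generation sees
counts = rng.multinomial(sample_size, human_probs)

for generation in range(1, 21):
    # Each generation learns only from text produced by the previous one:
    # its training distribution is the previous generation's empirical one.
    probs = counts / counts.sum()
    counts = rng.multinomial(sample_size, probs)
    # A token type that fails to appear in one generation's sample can never
    # be generated again: the tail of the distribution erodes for good.
    surviving = int((counts > 0).sum())
    print(f"generation {generation:2d}: {surviving}/{vocab_size} token types remain")
```

Run for enough generations, the surviving vocabulary keeps shrinking toward the most common tokens, which is essentially what model-collapse studies describe: the tails of the original distribution are the first thing to disappear.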
As early as 2023, John Graham-Cumming, former CTO of Cloudflare, recognized this danger. He created lowbackgroundsteel.ai, a virtual archive that catalogs datasets produced before the "contamination point" of late 2022. One example is GitHub's Arctic Code Vault, a snapshot of the platform's public repositories taken in February 2020.
Graham-Cumming’s idea? We need a reserve of “clean” data, just like the steel of the past, to train future models on untainted foundations.
The threat of running out of clean data
But the issue goes deeper. It’s not just about the reliability of models; it’s also about the fairness of the system. Those who still possess original, human-generated, untainted data may soon hold a significant competitive advantage. Startups and smaller players in the field, on the other hand, would be forced to use “contaminated” datasets, leading to weaker, less accurate, and less sustainable models.
This concern was expressed by a group of scholars from various European universities — including the University of Cambridge, the University of Düsseldorf, and Ludwig Maximilians University in Munich — in their paper “Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training,” published in December 2024. According to these experts, public access to clean data must be guaranteed; otherwise, the future of artificial intelligence will be controlled by a few dominant players.
Maurice Chiodo, a researcher at Cambridge and co-author of the study, perfectly captured the urgency:
“If we still have real human data today, it’s because there was a moment, like the scuttling of the German fleet at Scapa Flow in 1919, that allowed us to preserve pure steel. The same goes for data: everything created before 2022 is still considered safe. But if we lose even that, we can never go back.”
A global policy to protect original data
So, how can we protect human data from contamination by artificial intelligence? One potential solution is to label AI-generated content, but this is far from simple. Labels can be removed, digital watermarks erased, and rules differ from one jurisdiction to the next. As Chiodo pointed out, anyone can upload content to the web, and that content will then be scraped and used to train other models, with no oversight at all.
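To see why labelling on its own is so fragile, consider a toy example. The marker scheme below is invented purely for illustration, not a real standard; the point is that any label sitting in the content itself, or in its metadata, can be stripped with a single line of code.

```python
# Hypothetical scheme: AI-generated text is tagged with an invisible
# zero-width marker character. (Invented for illustration; real labelling
# and watermarking schemes are more sophisticated, but face the same issue.)
AI_MARKER = "\u200b"

labeled = AI_MARKER + "This paragraph was produced by a language model."
stripped = labeled.replace(AI_MARKER, "")   # one line removes the label

print(AI_MARKER in labeled)    # True  - the label is present
print(AI_MARKER in stripped)   # False - and gone, with no trace left behind
```

More robust statistical watermarks woven into the text itself do exist, but they too degrade when the content is paraphrased or re-generated.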
In their study, the authors also propose encouraging federated learning, an approach in which raw data is never shared directly: it stays where it is, protected, while models are trained locally and only the resulting model updates are pooled. This would ensure privacy and security while preventing the creation of informational monopolies.
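As a rough sketch of the idea (a toy linear model with synthetic data, not any particular federated-learning framework): each client trains on its own private data, and only the resulting model parameters travel back to be averaged into a shared model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three clients, each holding private data that never leaves their machine.
# Toy task: recover the weights of y = X @ w from noisy observations.
true_w = np.array([2.0, -1.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=200)
    clients.append((X, y))

global_w = np.zeros(3)        # the shared model: the only thing that travels
learning_rate = 0.1

for round_number in range(50):
    local_models = []
    for X, y in clients:
        w = global_w.copy()
        # Each client refines the current global model on its own data.
        for _ in range(5):
            gradient = X.T @ (X @ w - y) / len(y)
            w -= learning_rate * gradient
        local_models.append(w)
    # The coordinator only ever sees parameters, never the underlying data.
    global_w = np.mean(local_models, axis=0)

print("recovered weights:", np.round(global_w, 3))   # close to [2, -1, 0.5]
```

In a setup like this, a holder of clean, human-generated data can contribute to training a shared model without ever handing the data itself over, which is the property the authors point to.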
However, this solution also carries risks. Who controls this data? How is it managed? And what if a government that seems trustworthy today becomes authoritarian tomorrow?
Rupprecht Podszun, a competition law expert and co-author of the study, emphasizes the importance of decentralized, competitive management of untainted data to guard against political influence and monopolies.
The issue at hand is not just technical; it concerns the very future of artificial intelligence, as Chiodo warns:
“If we want AI to remain a useful, fair, and democratic tool, we must act now. Once the entire data environment is contaminated, cleaning it will be virtually impossible.”
Sources: University of Cambridge – arXiv