Machines trained with artificial data lead to AI collapse: “They lose the perception of reality” | Technology

If an artificial intelligence (AI) model is asked to generate random images of dogs, it will mostly produce images of golden retrievers, the most popular breed, along with some Dalmatians or French bulldogs, though fewer of these because the breeds are rarer. But if other AI models are trained on the data produced by that machine, with the golden retriever overrepresented, they will gradually forget the less common breeds and show only that one. Eventually, they will return nothing but brown blobs that vaguely resemble those dogs. Research shows that after an AI model is trained over and over again on content generated by the same machine, the model collapses: it ceases to function, gives bad answers and provides incorrect information. “They begin to produce examples that would never have been created by the original model, that is, they begin to misinterpret reality based on errors introduced by their predecessors,” explains the study, which warns that machines trained on synthetic information “lose their perception of reality.”
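The dog-breed story can be reproduced with a toy simulation (this is an illustration of the degradation loop described above, not the study’s actual method, and the breed probabilities are invented for the example): each “generation” of model is trained only on a finite sample drawn from its predecessor, so estimation errors accumulate and rare categories tend to shrink toward extinction.

```python
import random

# Toy illustration of model collapse: a model's "knowledge" is reduced to a
# categorical distribution over dog breeds. Each new generation is fitted to
# a finite sample drawn from the previous generation's distribution, so
# sampling noise compounds and rare breeds tend to disappear over time.
random.seed(0)

# Invented starting probabilities, for illustration only.
breeds = {"golden retriever": 0.70, "dalmatian": 0.20, "french bulldog": 0.10}

def next_generation(dist, sample_size=50):
    """Re-estimate the breed distribution from a finite sample of it."""
    names = list(dist)
    sample = random.choices(names, weights=[dist[n] for n in names], k=sample_size)
    return {n: sample.count(n) / sample_size for n in names}

dist = dict(breeds)
for generation in range(30):
    dist = next_generation(dist)

print(dist)  # after many generations, the rarer breeds' shares typically shrink
```

Because each generation is an absorbing random walk (a breed whose estimated probability hits zero can never return), repeating the loop long enough drives the distribution toward the single most common category, which mirrors the golden-retriever example.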

“It starts to lose information because it is not clear whether the data collected is sufficient to cover all possible cases. The models are biased and introduce their own errors, and future models may misperceive reality, as they will be trained with biased data from other models,” explains Ilia Shumailov, co-author of the study published today in the journal Nature, a spearhead of the best science, and a researcher at the University of Oxford who currently works for Google DeepMind. The data is “poisoned,” according to the study.

The authors of the study present mathematical models that illustrate the idea of collapse: they show that an AI can ignore some data in its training (for example, less common lines of text) and train on only part of it. In one test, a text about medieval architecture was the original input, and by the ninth round of training the model ended up producing a list of hares. “Models learn from each other. The more they learn, the more their performance degrades, and they start generating repetitive text that is independent of the input request,” Shumailov adds.

Nowadays, it is common practice for models to be trained with synthetic data, that is, data that was not created by humans but imitates real-world data, as OpenAI’s latest report on GPT-4 acknowledges. In principle, it is almost impossible to distinguish whether data was generated by machines or by humans, but if measures are not taken to control the collapse, the consequences are “the degradation of content quality, data contamination, and the perpetuation of biases,” says Luis Herrera, solutions architect at Databricks Spain.

Why do the technology companies behind language models allow these practices? “AIs are trained with huge amounts of data present on the internet, produced by people who have legal copyrights to their material. To avoid lawsuits or to save costs, technology companies use data generated by their own AIs to continue training their machines,” explains Víctor Etxebarria, professor at the University of the Basque Country, in statements to the specialized portal SMC Spain. However, he adds: “This increasingly widespread procedure means that AIs do not serve any truly reliable function. It transforms AIs into tools that are not only useless to help us solve our problems, but can also be harmful if we base our decisions on incorrect information.”

The content created can be used to train other models, or even the model that produced it. The degradation loop can even start unintentionally, when machines are trained on content from the internet that was itself dumped there by other machines. Lorena Jaume-Palasí, an expert in algorithmic ethics and advisor to the European Parliament, warns about the danger posed by the origin of synthetic data: “The Google search engine is one of the sites where quality has decreased. This type of data comes from a great variety of sources, and its quality can never be good. There are trillions of data points; it is humanly impossible to correct them all.” And she emphasizes the “ecological collapse” these models cause: “The data centers are taking all the water. There will come a time when we will have to decide who we give water to and who we don’t.”

Figure: Training an artificial intelligence model with images generated by its own output, according to a Nature News & Views article by Emily Wenger, professor of electrical engineering and computer science at Duke University in North Carolina. (Nature)

Pablo Haya Coll, a researcher at the Autonomous University of Madrid, highlights a limitation of these systems: “This technique can end up corrupting the LLM (a large language model, such as the one behind ChatGPT). It is a warning about the quality of the data used to build these LLMs. As LLMs are adopted more widely, more synthetic data ends up on the internet, which could hypothetically affect the training of future versions.”

The study’s findings describe a scenario in which only AI-generated data is used for training. In the real world, there will likely always be some human-generated data, at least as much as exists now. But it is not yet clear how the two can be told apart. Shumailov, the study’s lead author, suggests doing so with “list maintenance and watermarking.”
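The “list maintenance” idea can be sketched in a few lines (a minimal illustration, assuming a curated registry of known machine-generated texts exists; the names and sample strings are invented for the example): fingerprint each known synthetic document and filter matches out of the training corpus before use.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable fingerprint of a document, used as a registry key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical registry of texts already known to be machine-generated.
known_synthetic = {"Synthetic paragraph about golden retrievers."}
registry = {fingerprint(t) for t in known_synthetic}

# Candidate training corpus mixing human and machine text (invented examples).
corpus = [
    "A human-written essay on medieval architecture.",
    "Synthetic paragraph about golden retrievers.",
]

# Keep only documents whose fingerprint is not on the synthetic list.
clean_corpus = [doc for doc in corpus if fingerprint(doc) not in registry]
print(clean_corpus)  # prints ['A human-written essay on medieval architecture.']
```

Exact-hash filtering only catches verbatim copies, which is why the researchers pair list maintenance with watermarking, embedding a detectable signal in generated text so that paraphrased or re-shared machine output can also be flagged.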

For this researcher and his colleagues, training a model with synthetically generated data is possible, but filtering must be taken very seriously. Toju Duke, former director of responsible AI at Google, explained to EL PAÍS in October last year that models can be trained with data generated by AI, as long as regulation comes into play: “We have to be able to check the facts and the sources. We have to be able to review these things before releasing them. We can’t just let them out, that’s crazy.”

