One of the main controversies around artificial intelligence is the alleged violation of copyright and open source licenses by companies that take large amounts of material from all kinds of sites to train their models. Given how controversial the issue is becoming, Meta (formerly Facebook) eventually confirmed that it used the Books3 dataset, which allegedly contains pirated material, to train its AI models.
Going deeper into the details of the controversy, Meta acknowledged that it used Books3 to train its Llama 1 and Llama 2 large language models (LLMs). Books3 is a plain-text collection of over 195,000 books that takes up approximately 37 gigabytes in total. It was created by a researcher named Shawn Presser to serve as source material for improving machine learning algorithms.
More specifically, Meta trained its LLMs on copyrighted material for which it has neither paid nor sought permission. It is reminiscent of The New York Times' lawsuit against Microsoft and OpenAI, as the well-known American newspaper explained: “Through Microsoft’s Bing Chat (recently renamed Copilot) and OpenAI’s ChatGPT, the defendants seek to take advantage of The New York Times’ enormous investment in journalism by using it to build substitute products without permission or payment.” The newspaper also objects to the use of unpaid Times content to create products that replace the Times and take away its audience.
Meta's acknowledgment that it trained its LLMs with allegedly pirated material is not a fit of sincerity. It comes in response to a lawsuit that a group of authors filed against the giant behind Facebook, WhatsApp and Instagram. The company addressed some of these authors, including Sarah Silverman and Richard Kadrey, to acknowledge the facts.

Just because content is available to the public doesn’t mean it’s in the public domain or that anyone can use it under any conditions, or even that it’s legal, given how widespread piracy is on the internet, although that last point leads to a more complicated debate than it seems. With artificial intelligence, however, we are not talking about how an individual spends free time in the privacy of their home, but about companies that directly or indirectly make money by downloading copyrighted material without permission or by violating open source licenses.
Did we say open source before? Yes, because Microsoft, GitHub and OpenAI were sued in the fall of 2022 for license and copyright infringement over GitHub Copilot. According to Matthew Butterick, who is behind the lawsuit, the three companies violated a total of eleven open source licenses, including the MIT, GPL, and Apache 2 licenses, which require attribution and preservation of the copyright notice.
We’ll see how these lawsuits, and the debate surrounding the use of copyrighted material to train AI models, play out. Leaving aside the regulations that the European Union enforces, it is possible that much of the world will take the decisions of the United States Supreme Court as the basis for its own actions regarding artificial intelligence.