Sentiment is growing that it is high time tech companies footed the bill for the free feast of training data that has fueled the growth and strength of their generative artificial intelligence (AI) systems. These systems rely on vast amounts of training data, including text, images, and other media, to learn to generate human-like responses and create other content autonomously.
This situation has led to twelve large legal actions in recent years, with content creators suing the companies that use their copyrighted data. One notable recent lawsuit was filed by The New York Times, which is suing OpenAI over the use of its content after negotiations broke down. OpenAI, once a non-profit entity, now charges users for its generative AI services, making it essentially a commercial company. It is only reasonable that creators such as The New York Times receive compensation for their contributions to the training data.
Another lawsuit against OpenAI was filed by two US authors, who claim that OpenAI used almost 300,000 books from “shadow libraries”: illegal online platforms that offer copyrighted material without authorization or payment.
Stability AI has also been involved in several lawsuits, including one brought by Getty Images, which is suing the company for using millions of its photos without permission or compensation. Together, these cases drive home creators’ message that their content should no longer be used for free.
The sued companies are relying on the legal doctrine of “fair use”. The law distinguishes between outright copying someone else’s work and using it as inspiration for new work. A use that qualifies as fair use is not an infringement: the doctrine permits limited use of copyrighted material without the creator’s permission. OpenAI has invoked this argument in its defense against the Times, asserting that “the use of copyrighted materials by technology innovators in transformative ways is entirely consistent with copyright law”, as the company wrote in a filing to the U.S. Copyright Office.
Four factors determine whether a use qualifies as fair use. One of them is the purpose and character of the use, which asks whether the use is transformative, meaning it adds “new expression, meaning, or message” to the original work. The use by generative AI, however, is not transformative when it reproduces significant portions of other creators’ work.
Another factor is the effect of the use on the market for the copyrighted work and its value. If the use harms that market and value, this weighs against fair use. The analysis also distinguishes between commercial use and nonprofit purposes. Tech companies are trying to capitalize on the significant investments made in journalism without seeking permission or paying for the content. Take The New York Times as an example: much of the revenue in journalism today comes from subscriptions and advertising. If fewer people visit the Times because its content can be retrieved through large language models such as ChatGPT, it loses those earnings. This clearly harms the market for the Times, which weighs against fair use.
Moreover, the loss of these earnings will cause the quality of content to suffer, possibly dampening creativity overall. That, in turn, means less training data for generative AI models. Compensation is therefore not only beneficial to the creators of the data; tech companies have a long-term interest in paying the industries whose content they use.
The second major argument advanced by tech companies is that compensating creators would hamper the development and improvement of existing large language models, as well as the creation and application of more specialized machine learning models. They argue that, should the plaintiffs succeed, the only generative AI systems deemed legal in the United States (where almost all of these lawsuits take place) would be those trained on public domain works or under proper licenses. That outcome would have implications for every entity that uses generative AI, integrates it into products, or applies it in scientific research.
While compensation could indeed slow the development and enhancement of generative AI, it is not an insurmountable obstacle. Policies could exempt scientific research from compensation requirements, and the same could apply to hobby projects or small companies that might otherwise be priced out of development.
A policy could take the form of revenue sharing or royalty payments. This would also give content producers a compelling incentive to contribute their works as training data, since they would directly benefit from the financial success of the AI systems built on their creative output. Such a policy could be a proportional agreement in which each content creator receives a share of the revenue corresponding to their contribution to the training data; a simple version of this scheme is sketched below. The precise details matter less than the point they illustrate: viable compensation options exist for both parties.
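To make the proportional idea concrete, here is a minimal sketch of how such a split might be computed. It assumes, purely for illustration, that contributions are measured in training tokens and that a fixed share of revenue is set aside as a royalty pool; the function name, outlet names, and all figures are hypothetical, not a real licensing scheme.

```python
# Hypothetical sketch of a proportional royalty split: each creator is paid
# in proportion to how much of the training data they contributed.
# All names and numbers are illustrative assumptions, not real figures.

def split_royalties(contributions: dict[str, int],
                    royalty_pool: float) -> dict[str, float]:
    """Divide a royalty pool among creators, proportional to the share of
    training tokens each one contributed."""
    total_tokens = sum(contributions.values())
    return {
        creator: royalty_pool * tokens / total_tokens
        for creator, tokens in contributions.items()
    }

if __name__ == "__main__":
    # Assumed token counts per contributing outlet (made up for the example).
    contributions = {
        "newspaper_a": 12_000_000,
        "news_wire_b": 6_000_000,
        "book_publisher_c": 2_000_000,
    }
    # Assume 5% of the model's annual revenue is set aside as a royalty pool.
    royalty_pool = 0.05 * 100_000_000  # $5M pool on $100M revenue

    for creator, payout in split_royalties(contributions, royalty_pool).items():
        print(f"{creator}: ${payout:,.2f}")
```

In practice the contribution metric itself would be contested (tokens, documents, or influence on model outputs), which is exactly the kind of detail such agreements would need to settle.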
Some media enterprises have already reached agreements on the use of their content. OpenAI recently struck a deal to compensate the German media conglomerate Axel Springer, publisher of Business Insider and Politico, for including excerpts from its articles in ChatGPT responses. The company has also entered into an agreement with the Associated Press, gaining access to the news service’s archival content. This shows that such agreements, beneficial to both parties, are indeed possible.
The mounting legal actions against tech companies underscore the importance of the debate over the use of copyrighted materials. Stricter laws might force tech companies to put more effort into training their generative AI models, but that burden does not outweigh the harm currently being done to creators. The recent deals between media enterprises and tech companies are promising examples that collaboration and fairness are achievable. It is becoming increasingly evident that fair compensation for creators is not only justifiable but necessary to maintain a balance between all stakeholders and the continued evolution of generative AI.