Recent years have seen an unexpectedly rapid rise in AI tools and technologies. It was only in 2022 that OpenAI unveiled DALL-E 2, an image generator that garnered attention from the popular press and marked the beginning of the spread of Artificial Intelligence tools to the masses. Since then, other AI-assisted image generators such as Midjourney and Stable Diffusion have sparked controversy in the creative industry, and they were soon followed by the release of OpenAI’s AI-powered chatbot, ChatGPT.
However, this rapid development and widespread adoption raise complex ethical concerns. A central issue is the use of publicly accessible works, including artistic creations, literary texts, and code repositories, by these AI technologies without explicit permission. While these advancements herald a new era of efficiency and creativity, they also pose significant questions about intellectual property rights, authorship, and the economic implications for creators whose works are used as training data. The balance between innovation and ethical responsibility becomes increasingly challenging to navigate as AI continues to evolve at an unprecedented pace. We find it imperative that a framework and guidelines be developed to ensure the responsible and ethical utilization of these powerful tools.
Although AI technologies are developed for a variety of different purposes, they have one fundamental factor in common: they need a lot of data. Take OpenAI’s DALL-E family as a prime example. The original DALL-E is a multimodal implementation of GPT-3 with 12 billion parameters, trained on text–image pairs from the Internet. It was developed alongside CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million pairs of images with text captions scraped from the Internet. Its successor, DALL-E 2, is estimated to have needed hundreds of millions of images to reach high quality. In this scenario, the data requirement is largely due to the complexity of having the model learn the semantic relationship between images and the words used to describe them. The more image–text pairs the model is trained on, the better it is at producing results that are closely aligned with the user’s needs. However, the need for a large amount of data doesn’t only apply to image generators, as a large corpus of text is needed to train tools such as OpenAI’s ChatGPT and GitHub Copilot. For example, estimates of the text data needed for ChatGPT range from approximately 570 GB to 45 terabytes.
Datasets provide the foundational knowledge that models need to develop and make accurate predictions. The capacity of a model to learn and generalize to new data significantly increases with the volume of data it has access to. This is particularly crucial for complex problems requiring high accuracy and precision. Larger datasets also improve the model’s robustness and generalizability by containing a diverse range of examples.
This reliance on extensive data collection presents ethical dilemmas related to the massive gathering of various works to train these AI systems. More specifically, is it ethical and justifiable for companies to utilize the works of artists, writers, and programmers to train these AI systems? For AI technologies to be effective and mimic human cognition accurately, they require massive datasets that often include copyrighted works, personal information, and sensitive data. The role that these datasets play in deploying a high-quality AI system is undoubtedly vital: without such copious amounts of data, these AI tools would not have advanced as quickly as they have. That same dependence has inspired discussion throughout various industries, since the use of these technologies in both personal and professional settings remains controversial. Much of the controversy behind technologies like ChatGPT and Midjourney stems from the fact that they utilised large datasets consisting largely of copyrighted work to produce good results. Yet, given how widely accessible and useful these tools are for a wide variety of people, this raises the question of whether utilising such technologies allows and even promotes the infringement of many copyrighted works.
Data collection is vital for these generative AI tools to understand the intent of the user by finding semantic relationships between text and images. For these systems to understand the specifics of a particular request, such as writing in the style of a particular author or creating an image in the style of a specific artist, artists’ works must be collected during scraping so that the systems can be fine-tuned to fully understand what makes each artist distinct. Thus, many companies that specialise in developing generative AI systems tend to scrape a variety of copyrighted works, almost always without the consent of the artists or writers.
Generative AI technologies, such as DALL-E and Stable Diffusion, bring new business risks including misinformation, plagiarism, copyright infringement, and harmful content. Because these tools are trained on massive databases of images and texts and often obscure the source of that data, they pose reputational and financial risks. For example, the outrage sparked by an AI-generated piece winning a digital art competition highlights the ethical concerns within the art community, as such recognition can be seen as undermining the skill and effort of human artists. Stable Diffusion alone uses over 5 billion publicly available images, which raises questions about what these datasets contain and whether the resulting programs infringe on existing works.
The accessibility of these AI systems to enterprises and individuals is partly due to practices like data scraping, which spare developers the cost of licensing their training data. Running these systems is already expensive: SemiAnalysis estimates the operational costs of OpenAI’s ChatGPT at around $700,000 per day, illustrating the significant resources required to maintain such technologies. This tension between innovation and ethical considerations continues to shape the evolution and impact of generative AI in various sectors.
Requiring companies to seek permission from artists to use their artwork in AI model training presents significant challenges. The process of acquiring licenses for artworks involves complex legalities, potentially prolonging the data collection phase and increasing costs for both companies and artists. Artists, in turn, would likely seek compensation for their work, adding to the financial burden. Moreover, even with transparent and ethical practices, uncertainty remains about artists’ willingness to participate, driven by concerns over the potential devaluation of their future works. The impracticality of contacting every artist, writer, and programmer worldwide exacerbates the issue, as it would demand considerable time and resources. While it’s unjust for AI companies to use creative works without compensation or acknowledgement, the economic and logistical constraints of developing and maintaining AI systems make obtaining permission at scale a challenging endeavour.
However, there are some cases where more ethical approaches to the use of public works exist. For example, GitHub Copilot’s use of public repositories to train its AI model is arguably justifiable, since the model was trained entirely on code in the programming languages that appear in public GitHub repositories. A public repository hosted on GitHub is most likely open source and, depending on the licence, is open to a variety of uses. This gives GitHub a library of public works to train its models on. The approach works because Copilot’s focus is much narrower than that of other AI technologies, and it does not carry over to all AI companies. ChatGPT, by contrast, is designed to be an all-in-one LLM with vast knowledge about the world, so it has to know and understand the works of many visual and literary artists. When asked to write in the style of a particular author, it needs to be able to semantically relate the request to the specific author mentioned in the prompt. For this, it needs to generalise a representation of that author’s style of writing, which requires massive amounts of training on that particular author’s work.
The rapid advancement of AI, especially in generative models, has demonstrated significant benefits and potential. This progress, partly due to the use of publicly available data, has made AI more accessible, yet it raises important concerns. Using creative works without creators’ consent not only poses legal issues of copyright infringement but also calls into question the moral rights of creators and the ethical utilization of their intellectual property. In the future of AI, prioritizing ethical practices is essential. This includes seeking explicit permission for data use, establishing fair compensation models, and focusing on works marked for free use. Ethical principles such as fairness and the mitigation of bias, transparency, human oversight, privacy, performance safety, security, and sustainability should guide AI development. Efforts to address these issues should involve recognizing sources, encouraging collaboration between developers and content creators, and advocating for ethical practices. While there is no definitive answer to what constitutes an ‘ethical’ approach, ongoing dialogue among stakeholders, including developers, users, policymakers, and the public, is key. This collaboration is vital to ensuring AI development aligns with collective values and legal standards.