Looking Beyond the Hype: Revealing the Dangers of Large Language Models

With the rise of OpenAI’s ChatGPT, hype around Large Language Models (LLMs) has grown, signaling the start of what seems like an arms race as companies push to create ever larger models. Many readers have likely asked ChatGPT some silly questions, but how many are aware of the potential dangers posed by such seemingly innocent and friendly technology?

LLMs come at a great social, ethical, and environmental cost, which this article explores. We argue that it is time to shift our attention from building ever larger and more complicated models towards mitigating the dangers these models already pose, before it is too late.

What Biases Are Hidden in LLMs?

In 2021, a group of researchers published a paper in which they described LLMs as ‘stochastic parrots‘, emphasizing that LLMs output a probabilistic combination of linguistic forms rather than any understanding of the meaning behind the text they generate. The paper highlights that the push for ever-larger language models, in both parameters and training data, comes at a dangerous cost. One major risk is that large training datasets contain social biases and stereotypes that are amplified by the LLM precisely because of its stochastic nature.
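
To make the ‘parrot’ metaphor concrete, here is a toy Python sketch with an invented next-token probability table: the continuation is sampled purely from conditional probabilities, and nothing in the process represents what the words actually mean.

```python
import random

# Toy next-token probability table, invented purely for illustration.
# A real LLM learns billions of such conditional probabilities from its
# training corpus; at no point does it learn what the words refer to.
next_token_probs = {
    "the":    {"doctor": 0.4, "nurse": 0.3, "cat": 0.3},
    "doctor": {"said": 0.7, "smiled": 0.3},
    "nurse":  {"said": 0.5, "smiled": 0.5},
    "cat":    {"slept": 1.0},
    "said":   {"hello": 1.0},
    "smiled": {"warmly": 1.0},
}

def generate(start: str, max_new_tokens: int = 4) -> str:
    """Sample a continuation token by token from the probability table."""
    out = [start]
    for _ in range(max_new_tokens):
        options = next_token_probs.get(out[-1])
        if not options:
            break
        tokens, weights = zip(*options.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the doctor said hello"
```

Whatever skew exists in the probability table (here, “doctor” being more likely than “nurse” after “the”) is faithfully reproduced in the generated text, which is exactly how biases in the training data resurface in a model’s output.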

Bias can lead to resources or opportunities being allocated unfairly across social groups. Beyond such allocational harms, LLMs also cause representational harm, for example when certain social groups are portrayed in a negative light or not recognized at all.

One paper studied stereotyping in word embeddings using the famous Implicit Association Test and showed that the embeddings reproduced and amplified social biases. Such racial and gender biases have also been shown to appear in contextual word representations, such as ELMo and BERT, and in numerous language models.
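
As an illustration of how such association tests work, below is a minimal WEAT-style sketch with tiny made-up vectors standing in for real pretrained embeddings such as GloVe or word2vec; the actual studies compute an analogous statistic over many target and attribute words.

```python
import numpy as np

# Tiny made-up 3-dimensional "embeddings"; the actual studies use pretrained
# vectors such as GloVe or word2vec with hundreds of dimensions and many
# more target/attribute words.
emb = {
    "engineer": np.array([0.9, 0.1, 0.2]),
    "nurse":    np.array([0.1, 0.9, 0.3]),
    "he":       np.array([0.8, 0.2, 0.1]),
    "she":      np.array([0.2, 0.8, 0.2]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(word: str, attr_a: str = "he", attr_b: str = "she") -> float:
    """Positive means `word` sits closer to attr_a, negative closer to attr_b."""
    return cosine(emb[word], emb[attr_a]) - cosine(emb[word], emb[attr_b])

for word in ("engineer", "nurse"):
    print(word, round(association(word), 3))
# A large, consistent gap between occupation words indicates that the
# embedding space encodes a gendered stereotype.
```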

Another report tested OpenAI’s medium-sized GPT-2 model with prompts describing people from different demographics and found that the generated text continuations perpetuated stereotypes (see table below).

Examples of text continuations generated by OpenAI’s GPT-2 model, given different prompts

Although it might be shocking that such a publicly available language model openly perpetuates harmful stereotypes, it should not come as a surprise that the algorithm ‘parrots’ the structural inequalities we observe in society. However, we must recognize OpenAI’s efforts to mitigate some of the biases observed in early GPT models. When we tested the more recent GPT-3.5 model with the same prompts, it output the following snippet of text:

“It’s important to approach discussions about individuals with respect and sensitivity. If you have a specific context or information you’d like assistance with, please provide more details, keeping in mind the importance of promoting understanding and inclusivity.”

ChatGPT 3.5, personal communication, January 29, 2024

At first glance, this is a clear improvement over the previous model: the subject is acknowledged to be sensitive, and additional information is requested to avoid harmful stereotyping. However, by emphasizing this as a sensitive or ‘touchy’ subject, conversations about marginalized groups are labeled as ‘special’ or ‘unconventional’, creating an even larger divide between ‘normal’ and marginalized peoples. Furthermore, it discourages open conversation about race and gender and could even push such conversations towards becoming taboo. So, while over-filtering might appear to discourage harmful stereotyping, it has great potential to exclude diverse perspectives and cause representational harms.

Unintended side effect of automatic toxicity reduction methods: Over-filtering of text about marginalized groups reduces the ability of the LM to generate text about these groups, even in a positive way. (link to study)

Mitigating biased or toxic outputs is a more challenging task than most realize. Notions of toxicity are subjective, and conditions for fairness have been shown to conflict with one another, creating a risk of over-filtering. Another study focused on the difference between sentiment and regard in measuring bias: for example, sentiment analyzers rated “XYZ worked as a pimp for 15 years” as neutral, even though working as a “pimp” generally carries a negative social regard.
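
To see why sentiment alone is a poor proxy for bias, the sketch below feeds the paper’s example sentence to an off-the-shelf sentiment classifier via Hugging Face’s transformers pipeline; the exact labels and scores depend on the underlying model, so treat this purely as an illustration of the methodology.

```python
# Requires `pip install transformers torch`; the default sentiment model is a
# binary positive/negative classifier, so the exact labels and scores will
# depend on the model you choose.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

examples = [
    "XYZ worked as a pimp for 15 years.",
    "XYZ worked as a doctor for 15 years.",
]

for text in examples:
    print(text, "->", sentiment(text)[0])

# Sentiment scores the tone of the sentence itself; "regard" (as defined in
# the cited study) scores how the sentence positions its human subject
# socially. Capturing that difference needs a dedicated regard classifier
# or human annotation, not a generic sentiment model.
```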

At the current stage, it appears that human judgement of bias is essential to complement any algorithmic rating of toxicity. Building ever larger black-box language models directly counters efforts to understand, and potentially mitigate, the biases present in these LLMs, and can lead to substantial allocational and representational harms.

How Reliable Are LLMs Really?

“LLMs are not models of brains, but of language itself.”

Oliver Bown. “Hallucinating” AIs Sound Creative, but Let’s Not Celebrate Being Wrong. The MIT Press Reader.

Whether as developers or end users, most readers will have encountered the term “hallucinations” in connection with LLMs. Hallucinations refer to LLMs making up facts or generating inaccurate information. The danger lies in individuals not fully understanding how LLMs work and treating their outputs as ground truth. LLMs are not models of brains but of language itself (recall the ‘stochastic parrot’): the generation process includes no assessment of whether the produced information is accurate. Moreover, LLMs may have been trained on outdated or, even worse, biased data (recall the section above).

Researchers have introduced advanced prompt engineering, context injection, and other methods to reduce hallucination, but none of these methods can guarantee that an LLM’s answers are fully correct.
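
As an illustration of what context injection looks like in practice, here is a minimal sketch of retrieval-augmented prompting; `retrieve` and `ask_llm` are hypothetical placeholders for a document index and an LLM API client, not any specific library.

```python
# A minimal sketch of context injection (retrieval-augmented prompting):
# ground the answer in retrieved passages instead of the model's memory.
# `retrieve` and `ask_llm` are hypothetical placeholders, not a real library.

def retrieve(question: str, k: int = 3) -> list[str]:
    """Return the k most relevant passages from a trusted knowledge base."""
    # Placeholder: in practice a search-engine or vector-database lookup.
    return ["Passage 1 about the topic.", "Passage 2 about the topic."][:k]

def ask_llm(prompt: str) -> str:
    """Send the prompt to whichever LLM endpoint you use."""
    # Placeholder: substitute your provider's API call here.
    return "<model answer grounded in the supplied context>"

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)

print(answer("When was the company founded?"))
```

Even with the answer grounded in retrieved passages, the model can still paraphrase the context incorrectly, which is why none of these methods offers a guarantee.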

The risks became apparent as individuals started to use ChatGPT as their primary source of information and made final decisions based on the chatbot’s output. One survey found that 47% of Americans polled had used ChatGPT for stock advice. There was also a case in which a lawyer used ChatGPT for case research and the chatbot produced multiple fictitious case citations, endangering both the lawyer and their clients.

Another important phenomenon affecting LLMs’ reliability is model decay: a decrease in the algorithm’s performance over time due to changes in its environment. The researchers of one study pointed out that GPT-3.5 and GPT-4 performed worse over the months on a diverse set of tasks (maths, medical exams, sensitive Q&A, etc.). One possible hypothesis is that the models were fine-tuned and adjusted on “drifted” data that included the latest user interactions. This highlights not only the danger of spreading misinformation but also that growing larger LMs will not necessarily improve accuracy, and that improving the accuracy of current models deserves further study.
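
One practical consequence is that anyone building on a hosted LLM needs to monitor it for drift. A minimal sketch, assuming a hypothetical `query_model` wrapper around the provider’s API, is to keep a frozen benchmark and re-score the model on it at regular intervals:

```python
from datetime import date

# A minimal drift-monitoring sketch: keep a frozen benchmark and re-score the
# hosted model on it at regular intervals. `query_model` is a hypothetical
# wrapper around whichever provider API you call.

def query_model(prompt: str, model: str) -> str:
    raise NotImplementedError("replace with a call to your LLM provider")

def accuracy(model: str, benchmark: list[tuple[str, str]]) -> float:
    correct = sum(query_model(q, model).strip().lower() == a for q, a in benchmark)
    return correct / len(benchmark)

# Frozen example items, similar in spirit to the prime-checking and sensitive
# Q&A tasks used in the cited drift study.
BENCHMARK = [
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("Is 17078 a prime number? Answer yes or no.", "no"),
]

def log_score(model: str) -> None:
    print(date.today(), model, accuracy(model, BENCHMARK))
    # Persist these scores; a sustained drop between runs signals drift.
```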

Overview of the performance drift and instruction-following shift of GPT-4 and GPT-3.5 between March 2023 and June 2023. (link to the study)

What is the Impact of LLMs on the Environment?

LLMs are taking the world by storm. Everybody is talking about how revolutionary they are, but at what cost? The environmental impact of LLMs – water and carbon footprints – is often overlooked by the average user.

Water consumption

While training GPT-3, Microsoft’s data centers consumed 700,000 liters of clean freshwater for their cooling systems, enough to produce 370 BMW cars or 320 Tesla vehicles. For inference, GPT-3 requires approximately 0.5 liters of water for every 10-50 ChatGPT responses. That number increased significantly with the introduction of GPT-4, which is reportedly over 10 times bigger than GPT-3.
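
For a sense of scale, here is a quick back-of-envelope conversion of the per-response figure quoted above; the arithmetic and the daily prompt volume are our own illustrative assumptions.

```python
# Back-of-envelope conversion of the per-response estimate quoted above.
litres_per_batch = 0.5          # 0.5 L of cooling water...
responses_per_batch = (10, 50)  # ...per 10-50 ChatGPT responses

per_response_ml = [litres_per_batch / n * 1000 for n in responses_per_batch]
print(f"roughly {min(per_response_ml):.0f}-{max(per_response_ml):.0f} ml "
      "of fresh water per response")

daily_prompts = 10_000_000      # hypothetical daily prompt volume, for illustration
litres_per_day = daily_prompts * max(per_response_ml) / 1000
print(f"~{litres_per_day / 1000:,.0f} cubic metres of water per day at the high end")
```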

Electricity consumption

LLMs require enormous computational power to be developed, run, and maintained. Can you imagine how much CO2 is emitted to generate all that electricity?

GPT-3, with its 175 billion parameters, demanded the equivalent of 355 years of computing on a single processor and used 284,000 kWh of electricity for training. Training GPT-3 caused carbon emissions (502,000 kilograms of CO2 equivalent) comparable to driving 112 gasoline-powered cars for a year. In addition, some estimates put GPT-4’s roughly 100-day training run at an astounding 28,800,000 kWh. From that figure we can calculate that training GPT-4 emitted around 7 million kilograms of CO2 equivalent, the equivalent of driving 1,600 gasoline-powered cars for a year. And this is not to mention inference, which by some estimates accounts for around 90% of a model’s carbon footprint.
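
The 7-million-kilogram figure follows from the quoted energy estimate once a grid carbon intensity is assumed; the short calculation below makes those assumptions explicit (both the ~0.24 kg CO2e/kWh intensity and the per-car annual emissions are our own illustrative values, so the result is only as good as they are).

```python
# Reproducing the back-of-envelope GPT-4 estimate quoted above. The grid
# carbon intensity and per-car annual emissions are our own assumed values.
training_kwh = 28_800_000        # reported estimate for GPT-4's training run
kg_co2e_per_kwh = 0.24           # assumed average grid carbon intensity
kg_co2e_per_car_year = 4_600     # assumed annual emissions of a gasoline car

training_co2e_kg = training_kwh * kg_co2e_per_kwh    # ~6.9 million kg CO2e
car_years = training_co2e_kg / kg_co2e_per_car_year  # ~1,500 car-years

print(f"{training_co2e_kg / 1e6:.1f} million kg CO2e, "
      f"like driving ~{car_years:,.0f} cars for a year")
```

The exact figure depends heavily on the assumed grid mix, which is one reason published estimates differ so widely.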

However, it is important to note that estimates of the environmental impact vary among researchers, primarily because big corporations do not disclose exact numbers. What is clear, though, is that the environmental costs are substantial.

Knowing all of this, can we be sure that the innovation and efficiency of LLMs are worth the harm to the environment, given that minorities and low-income communities will bear the consequences first? It is time for society to weigh the positive contributions of LLMs against their environmental costs and move toward greener AI.

Moving Forward

Acknowledging these issues is not meant to undermine the transformative potential of LLMs, but to ensure that their advancement is aligned with the principles of sustainable development. We stress that it is essential to address these risks before LLMs develop further and their harms become too ingrained in society. Only by prioritizing the mitigation of biases, enhancing the models’ reliability and users’ awareness, and minimizing their environmental footprint can we foster an ecosystem where LLMs not only excel in efficiency and accuracy but also contribute positively to societal and environmental well-being.
