Exploring large language models' biases in historical knowledge
(This article was originally posted on my personal blog)
Large language models (LLMs) such as ChatGPT are being increasingly used in educational and professional settings. It is important to understand and study the many biases present in such models before integrating them into existing applications and our daily lives.
One of the biases I studied in my previous article concerned historical events. I probed LLMs to understand what historical knowledge they encoded in the form of major events, and I found a serious Western bias in which events they treated as major.
In a similar vein, in this article I probe language models about their understanding of important historical figures. I asked two LLMs who the most important figures in history were, repeating the question 10 times in each of 10 different languages. Some names, like Gandhi and Jesus, appeared extremely frequently; others, like Marie Curie or Cleopatra, appeared far less often. And compared to the number of male names the models generated, female names were vanishingly rare.
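To make the setup concrete, a probing loop along these lines might look like the sketch below. It is a minimal sketch, assuming the current `openai` and `anthropic` Python SDKs; the model identifiers, prompt wording, and the set of translated prompts are illustrative placeholders rather than the exact ones I used.

```python
# Sketch of the probing setup: ask each model, in each language, for the most
# important historical figures, and repeat the question several times.
# Model names and prompt wording below are placeholders, not the exact ones used.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

# One translated version of the same question per language (placeholders).
PROMPTS = {
    "English": "Who are the most important figures in history? List ten names.",
    # ... the same question translated into the other nine languages
}
N_REPEATS = 10

def ask_gpt4(prompt: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def ask_claude(prompt: str) -> str:
    response = anthropic_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Collect raw answers as (model, language, repeat index, answer) tuples,
# which can then be parsed into names and tallied by gender and region.
results = []
for language, prompt in PROMPTS.items():
    for i in range(N_REPEATS):
        results.append(("gpt-4", language, i, ask_gpt4(prompt)))
        results.append(("claude", language, i, ask_claude(prompt)))
```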
The biggest question I had was: Where were all the women?
Continuing the theme of evaluating historical biases encoded by language models, I probed OpenAI's GPT-4 and Anthropic's Claude about major historical figures. In this article, I show that both models exhibit:
- Gender bias: Both models disproportionately predict male historical figures. GPT-4 generated the names of female historical figures 5.4% of the time and Claude did so 1.8% of the time. This pattern held across all 10 languages.
- Geographic bias: Regardless of the language the model was prompted in, there was a bias towards predicting Western historical figures. GPT-4 generated historical figures from Europe 60% of the time and Claude did so 52% of the time.
- Language bias: Some languages suffered more from gender or geographic bias than others. For example, when prompted in Russian, both GPT-4 and Claude generated zero women across all of my experiments. Additionally, the quality of the generated text was lower in some languages. For example, when…