Seeds of Stereotypes: A Large-Scale Textual Analysis of Race and Gender Associations with Diseases in Online Sources
Lasse Hyldig Hansen (1,2), Nikolaj Andersen (1,2), Jack Gallifant (2,3), Liam G. McCoy (4), James K. Stone (5), Nura Izath (6), Marcela Aguirre-Jerez (7), Danielle S. Bitterman (8,9,10), Judy Gichoya (11), Leo Anthony Celi (2,12,13)
1. Cognitive Science, Aarhus University
2. Laboratory for Computational Physiology, MIT
3. Department of Critical Care, Guy’s & St Thomas’ NHS Trust
4. Division of Neurology, University of Alberta
5. University of Manitoba Max Rady College of Medicine
6. Faculty of Computing, Mbarara University of Science and Technology
7. Digital Health Department, Fundacion Arturo Lopez Perez
8. Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School
9. Department of Radiation Oncology, Brigham and Women’s Hospital/Dana-Farber Cancer Institute
10. Computational Health Informatics Program, Boston Children’s Hospital, Harvard Medical School
11. Department of Radiology, Emory University School of Medicine
12. Division of Pulmonary, Critical Care, and Sleep Medicine, Beth Israel Deaconess Medical Center
13. Department of Biostatistics, Harvard T.H. Chan School of Public Health
This study explores how Large Language Models (LLMs) used in healthcare can exhibit biases related to race and gender. By analyzing a vast amount of text from sources such as arXiv, Wikipedia, and Common Crawl, we quantify how often diseases are discussed alongside race and gender markers. Our goal was to identify potential biases that LLMs might learn from these texts.
The results revealed that gender terms are often linked to disease concepts, while racial terms are associated less frequently. We found significant disparities, with mentions of Black race being overrepresented relative to population proportions. These findings emphasize the importance of examining and addressing biases in LLM training data, especially in healthcare, to develop fairer and more accurate models.
The "Seeds of Stereotypes" study investigates how Large Language Models (LLMs) used in healthcare might perpetuate biases related to race and gender. By analyzing a vast amount of text from diverse sources such as Arxiv, Wikipedia, and Common Crawl, researchers examined the contexts in which diseases are discussed alongside racial and gender markers. This exploration is crucial as it highlights potential biases that LLMs could learn from these texts, which may impact their applications in sensitive domains like healthcare.
Workflow diagram illustrating the process for analyzing race and gender co-occurrences with disease terms within online texts.
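To make the pipeline concrete, here is a minimal sketch of this kind of windowed co-occurrence count. The 100-word window matches the study's setup; the abbreviated term lists, tokenization, and function names are our own simplifications for illustration, not the study's full keyword sets or exact code.

# Minimal sketch of a windowed co-occurrence count (illustrative only).
# The 100-word window matches the study's setup; the term lists below
# are abbreviated stand-ins, not the study's full keyword sets.
import re
from collections import Counter

DISEASE_TERMS = {"asthma", "diabetes", "hypertension"}   # assumed subset
DEMOGRAPHIC_TERMS = {                                    # assumed subset
    "male": {"man", "men", "male"},
    "female": {"woman", "women", "female"},
    "white": {"white", "caucasian"},
    "black": {"black"},
    "asian": {"asian"},
    "hispanic": {"hispanic", "latino", "latina"},
}
WINDOW = 100  # words on either side of a disease mention

def count_cooccurrences(text: str) -> Counter:
    """Count (disease, group) pairs where a demographic term falls
    within WINDOW words of a disease mention; mentions with no such
    term are tallied under 'no_demographic'."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word not in DISEASE_TERMS:
            continue
        context = set(words[max(0, i - WINDOW): i + WINDOW + 1])
        matched = False
        for group, terms in DEMOGRAPHIC_TERMS.items():
            if context & terms:
                counts[(word, group)] += 1
                matched = True
        if not matched:
            counts[(word, "no_demographic")] += 1
    return counts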
Analyzing Disease Associations
The study found that gender terms are frequently associated with disease concepts, while racial terms appear less often. Notably, there were significant disparities, with Black race mentions being overrepresented compared to population proportions. These results underscore the importance of critically examining and addressing biases in LLM training data to develop fairer and more accurate models.
Proportional Disease Mentions with Demographic References within a 100-word Contextual Window. Panel A shows the gender-associated mentions of various diseases, with Panel B detailing the mentions in connection with different races. In both panels, yellow bars indicate the proportion of disease mentions occurring without any specific demographic context.
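The yellow "no demographic" bars in the figure correspond to a normalization step: each disease's mentions are split across demographic groups plus an unattributed remainder. A small helper continuing the sketch above (the function name and counting convention are ours):

def mention_proportions(counts: Counter, disease: str) -> dict:
    """Share of a disease's mentions attributed to each group,
    including the 'no_demographic' remainder. A mention with several
    groups in its window is counted once per group, and shares are
    normalized over all counted (disease, group) pairs."""
    by_group = {g: n for (d, g), n in counts.items() if d == disease}
    total = sum(by_group.values())
    return {g: n / total for g, n in by_group.items()} if total else {}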
Model Predictions vs. Real-World Data
Further analysis compared the disease mentions in the training data with real-world prevalence and GPT-4 outputs. The results revealed a mismatch between model predictions and real-world data, suggesting a lack of real-world grounding in these models. For example, Black race mentions are significantly overrepresented in the training data compared to actual prevalence rates, indicating potential bias in how these models learn associations.
Comparison of Disease Mentions by Race Across GPT-4 Estimates, Real-World Prevalence, and Training Data. This figure contrasts the proportional estimates of disease mentions with demographic categorizations in GPT-4, actual prevalence rates, and occurrences in training data, confined to a 100-word context window. The comparison is limited to population health data for four racial categories: White, Black, Asian, and Hispanic. Side-by-side bar graphs facilitate direct visual comparison, illustrating the congruence or disparity between the textual focus on certain diseases and their real-world demographic prevalence.
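One simple way to quantify the mismatch the figure shows is a per-group representation ratio: the group's share of a disease's mentions in the training data divided by its share of real-world prevalence. A sketch with made-up numbers, for illustration only (not the paper's figures):

def representation_ratio(text_shares: dict, prevalence_shares: dict) -> dict:
    """Ratio > 1 means a group is overrepresented in text relative to
    its share of real-world disease prevalence."""
    return {g: text_shares.get(g, 0.0) / p
            for g, p in prevalence_shares.items() if p > 0}

# Hypothetical shares for one disease, for illustration only:
text = {"white": 0.40, "black": 0.45, "asian": 0.08, "hispanic": 0.07}
real = {"white": 0.60, "black": 0.15, "asian": 0.10, "hispanic": 0.15}
print(representation_ratio(text, real))
# e.g. {'white': 0.67, 'black': 3.0, ...}: Black mentions outpace prevalence 3x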
Exploring Solutions and Strategies
The project not only highlights these issues but also explores strategies to mitigate them, including an examination of different alignment strategies and their effectiveness in improving model accuracy and fairness across diverse demographic groups. These efforts are crucial to ensuring that LLMs provide equitable and unbiased information, fostering better healthcare outcomes.
The "Seeds of Stereotypes" study is a step towards understanding and addressing the biases inherent in LLMs, aiming to bridge the gap between model perceptions and real-world data. For more details and to explore our findings further, visit our project site.
Our work builds upon insights into how technology can impact outcomes across subgroups:
Cross-Care: a new benchmark that evaluates how well language model outputs are grounded in the real-world prevalence of diseases across subgroups. We validate this method across different model architectures, sizes, and alignment strategies.
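A Cross-Care-style check can be sketched by scoring templated sentences with a causal language model and comparing the resulting ranking of groups to a prevalence-based ranking. The template wording, model choice, and scoring below are our own illustrative assumptions, not the benchmark's exact protocol:

# Sketch of a Cross-Care-style likelihood check (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the benchmark spans architectures and sizes
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # .loss is the mean negative log-likelihood per predicted token
    return -out.loss.item() * (ids.shape[1] - 1)

groups = ["White", "Black", "Asian", "Hispanic"]
scores = {g: sentence_logprob(f"{g} patients have hypertension.") for g in groups}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # compare against the ranking implied by real-world prevalence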
This article can be cited as follows:
Hyldig Hansen L, Andersen N, Gallifant J, McCoy LG, Stone JK, Izath N, Aguirre-Jerez M, Bitterman DS, Gichoya J, Celi LA. Seeds of Stereotypes: A Large-Scale Textual Analysis of Race and Gender Associations with Diseases in Online Sources. arXiv e-prints. 2024 May:arXiv-2405.
@article{hyldig2024seeds,
  title   = {Seeds of Stereotypes: A Large-Scale Textual Analysis of Race and Gender Associations with Diseases in Online Sources},
  author  = {Hyldig Hansen, Lasse and Andersen, Nikolaj and Gallifant, Jack and McCoy, Liam G and Stone, James K and Izath, Nura and Aguirre-Jerez, Marcela and Bitterman, Danielle S and Gichoya, Judy and Celi, Leo Anthony},
  journal = {arXiv e-prints},
  pages   = {arXiv--2405},
  year    = {2024}
}