Jack Gallifant (MIT)*, Shan Chen (Harvard; Mass General Brigham; Boston Children's Hospital)*, Pedro Moreira (MIT; Universitat Pompeu Fabra), Nikolaj Munch (MIT; Aarhus University), Mingye Gao (MIT), Jackson Pond (Harvard; Mass General Brigham), Hugo Aerts (Harvard; Mass General Brigham; Maastricht University), Leo Anthony Celi (MIT; Harvard; Beth Israel Deaconess Medical Center), Thomas Hartvigsen (MIT; University of Virginia), Danielle S. Bitterman (Harvard; Mass General Brigham; Boston Children's Hospital)†
* Co-first authors
† Corresponding author
This study introduces RABBITS (Robust Assessment of Biomedical Benchmarks Involving drug Term Substitutions for Language Models), a novel dataset and evaluation framework designed to test how robustly large language models (LLMs) handle drug name variations in medical contexts. The research reveals a consistent 1-10% performance drop when brand names are substituted for generic drug names in medical benchmarks, with open-source models proving more fragile than API-based models.
Our findings also point to contamination of test data in widely used pre-training datasets, which may contribute to this fragility. Notably, larger models, despite being more accurate on the original datasets, are more sensitive to drug name swaps. This fragility has significant implications for the use of LLMs in healthcare, where consistent performance across different expressions of a drug name is crucial for patient safety and effective medical communication. The study underscores the importance of robust evaluation frameworks and the need for improved training strategies to make LLMs more reliable in medical applications.
RABBITS: Revealing Language Models' Fragility to Drug Name Variations
RABBITS (Robust Assessment of Biomedical Benchmarks Involving drug Term Substitutions for Language Models) is a research initiative that scrutinizes large language models (LLMs) for healthcare use, focusing on how they handle drug name variations. The study reveals the surprising fragility of LLMs when brand and generic drug names are substituted in medical contexts.
Workflow diagram illustrating the RABBITS project's process for analyzing and addressing language model fragility to drug name variations in healthcare settings.
Investigating Model Performance with Drug Name Swaps
We began by examining the performance of various LLMs on the MedQA and MedMCQA medical benchmarks when generic drug names were replaced with their brand-name counterparts. This analysis revealed a consistent performance drop of 1-10% across models, with open-source models showing greater fragility than API-based models.
Visual representation of model performance changes when drug names are swapped, comparing accuracy on original datasets versus generic-to-brand name substitutions.
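To make the swap concrete, here is a minimal sketch of the generic-to-brand substitution applied to a multiple-choice question. The `GENERIC_TO_BRAND` mapping and `swap_drug_names` helper are hypothetical placeholders standing in for a curated brand-generic drug vocabulary; this illustrates the idea rather than reproducing the authors' exact pipeline.

```python
import re

# Illustrative entries only; the real mapping would come from a curated drug vocabulary.
GENERIC_TO_BRAND = {
    "acetaminophen": "Tylenol",
    "ibuprofen": "Advil",
    "atorvastatin": "Lipitor",
}

def swap_drug_names(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word generic drug names with their brand counterparts."""
    for generic, brand in mapping.items():
        # Case-insensitive, whole-word match so substrings are not replaced.
        text = re.sub(rf"\b{re.escape(generic)}\b", brand, text, flags=re.IGNORECASE)
    return text

question = {
    "question": "Which adverse effect is most associated with acetaminophen overdose?",
    "options": ["Hepatotoxicity", "Nephrotoxicity", "Ototoxicity", "Cardiotoxicity"],
}

swapped = {
    "question": swap_drug_names(question["question"], GENERIC_TO_BRAND),
    "options": [swap_drug_names(o, GENERIC_TO_BRAND) for o in question["options"]],
}
print(swapped["question"])
# -> "Which adverse effect is most associated with Tylenol overdose?"
```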
Dataset Contamination and Its Implications
Further investigation revealed significant contamination of test data in widely used pre-training datasets, specifically in the Dolma dataset. This contamination may contribute to the observed fragility, as models might be memorizing specific test examples rather than understanding the underlying medical concepts.
| Dataset | Percentage found in Dolma |
|---|---|
| MedQA Train | 86.92% |
| MedQA Val | 98.10% |
| MedQA Test | 99.21% |
| MedMCQA Train | 22.41% |
| MedMCQA Val/Test | 34.13% |
Table showing the percentage of MedQA and MedMCQA benchmark questions found in the Dolma pre-training dataset.
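As a rough illustration of how such overlap can be flagged, the sketch below intersects n-grams from benchmark questions with n-grams from a pre-training corpus shard. The `corpus_lines` and `benchmark_questions` placeholders and the 8-token window are assumptions for illustration; the actual Dolma analysis is more involved.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, corpus_ngrams: set[tuple[str, ...]], n: int = 8) -> bool:
    """Flag a question if any of its n-grams also appears in the corpus."""
    return bool(ngrams(question, n) & corpus_ngrams)

# Placeholder data standing in for Dolma shards and MedQA/MedMCQA items.
corpus_lines = ["a 45 year old man presents with fever and cough ..."]
benchmark_questions = ["A 45 year old man presents with fever and cough. What is the next step?"]

corpus_ngrams = set().union(*(ngrams(line) for line in corpus_lines))
flagged = [q for q in benchmark_questions if is_contaminated(q, corpus_ngrams)]
print(f"{len(flagged)} / {len(benchmark_questions)} questions flagged as overlapping")
```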
Implications for Healthcare Applications
The RABBITS study not only highlights these issues but also explores their implications for healthcare applications. The observed fragility poses significant challenges for the use of LLMs in medical contexts, where consistent performance across different drug name expressions is crucial for patient safety and effective communication.
RABBITS aims to improve the robustness and applicability of LLMs in healthcare by providing a new benchmark for evaluating model performance across drug name variations. Further details, tools, and interactive data visualizations are available on the project site.
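A hedged sketch of what such an evaluation might look like: score a model on the original and the drug-name-swapped question sets and report the accuracy drop. The `ask_model`, `accuracy`, and `robustness_report` names are hypothetical and not part of an official RABBITS API.

```python
def accuracy(items, ask_model) -> float:
    """Fraction of items where the model's chosen option matches the answer key."""
    correct = sum(1 for item in items if ask_model(item["question"], item["options"]) == item["answer"])
    return correct / len(items)

def robustness_report(original_items, swapped_items, ask_model) -> dict:
    """Compare accuracy on the original benchmark vs. the drug-name-swapped version."""
    acc_orig = accuracy(original_items, ask_model)
    acc_swap = accuracy(swapped_items, ask_model)
    return {
        "original_accuracy": acc_orig,
        "swapped_accuracy": acc_swap,
        "drop": acc_orig - acc_swap,  # RABBITS reports drops of roughly 1-10%
    }

# Toy demo with a dummy model that always picks the first option.
items_orig = [{"question": "Q1", "options": ["A", "B"], "answer": "A"}]
items_swap = [{"question": "Q1'", "options": ["A", "B"], "answer": "A"}]
print(robustness_report(items_orig, items_swap, lambda q, opts: opts[0]))
```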
Our work builds on insights into how technology can impact outcomes across subgroups. A related benchmark evaluates biases in LLMs for healthcare applications, highlighting how demographic biases in training datasets can skew LLM outputs and misrepresent disease prevalence across diverse demographic groups.
This article can be cited as follows:
Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Hugo Aerts, Leo Anthony Celi, Thomas Hartvigsen, and Danielle S. Bitterman. "Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks." arXiv preprint arXiv:2406.12066, 2024.
@misc{gallifant2024language,
  title={Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks},
  author={Jack Gallifant and Shan Chen and Pedro Moreira and Nikolaj Munch and Mingye Gao and Jackson Pond and Hugo Aerts and Leo Anthony Celi and Thomas Hartvigsen and Danielle S. Bitterman},
  year={2024},
  eprint={2406.12066},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}