The effect of using a large language model to respond to patient messages
Shan Chen (Mass General Brigham; Harvard Medical School; Brigham and Women's Hospital and Dana-Farber Cancer Institute), Marco Guevara (Mass General Brigham; Harvard Medical School; Brigham and Women's Hospital and Dana-Farber Cancer Institute), Shalini Moningi (Brigham and Women's Hospital and Dana-Farber Cancer Institute), Frank Hoebers (Mass General Brigham; Harvard Medical School; Brigham and Women's Hospital and Dana-Farber Cancer Institute; GROW School for Oncology and Reproduction, Maastricht University), Hesham Elhalawani (Brigham and Women's Hospital and Dana-Farber Cancer Institute), Benjamin H. Kann (Mass General Brigham; Harvard Medical School; Brigham and Women's Hospital and Dana-Farber Cancer Institute), Fallon E. Chipidza (Brigham and Women's Hospital and Dana-Farber Cancer Institute), Jonathan Leeman (Brigham and Women's Hospital and Dana-Farber Cancer Institute), Hugo J. W. L. Aerts (Mass General Brigham; Harvard Medical School; Brigham and Women's Hospital and Dana-Farber Cancer Institute; GROW School for Oncology and Reproduction, Maastricht University), Timothy Miller (Computational Health Informatics Program, Boston Children's Hospital, Harvard Medical School), Guergana K. Savova (Computational Health Informatics Program, Boston Children's Hospital, Harvard Medical School), Jack Gallifant (Laboratory for Computational Physiology, Massachusetts Institute of Technology), Leo A. Celi (Laboratory for Computational Physiology, Massachusetts Institute of Technology; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center; Department of Biostatistics, Harvard T H Chan School of Public Health), Raymond H. Mak (Mass General Brigham; Harvard Medical School; Brigham and Women's Hospital and Dana-Farber Cancer Institute), Maryam Lustberg (Department of Medical Oncology, Yale School of Medicine), Majid Afshar (Department of Medicine, University of Wisconsin School of Medicine and Public Health), Danielle S. Bitterman* (Mass General Brigham; Harvard Medical School; Brigham and Women's Hospital and Dana-Farber Cancer Institute; Computational Health Informatics Program, Boston Children's Hospital, Harvard Medical School)
This study found pre-clinical evidence of anchoring on LLM recommendations, raising the question: is using an LLM to assist with documentation simply decision support, or will clinicians tend to take on the reasoning of the LLM?
Despite being a simulation study, these early findings provide a safety signal indicating a need to thoroughly evaluate LLMs in their intended clinical contexts, reflecting the precise task and level of human oversight. Moving forward, greater transparency from EHR vendors and institutions about prompting methods is urgently needed to enable such evaluations. LLM assistance is a promising avenue for reducing clinician workload, but it has implications that could have downstream effects on patient outcomes. This situation necessitates evaluating LLMs with the same rigor as any other software as a medical device.
The Effect of Using a Large Language Model to Respond to Patient Messages
Doctors are facing a growing burden of paperwork and administrative tasks, much of it tied to electronic health record (EHR) systems. This extra work takes time away from patient care and contributes to burnout. To help, some healthcare systems are turning to large language models (LLMs) such as OpenAI's ChatGPT to handle tasks like responding to patient messages. Over the past 5–10 years, the number of messages sent through patient portals has increased significantly, and one of the first uses of LLMs in EHRs is drafting responses to these patient questions.
Evaluating LLM-Assisted Responses
While previous studies have looked at how well LLMs answer medical questions, it's unclear whether they can actually help doctors save time and reduce mental strain. To find out, a study was conducted at Brigham and Women's Hospital in Boston to examine how using LLMs to help draft patient messages affects efficiency, clinical recommendations, and safety.
In this study, six experienced radiation oncologists first wrote responses to patient messages by hand. Then, they edited responses generated by GPT-4, a powerful LLM, to make sure they were appropriate to send to patients. This way, researchers could compare the effectiveness of manually written responses with those assisted by an LLM.
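To make that workflow concrete, below is a minimal sketch, in Python, of how an LLM draft might be generated for physician review. It assumes the OpenAI Python client; the model name, system prompt, and example patient message are illustrative placeholders, not the configuration or data used in the study.

# A minimal sketch of generating an LLM draft for physician review.
# Illustrative only: the model name, system prompt, and patient message
# are assumptions, not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

patient_message = (
    "I finished radiation two weeks ago and today I noticed a fever of 101F "
    "and chills. Should I be worried?"
)

draft = client.chat.completions.create(
    model="gpt-4",  # placeholder model identifier
    messages=[
        {
            "role": "system",
            "content": (
                "You are drafting a reply to a patient portal message on behalf of "
                "a radiation oncologist. The draft will be reviewed and edited by "
                "the physician before anything is sent to the patient."
            ),
        },
        {"role": "user", "content": patient_message},
    ],
)

print(draft.choices[0].message.content)  # the physician reviews and edits this draft

The key design point, as in the study, is that the model output is only a starting point: nothing goes to the patient until the physician has reviewed and edited the draft.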
Study Findings and Implications
The study found that manually written responses were much shorter (34 words on average) than the LLM drafts (169 words) and the LLM-assisted responses (160 words). Doctors felt that using the LLM drafts made their work easier in about 77% of cases. However, there were risks: about 7.1% of the LLM drafts could have been harmful if sent without edits, and 0.6% could have been life-threatening. These harmful responses usually resulted from the LLM failing to recognize the urgency of the clinical situation.
Figure: Total number of responses that included each content category for manual, LLM-draft, and LLM-assisted responses. (A) The overall distribution of content categories present in each response type; pairwise comparisons of the overall distributions by response type were done using Mann–Whitney U tests. (B) Total count of each category for the three response types.
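For readers who want to run this kind of pairwise comparison on their own data, here is a minimal sketch using SciPy's Mann–Whitney U test, the test named in the caption. The word-count values below are made-up placeholders for two response types, not study data.

# Illustrative sketch of a pairwise Mann-Whitney U comparison between two
# response types (e.g., manual vs LLM-assisted). Values are placeholders.
from scipy.stats import mannwhitneyu

manual_word_counts = [20, 31, 28, 45, 36, 22, 50, 30]               # hypothetical
llm_assisted_word_counts = [150, 162, 140, 171, 158, 166, 149, 175]  # hypothetical

stat, p_value = mannwhitneyu(
    manual_word_counts, llm_assisted_word_counts, alternative="two-sided"
)
print(f"U = {stat:.1f}, p = {p_value:.4f}")

The same call can be applied to any per-response measure (word counts, content-category counts) to compare two response types at a time.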
Exploring the Benefits and Risks
The study suggests that using LLMs could be very helpful for doctors by reducing their workload and making their responses more consistent and informative. However, it also shows that we need to be careful. LLMs can sometimes miss the urgency of a medical situation, which can be dangerous. It’s important to keep evaluating these tools to ensure they really help without introducing new risks.
As healthcare continues to adopt advanced technologies like LLMs, it’s crucial to balance their benefits with patient safety. This study highlights the importance of careful implementation and ongoing evaluation of LLMs in clinical settings. They have great potential to support doctors, but we must ensure they enhance, rather than hinder, patient care.
Our work builds on insights into how technology can affect outcomes across subgroups:
Notes: Cross-Care is a new benchmark that evaluates whether language model outputs are grounded in the real-world prevalence of diseases across subgroups. We validate this method across different model architectures, sizes, and alignment strategies.
This article can be cited as follows:
Chen, S., Guevara, M., Moningi, S., Hoebers, F., Elhalawani, H., Kann, B. H., Chipidza, F. E., Leeman, J., Aerts, H. J. W. L., Miller, T., Savova, G. K., Gallifant, J., Celi, L. A., Mak, R. H., Lustberg, M., Afshar, M., & Bitterman, D. S. (2024). The effect of using a large language model to respond to patient messages. The Lancet Digital Health, 6(6), e379-e381. https://doi.org/10.1016/S2589-7500(24)00060-8
@article{chen2024effect,
  title     = {The effect of using a large language model to respond to patient messages},
  author    = {Shan Chen and Marco Guevara and Shalini Moningi and Frank Hoebers and Hesham Elhalawani and Benjamin H. Kann and Fallon E. Chipidza and Jonathan Leeman and Hugo J. W. L. Aerts and Timothy Miller and Guergana K. Savova and Jack Gallifant and Leo A. Celi and Raymond H. Mak and Maryam Lustberg and Majid Afshar and Danielle S. Bitterman},
  journal   = {The Lancet Digital Health},
  volume    = {6},
  number    = {6},
  pages     = {e379--e381},
  year      = {2024},
  publisher = {Elsevier},
  doi       = {10.1016/S2589-7500(24)00060-8}
}