1. Chew BH, Lai PSM, Sivaratnam DA, Basri NI, Appannah G, Mohd Yusof BN, Thambiah SC, Nor Hanipah Z, Wong PF, Chang LC. Efficient and Effective Diabetes Care in the Era of Digitalization and Hypercompetitive Research Culture: A Focused Review in the Western Pacific Region with Malaysia as a Case Study. Health Syst Reform 2025; 11:2417788. [PMID: 39761168] [DOI: 10.1080/23288604.2024.2417788]
Abstract
There are approximately 220 million adults (about 12% regional prevalence) living with diabetes mellitus (DM) and its related complications and morbidity, knowingly or unknowingly, in the Western Pacific Region (WP). The estimated healthcare costs in the WP and Malaysia were 240 billion USD in 2021 and 1.0 billion USD in 2017, respectively, with unmeasurable suffering and loss of health quality and economic productivity. This urgently calls for nothing less than concerted and preventive efforts from all stakeholders to invest in transforming healthcare professionals and reforming the healthcare system: prioritizing the primary medical care setting, empowering allied health professionals, improving health organization for healthcare providers, and improving health facilities and non-medical support for people with DM. This article describes challenges in optimal diabetes care and proposes evidence-based initiatives over a 5-year period in a detailed roadmap to bring about dynamic and efficient healthcare services that are effective in managing people with DM, using Malaysia as a case study for reference by other countries with similar backgrounds and issues. This includes a scan of the landscape of clinical research in DM, the dimensions and spectrum of research misconduct, common biases along the whole research process, and key preventive strategies, implementation, and limitations toward high-quality research. Lastly, digital medicine and how artificial intelligence could contribute to diabetes care and open science practices in research are also discussed.
Affiliation(s)
- Boon-How Chew
- Department of Family Medicine, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Family Medicine Specialist Clinic, Hospital Sultan Abdul Aziz Shah (HSAAS Teaching Hospital), Persiaran MARDI - UPM, Serdang, Selangor, Malaysia
- Pauline Siew Mei Lai
- Department of Primary Care Medicine, Faculty of Medicine, Universiti Malaya, School of Medical and Life Sciences, Sunway University, Kuala Lumpur, Selangor, Malaysia
- Dhashani A/P Sivaratnam
- Department of Ophthalmology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Nurul Iftida Basri
- Department of Obstetrics and Gynaecology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Geeta Appannah
- Department of Nutrition, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Barakatun Nisak Mohd Yusof
- Department of Dietetics, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Subashini C Thambiah
- Department of Pathology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Zubaidah Nor Hanipah
- Department of Surgery, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Li-Cheng Chang
- Kuang Health Clinic, Pekan Kuang, Gombak, Selangor, Malaysia
2. Pushpanathan K, Zou M, Srinivasan S, Wong WM, Mangunkusumo EA, Thomas GN, Lai Y, Sun CH, Lam JSH, Tan MCJ, Lin HAH, Ma W, Koh VTC, Chen DZ, Tham YC. Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries? Ophthalmology Science 2025; 5:100745. [PMID: 40291392] [PMCID: PMC12022690] [DOI: 10.1016/j.xops.2025.100745]
Abstract
Objective The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher-quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability. Design Cross-sectional study. Subjects Sixteen queries, previously identified as suboptimally responded to by ChatGPT-4 in prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions). Methods For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses on correctness, completeness, and readability (each on a 5-point scale). Main Outcome Measures Mean summed scores of each model for correctness, completeness, and readability, each rated on a 5-point scale by 3 graders (maximum summed score: 15). Results o1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopic, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15. Conclusions While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from that of ChatGPT-4o. Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability. Financial Disclosures Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
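The scoring scheme is simple to reproduce: three masked graders each rate a response on a 5-point scale, per-question scores are summed (maximum 15), and summed scores are averaged across questions per model. A minimal sketch with invented ratings (not the study's data):

```python
# Toy illustration of the study's scoring scheme; the ratings are invented.
import numpy as np

# shape (questions, graders): correctness ratings for one model, each 1-5
ratings = np.array([
    [4, 5, 4],
    [3, 4, 4],
    [5, 5, 5],
])
summed = ratings.sum(axis=1)   # per-question summed score, out of 15
print(summed)                  # [13 11 15]
print(summed.mean())           # mean summed correctness score: 13.0
```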
Affiliation(s)
- Krithi Pushpanathan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Minjie Zou
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Sahana Srinivasan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Wendy Meihua Wong
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Erlangga Ariadarma Mangunkusumo
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- George Naveen Thomas
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Yien Lai
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Chen-Hsin Sun
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Janice Sing Harn Lam
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Marcus Chun Jin Tan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Hazel Anne Hui'En Lin
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Weizhi Ma
- Institute for AI Industry Research, Tsinghua University, Beijing, China
- Victor Teck Chang Koh
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- David Ziyou Chen
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Yih-Chung Tham
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
- Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore
3. Zaman A, Yassin MM, Mehmud I, Cao A, Lu J, Hassan H, Kang Y. Challenges, optimization strategies, and future horizons of advanced deep learning approaches for brain lesion segmentation. Methods 2025; 239:140-168. [PMID: 40306473] [DOI: 10.1016/j.ymeth.2025.04.016]
Abstract
Brain lesion segmentation, which aims to delineate lesion regions precisely, is a challenging task in medical image analysis. Deep learning (DL) techniques have recently demonstrated promising results across various computer vision tasks, including semantic segmentation, object detection, and image classification. This paper offers an overview of recent DL algorithms for brain tumor and stroke segmentation, drawing on literature from 2021 to 2024. It highlights the strengths, limitations, current research challenges, and unexplored areas in imaging-based brain lesion classification, based on insights from over 250 recent review papers. Techniques addressing difficulties such as class imbalance and multi-modality data are presented. Optimization methods for improving performance with respect to computational and structural complexity and processing speed are discussed, including lightweight neural networks, multilayer architectures, and computationally efficient, highly accurate network designs. The paper also reviews generic and recent frameworks for different brain lesion detection techniques and highlights publicly available benchmark datasets and their issues. Furthermore, open research areas, application prospects, and future directions for DL-based brain lesion classification are discussed, including integrating neural architecture search methods with domain knowledge, predicting patient survival, and learning to separate brain lesions using patient statistics. To ensure patient privacy, future research is anticipated to explore privacy-preserving learning frameworks. Overall, the presented suggestions serve as a guideline for researchers and system designers involved in brain lesion detection and stroke segmentation tasks.
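Among the class-imbalance remedies such reviews discuss, the soft Dice loss is one of the most common. A minimal PyTorch sketch of that single technique (an illustrative implementation, not code from any reviewed paper):

```python
# Soft Dice loss for binary lesion segmentation: directly optimizes overlap,
# so sparse lesion voxels are not swamped by the background class.
import torch

def soft_dice_loss(logits, targets, eps=1e-6):
    """logits: (N, 1, H, W) raw scores; targets: (N, 1, H, W) in {0, 1}."""
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)
    intersection = (probs * targets).sum(dims)
    denominator = probs.sum(dims) + targets.sum(dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()

logits = torch.randn(2, 1, 64, 64, requires_grad=True)
targets = (torch.rand(2, 1, 64, 64) > 0.95).float()  # sparse lesion mask
soft_dice_loss(logits, targets).backward()
```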
Affiliation(s)
- Asim Zaman
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Medical School, Shenzhen University, Shenzhen 518060, China
- Mazen M Yassin
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China
- Irfan Mehmud
- Department of Urology, The Third Affiliated Hospital of Shenzhen University (Luohu Hospital Group), Shenzhen University, Shenzhen 518000, China; Institute of Urology, South China Hospital, Medicine School, Shenzhen University, Shenzhen 518000, China
- Anbo Cao
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China
- Jiaxi Lu
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China
- Haseeb Hassan
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China
- Yan Kang
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Medical School, Shenzhen University, Shenzhen 518060, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110169, China
4. Su Y, Babore YB, Kahn CE. A Large Language Model to Detect Negated Expressions in Radiology Reports. Journal of Imaging Informatics in Medicine 2025; 38:1297-1303. [PMID: 39322813] [DOI: 10.1007/s10278-024-01274-9]
Abstract
Natural language processing (NLP) is crucial for accurately extracting information from unstructured text to provide insights for clinical decision-making, quality improvement, and medical research. This study compared the performance of a rule-based NLP system and a medical-domain transformer-based model in detecting negated concepts in radiology reports. Using a corpus of 984 de-identified radiology reports from a large U.S.-based academic health system (1,000 consecutive reports, excluding 16 duplicates), the investigators compared the rule-based medspaCy system and the Clinical Assertion and Negation Classification Bidirectional Encoder Representations from Transformers (CAN-BERT) system in detecting negated expressions of terms from RadLex, the Unified Medical Language System Metathesaurus, and the Radiology Gamuts Ontology. Power analysis determined a sample size of 382 terms to achieve α = 0.05 and power of 0.8 for McNemar's test; based on an estimate of 15% negated terms, 2,800 randomly selected terms were annotated manually as negated or not negated. Precision, recall, and F1 of the two models were compared using McNemar's test. Of the 2,800 terms, 387 (13.8%) were negated. For negation detection, medspaCy attained a recall of 0.795, precision of 0.356, and F1 of 0.492. CAN-BERT achieved a recall of 0.785, precision of 0.768, and F1 of 0.777. Although recall was not significantly different, CAN-BERT had significantly better precision (χ2 = 304.64; p < 0.001). The transformer-based CAN-BERT model detected negated terms in radiology reports with high precision and recall; its precision significantly exceeded that of the rule-based medspaCy system. Use of this system will improve data extraction from textual reports to support information retrieval, AI model training, and discovery of causal relationships.
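The paired comparison at the heart of this design is easy to reproduce: tabulate, over the same annotated terms, where the two systems agree and disagree with the gold labels, then apply McNemar's test to the resulting 2x2 table. A minimal sketch with simulated labels (not the study's data):

```python
# Sketch of a paired McNemar comparison of two negation detectors
# evaluated on the same gold-labeled terms; labels here are simulated.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 500)                                # gold labels
rule_based = np.where(rng.random(500) < 0.80, truth, 1 - truth)
transformer = np.where(rng.random(500) < 0.90, truth, 1 - truth)

a_ok, b_ok = rule_based == truth, transformer == truth
# 2x2 paired-correctness table:
# [[both right, only rule-based right], [only transformer right, both wrong]]
table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
print(mcnemar(table, exact=False, correction=True))  # chi2 and p-value
```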
Affiliation(s)
- Yvonne Su
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, 19104, PA, USA
- Yonatan B Babore
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Charles E Kahn
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
5. Park SH, Dean G, Ortiz EM, Choi JI. Overview of South Korean Guidelines for Approval of Large Language or Multimodal Models as Medical Devices: Key Features and Areas for Improvement. Korean J Radiol 2025; 26:519-523. [PMID: 40288893] [DOI: 10.3348/kjr.2025.0257]
Affiliation(s)
- Seong Ho Park
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea.
- Geraldine Dean
- Telemedicine Clinic Ltd. (a Unilabs company), Barcelona, Spain
- Unilabs AI Centre of Excellence, Barcelona, Spain
- NHS Southwest London, London, United Kingdom
- Joon-Il Choi
- Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
6. Boyle A, Huo B, Sylla P, Calabrese E, Kumar S, Slater BJ, Walsh DS, Vosburg RW. Large language model-generated clinical practice guideline for appendicitis. Surg Endosc 2025; 39:3539-3551. [PMID: 40251310] [DOI: 10.1007/s00464-025-11723-3]
Abstract
BACKGROUND Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest SAGES guideline on the surgical management of appendicitis as a comparison. METHODS Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared with the existing SAGES guideline. RESULTS Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or to reliably perform screening, data extraction, or risk-of-bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline; in 19 of the 24 domains, the two guidelines scored within two points of each other. CONCLUSIONS LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce the time and resource burden associated with these tasks. As new models are developed, the role of LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs into each step of guideline development.
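To illustrate the kind of prompt engineering the METHODS describe, the sketch below turns a PICO into a GRADE Evidence-to-Decision request; the PICO text and prompt wording are invented, not the study's actual prompts:

```python
# Hypothetical prompt for one guideline-development task (drafting a
# GRADE-style recommendation from a PICO); wording is illustrative only.
pico = {
    "population": "adults with uncomplicated acute appendicitis",
    "intervention": "antibiotic-first management",
    "comparator": "laparoscopic appendectomy",
    "outcomes": "treatment failure at 1 year, complications, length of stay",
}

prompt = (
    "You are assisting a surgical guideline panel using the GRADE "
    "Evidence-to-Decision framework.\n"
    f"Population: {pico['population']}\n"
    f"Intervention: {pico['intervention']}\n"
    f"Comparator: {pico['comparator']}\n"
    f"Critical outcomes: {pico['outcomes']}\n"
    "Given the evidence summary that follows, draft a recommendation with "
    "direction, strength, and certainty of evidence, justifying each "
    "Evidence-to-Decision criterion."
)
print(prompt)  # sent to the chosen LLM's API
```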
Affiliation(s)
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, ON, Canada
- Patricia Sylla
- Division of Colon and Rectal Surgery, Department of Surgery, Mount Sinai Hospital, New York, NY, USA
- Elisa Calabrese
- Department of Surgery, University of Adelaide, The Queen Elizabeth Hospital, Adelaide, SA, Australia
- Sunjay Kumar
- Department of General Surgery, Thomas Jefferson University Hospital, Philadelphia, PA, USA
- Danielle S Walsh
- Department of Surgery, University of Kentucky, Lexington, KY, USA
- R Wesley Vosburg
- Department of Surgery, Grand Strand Medical Center, Myrtle Beach, SC, USA
7. Turner KM, Ahmad SA. Large language models as clinical decision support tools: A call for careful implementation in the care of patients with pancreatic cancer. Surgery 2025; 182:109378. [PMID: 40287319] [DOI: 10.1016/j.surg.2025.109378]
Affiliation(s)
- Kevin M Turner
- Department of Surgery, University of Cincinnati College of Medicine, Cincinnati, OH. https://twitter.com/KevinTurnerMD
- Syed A Ahmad
- Department of Surgery, Division of Surgical Oncology, University of Cincinnati College of Medicine, Cincinnati, OH
8. Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, Kleesiek J, Sushil M, Adams LC, Bressem KK. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J Am Med Inform Assoc 2025; 32:1015-1024. [PMID: 40190132] [DOI: 10.1093/jamia/ocaf045]
Abstract
OBJECTIVES Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. MATERIALS AND METHODS We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. RESULTS Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate. DISCUSSION Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation. CONCLUSION Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.
Affiliation(s)
- Felix J Dorfner
- Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin 10117, Germany
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, United States
- Amin Dada
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
- Felix Busch
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- Marcus R Makowski
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- Tianyu Han
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
- Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
- Jens Kleesiek
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
- Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen, University Hospital Essen (AöR), Essen 45147, Germany
- German Cancer Consortium (DKTK, Partner Site Essen), Heidelberg, Germany
- Department of Physics, TU Dortmund, Dortmund 44227, Germany
- Madhumita Sushil
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Lisa C Adams
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- Keno K Bressem
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- German Heart Center Munich, Technical University Munich, Munich 80636, Germany
9. Wang Z, Sun J, Liu H, Luo X, Li J, He W, Yang Z, Lv H, Chen Y, Wang Z. Development and Performance of a Large Language Model for the Quality Evaluation of Multi-Language Medical Imaging Guidelines and Consensus. J Evid Based Med 2025; 18:e70020. [PMID: 40181523] [DOI: 10.1111/jebm.70020]
Abstract
AIM This study aimed to develop and evaluate an automated large language model (LLM)-based system for assessing the quality of medical imaging guidelines and consensus (GACS) in different languages, focusing on enhancing evaluation efficiency, consistency, and reducing manual workload. METHOD We developed the QPC-HASE-GuidelineEval algorithm, which integrates a Four-Quadrant Questions Classification Strategy and Hybrid Search Enhancement. The model was validated on 45 medical imaging guidelines (36 in Chinese and 9 in English) published in 2021 and 2022. Key evaluation metrics included consistency with expert assessments, hybrid search paragraph matching accuracy, information completeness, comparisons of different paragraph matching approaches, and cost-time efficiency. RESULTS The algorithm demonstrated an average accuracy of 77%, excelling in simpler tasks but showing lower accuracy (29%-40%) in complex evaluations, such as explanations and visual aids. The average accuracy rates of the English and Chinese versions of the GACS were 74% and 76%, respectively (p = 0.37). Hybrid search demonstrated superior performance with paragraph matching accuracy (4.42) and information completeness (4.42), significantly outperforming keyword-based search (1.05/1.05) and sparse-dense retrieval (4.26/3.63). The algorithm significantly reduced evaluation time to 8 min and 30 s per guideline and reduced costs to approximately 0.5 USD per guideline, offering a considerable advantage over traditional manual methods. CONCLUSION The QPC-HASE-GuidelineEval algorithm, powered by LLMs, showed strong potential for improving the efficiency, scalability, and multi-language capability of guideline evaluations, though further enhancements are needed to handle more complex tasks that require deeper interpretation.
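The core of the hybrid-search idea is score fusion: rank each candidate paragraph by a weighted combination of a sparse keyword score and a dense embedding similarity. A minimal sketch under stated assumptions (sentence-transformers as the embedder, a toy term-overlap score standing in for the paper's Elasticsearch-style keyword stage, invented paragraphs):

```python
# Minimal hybrid retrieval: fuse sparse keyword and dense embedding scores.
import numpy as np
from sentence_transformers import SentenceTransformer

paragraphs = [
    "CT angiography is recommended for suspected pulmonary embolism.",
    "MRI without contrast is preferred in pregnancy where feasible.",
    "Ultrasound is first-line for right upper quadrant pain.",
]
query = "imaging for pulmonary embolism"

def keyword_score(q, p):
    """Toy sparse score: fraction of query terms present in the paragraph."""
    q_terms = set(q.lower().split())
    return len(q_terms & set(p.lower().split())) / len(q_terms)

embedder = SentenceTransformer("all-MiniLM-L6-v2")
dense = embedder.encode(paragraphs, normalize_embeddings=True) @ \
        embedder.encode([query], normalize_embeddings=True)[0]
sparse = np.array([keyword_score(query, p) for p in paragraphs])

alpha = 0.5                              # fusion weight, a tunable assumption
fused = alpha * sparse + (1 - alpha) * dense
print(paragraphs[int(fused.argmax())])   # best-matching paragraph
```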
Affiliation(s)
- Zhixiang Wang
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Precision and Intelligent Imaging Laboratory, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Jing Sun
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
- Hui Liu
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Xufei Luo
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Jia Li
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
- Wenjun He
- Dermatology Hospital, Southern Medical University, Guangzhou, China
- Acacia Laboratory for Implementation Science, School of Health Management, Southern Medical University, Guangzhou, China
- Zhenhua Yang
- Vincent V.C. Woo Chinese Medicine Clinical Research Institute, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
- Han Lv
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
- Yaolong Chen
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Zhenchang Wang
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
10. Satheakeerthy S, Jesudason D, Pietris J, Bacchi S, Chan WO. LLM-assisted medical documentation: efficacy, errors, and ethical considerations in ophthalmology. Eye (Lond) 2025; 39:1440-1442. [PMID: 40148503] [PMCID: PMC12089378] [DOI: 10.1038/s41433-025-03767-5]
Affiliation(s)
- Shrirajh Satheakeerthy
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
- Daniel Jesudason
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- James Pietris
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
- Stephen Bacchi
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
- Massachusetts General Hospital, 55 Fruit St, Boston, MA, 02114, USA
- Weng Onn Chan
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
- The Queen Elizabeth Hospital, 28 Woodville Rd, Woodville South, SA, 5011, Australia
11. Guan H, Novoa-Laurentiev J, Zhou L. CD-Tron: Leveraging large clinical language model for early detection of cognitive decline from electronic health records. J Biomed Inform 2025; 166:104830. [PMID: 40320101] [DOI: 10.1016/j.jbi.2025.104830]
Abstract
BACKGROUND Early detection of cognitive decline during the preclinical stage of Alzheimer's disease and related dementias (AD/ADRD) is crucial for timely intervention and treatment. Clinical notes in the electronic health record (EHR) contain valuable information that can aid in the early identification of cognitive decline. In this study, we utilize advanced large clinical language models, fine-tuned on clinical notes, to improve the early detection of cognitive decline. METHODS We collected clinical notes from 2,166 patients spanning the 4 years preceding their initial mild cognitive impairment (MCI) diagnosis from the Enterprise Data Warehouse of Mass General Brigham. We developed CD-Tron, built upon a large clinical language model that was fine-tuned using 4,949 expert-labeled note sections. For evaluation, the trained model was applied to 1,996 independent note sections to assess its performance on real-world unstructured clinical data. Additionally, we used explainable AI techniques, specifically SHAP (SHapley Additive exPlanations) values, to interpret the model's predictions and provide insight into the most influential features. Error analysis was also conducted to further characterize the model's predictions. RESULTS CD-Tron significantly outperforms baseline models, achieving notable improvements in precision, recall, and AUC for detecting cognitive decline (CD). Tested on the independent set of real-world clinical notes, CD-Tron demonstrated high sensitivity with only one false negative, which is crucial for clinical applications that prioritize early and accurate CD detection. SHAP-based interpretability analysis highlighted key textual features contributing to model predictions, supporting transparency and clinician understanding. CONCLUSION CD-Tron offers a novel approach to early cognitive decline detection by applying large clinical language models to free-text EHR data. Fine-tuned on real-world clinical notes, it accurately identifies early cognitive decline and integrates SHAP for interpretability, enhancing transparency in predictions.
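For the interpretability step, SHAP can be applied directly to a text-classification pipeline to obtain token-level attributions. A minimal sketch under stated assumptions (a generic off-the-shelf classifier stands in for CD-Tron, whose weights are not public here, and the note text is invented):

```python
# Sketch: token-level SHAP attributions for a text classifier.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in model
    top_k=None,  # return scores for every class
)
explainer = shap.Explainer(clf)  # SHAP infers a text masker for pipelines
notes = ["Patient reports increasing forgetfulness and word-finding difficulty."]
shap_values = explainer(notes)
print(shap_values[0])  # per-token contributions to each class score
```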
Affiliation(s)
- Hao Guan
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA.
- John Novoa-Laurentiev
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA
- Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
12. Wu Y, Zhang Y, Wu Y, Zheng Q, Li X, Chen X. ChatIOS: Improving automatic 3-dimensional tooth segmentation via GPT-4V and multimodal pre-training. J Dent 2025; 157:105755. [PMID: 40228651] [DOI: 10.1016/j.jdent.2025.105755]
Abstract
OBJECTIVES This study proposes a framework that integrates GPT-4V, a recent advanced version of ChatGPT, with multimodal pre-training techniques to enhance deep learning algorithms for 3-dimensional (3D) tooth segmentation in scans produced by intraoral scanners (IOSs). METHODS The framework was developed on 1,800 intraoral scans containing approximately 24,000 annotated teeth (training set: 1,200 scans, 16,004 teeth; testing set: 600 scans, 7,995 teeth) from the Teeth3DS dataset, gathered from 900 patients and covering both the maxillary and mandibular arches. The first step of the proposed framework, ChatIOS, is to pre-process the 3D IOS data to extract 3D point clouds. GPT-4V then generates detailed descriptions of 2-dimensional (2D) IOS images taken from different view angles. In the multimodal pre-training, triplets comprising point clouds, 2D images, and text descriptions serve as inputs. A series of ablation studies was systematically conducted to illustrate the superior design of the automatic 3D tooth segmentation system. Our quantitative evaluation criteria included segmentation quality, processing speed, and clinical applicability. RESULTS When tested on 600 scans, ChatIOS substantially outperformed existing benchmarks such as PointNet++ across all metrics, including mean intersection-over-union (mIoU; from 90.3% to 93.0% for maxillary and from 89.2% to 92.3% for mandibular scans), segmentation accuracy (from 97.0% to 98.0% for maxillary and from 96.8% to 97.9% for mandibular scans), and dice similarity coefficient (DSC; from 98.1% to 98.7% for maxillary and from 97.9% to 98.6% for mandibular scans). Our model took only approximately 2 s to generate segmentation outputs per scan and exhibited acceptable consistency with clinical expert evaluations. CONCLUSIONS Our ChatIOS framework can increase the effectiveness and efficiency of the 3D tooth segmentation that clinical procedures, including orthodontic and prosthetic treatments, require. This study presents an early exploration of the applications of GPT-4V in digital dentistry and pioneers the multimodal pre-training paradigm for 3D tooth segmentation. CLINICAL SIGNIFICANCE Accurate segmentation of teeth on 3D intraoral scans is critical for orthodontic and prosthetic treatments. ChatIOS integrates GPT-4V with pre-trained vision-language models (VLMs) to gain an in-depth understanding of IOS data, contributing to more efficient and precise tooth segmentation systems.
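The multimodal pre-training step can be pictured as contrastive alignment of the three modalities of each triplet. A minimal PyTorch sketch under stated assumptions (stand-in linear encoders over random features; the actual point-cloud, vision, and text backbones are not reproduced):

```python
# Tri-modal contrastive alignment of (point cloud, image, text) triplets.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(a.size(0))       # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

B, D = 8, 256
enc_pc, enc_img, enc_txt = (torch.nn.Linear(512, D) for _ in range(3))
pc_feat, img_feat, txt_feat = (torch.randn(B, 512) for _ in range(3))

z_pc, z_img, z_txt = enc_pc(pc_feat), enc_img(img_feat), enc_txt(txt_feat)
# Pull the three views of each triplet together, push other triplets apart
loss = info_nce(z_pc, z_img) + info_nce(z_pc, z_txt) + info_nce(z_img, z_txt)
loss.backward()
```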
Affiliation(s)
- Yongjia Wu
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China.
- Yun Zhang
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Yange Wu
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Qianhan Zheng
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Xiaojun Li
- Department of Periodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Xuepeng Chen
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
13. Armitage RC. How do GPs Want Large Language Models to be Applied in Primary Care, and What Are Their Concerns? A Cross-Sectional Survey. J Eval Clin Pract 2025; 31:e70129. [PMID: 40369934] [PMCID: PMC12079004] [DOI: 10.1111/jep.70129]
Abstract
INTRODUCTION Although the potential utility of large language models (LLMs) in medicine and healthcare is substantial, no assessment has been made to date of how GPs want LLMs to be applied in primary care, or of which issues GPs are most concerned about regarding the implementation of LLMs into their clinical practice. This study's objective was to generate preliminary evidence that answers these questions, which are relevant because GPs themselves will ultimately harness the power of LLMs in primary care. METHODS Non-probability sampling was utilised: GPs practicing in the UK who were members of one of two Facebook groups (one a community of UK primary care staff, the other a community of GMC-registered doctors in the UK) were invited to complete an online survey, which ran from 06 to 13 November 2024. RESULTS The survey received 113 responses, 107 of which were from GPs practicing in the UK. When LLM accuracy and safety were assumed to be guaranteed, respondents reported broad enthusiasm for LLMs carrying out various nonclinical and clinical tasks in primary care. The nonclinical and clinical tasks that respondents were most supportive of were, respectively, the LLM listening to the consultation and writing notes in real time for the GP to review, edit, and save (44.0%), and the LLM identifying outstanding clinical tasks and actioning them (51.0%). Respondents were concerned about a range of issues regarding LLMs being embedded into clinical systems, with patient safety being the most commonly reported single concern (36.2%). DISCUSSION This study has generated preliminary evidence of potential utility to those developing LLMs for use in primary care. Further research is required to expand this evidence base to further inform the development of these technologies and to ensure they are acceptable to the GPs who will use them.
Affiliation(s)
- Richard C. Armitage
- Academic Unit of Population and Lifespan Sciences, School of Medicine, University of Nottingham, Nottingham, UK
14. Choi H, Lee D, Kang YK, Suh M. Empowering PET imaging reporting with retrieval-augmented large language models and reading reports database: a pilot single center study. Eur J Nucl Med Mol Imaging 2025; 52:2452-2462. [PMID: 39843863] [DOI: 10.1007/s00259-025-07101-9]
Abstract
PURPOSE Large language models (LLMs) have the potential to enhance a variety of clinical natural language tasks, including medical imaging reporting. This pilot study examines the efficacy of a retrieval-augmented generation (RAG) LLM system, which leverages the zero-shot learning capability of LLMs and is integrated with a comprehensive database of PET reading reports, in improving reference to prior reports and decision making. METHODS We developed a custom LLM framework with retrieval capabilities, leveraging a database of over 10 years of PET imaging reports from a single center. The system uses vector space embedding to facilitate similarity-based retrieval. Queries prompt the system to generate context-based answers and identify similar cases or differential diagnoses. From routine clinical PET readings, experienced nuclear medicine physicians evaluated the performance of the system in terms of the relevance of retrieved similar cases and the appropriateness score of suggested potential diagnoses. RESULTS The system efficiently organized embedded vectors from PET reports, showing that imaging reports were accurately clustered within the embedded vector space according to diagnosis or PET study type. Based on this system, a proof-of-concept chatbot was developed and demonstrated the framework's potential for referencing reports of previous similar cases and identifying exemplary cases for various purposes. In routine clinical PET readings, 84.1% of cases retrieved relevant similar cases, as agreed upon by all three readers. Using the RAG system, the appropriateness score of the suggested potential diagnoses was significantly better than that of the LLM without RAG. Additionally, the system demonstrated the capability to offer differential diagnoses, leveraging the vast database to enhance the completeness and precision of generated reports. CONCLUSION The integration of a RAG LLM with a large database of PET imaging reports suggests the potential to support the clinical practice of nuclear medicine imaging reading through various AI tasks, including finding similar cases and deriving potential diagnoses from them. This study underscores the potential of advanced AI tools in transforming medical imaging reporting practices.
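The retrieval core described here (embed every report, then fetch the nearest prior reports to ground the LLM's answer) can be sketched in a few lines. Assumptions: sentence-transformers as the embedding model and invented report snippets; the study's actual embedding model and PET report database are not reproduced:

```python
# Similarity-based retrieval over prior reports for a RAG prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

reports = [
    "FDG PET/CT: hypermetabolic right upper lobe nodule, SUVmax 8.2.",
    "FDG PET/CT: diffuse marrow uptake consistent with reactive change.",
    "Amyloid PET: elevated cortical tracer retention; positive study.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
report_vecs = model.encode(reports, normalize_embeddings=True)

query = "hypermetabolic pulmonary nodule on FDG PET"
q_vec = model.encode([query], normalize_embeddings=True)[0]

sims = report_vecs @ q_vec       # cosine similarity on normalized vectors
top = np.argsort(-sims)[:2]      # the most similar prior reports
context = "\n".join(reports[i] for i in top)
print(f"Prior similar reports:\n{context}\n\nQuestion: {query}")
```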
Affiliation(s)
- Hongyoon Choi
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
- Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Portrai, Inc., Seoul, Republic of Korea.
- Yeon-Koo Kang
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
- Minseok Suh
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
- Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
15. Lamprecht CB, Lyerly M, Lucke-Wold B. Commentary: CNS-CLIP: Transforming a Neurosurgical Journal Into a Multimodal Medical Model. Neurosurgery 2025; 96:e123-e124. [PMID: 39636115] [DOI: 10.1227/neu.0000000000003298]
Affiliation(s)
- Chris B Lamprecht
- Department of Neurosurgery, College of Medicine, University of Florida, Gainesville, Florida, USA
- Mac Lyerly
- Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA
- Brandon Lucke-Wold
- Lillian S. Wells Department of Neurosurgery, University of Florida, Gainesville, Florida, USA
16. Deng A, Chen W, Dai J, Jiang L, Chen Y, Chen Y, Jiang J, Rao M. Current application of ChatGPT in undergraduate nuclear medicine education: Taking Chongqing Medical University as an example. Medical Teacher 2025; 47:997-1003. [PMID: 39305476] [DOI: 10.1080/0142159x.2024.2399673]
Abstract
BACKGROUND Nuclear Medicine (NM), as an inherently interdisciplinary field, integrates diverse scientific principles and advanced imaging techniques. The advent of ChatGPT, a large language model, opens new avenues for medical educational innovation. With its advanced natural language processing abilities and complex algorithms, ChatGPT holds the potential to substantially enrich medical education, particularly in NM. OBJECTIVE To investigate the current application of ChatGPT in undergraduate Nuclear Medicine Education (NME). METHODS Employing a mixed-methods sequential explanatory design, the research investigates the current status of NME, the use of ChatGPT, and attitudes towards ChatGPT among teachers and students in the Second Clinical College of Chongqing Medical University. RESULTS The investigation yields several salient findings: (1) students and educators in NM face numerous challenges in the learning process; (2) ChatGPT possesses significant applicability and potential benefits in NME; (3) there is a pronounced inclination among respondents to adopt ChatGPT, with a keen interest in its diverse applications within the educational sphere. CONCLUSION ChatGPT has been utilized to address the difficulties faced by undergraduates at Chongqing Medical University in NME and has been applied in various aspects to assist learning. The findings of this survey may offer insights into how ChatGPT can be integrated into practical medical education.
Affiliation(s)
- Ailin Deng
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Wenyi Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Jinjie Dai
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Liu Jiang
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Yicai Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Yuhua Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Jinyan Jiang
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Maohua Rao
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
17. Zhu X. Elevating Clinical Practice in Interventional Radiology With Strategic Prompt Engineering. AJR Am J Roentgenol 2025. [PMID: 40434170] [DOI: 10.2214/ajr.25.33266]
Affiliation(s)
- Xiaoli Zhu
- The First Affiliated Hospital of Soochow University
18. Tang T, Li X, Lin Y, Liu C. Comparing digital real-time versus virtual simulation systems in dental education for preclinical tooth preparation of molars for metal-ceramic crowns. BMC Oral Health 2025; 25:814. [PMID: 40426144] [DOI: 10.1186/s12903-025-06161-5]
Abstract
PURPOSE This study aimed to compare the effectiveness of the Real-time Dental Training and Evaluation System (RDTES) and Virtual Simulation System (VSS) with the Traditional Head-Simulator (THS) method in teaching molar preparation for metal-ceramic crowns in preclinical dental education. METHODS Undergraduate students were divided into four groups: No Additional Training (NAT), THS, RDTES, and VSS. The primary outcomes measured were artificial and machine scoring of tooth preparations, with additional anonymous surveys assessing student feedback. RESULTS Both RDTES and VSS groups demonstrated significantly higher tooth preparation scores compared to the NAT group, with RDTES showing superior performance in machine scan scoring. Linear regression analysis revealed a clear positive correlation between scoring and scoring improvement for both artificial and machine assessments. Student surveys indicated RDTES was rated higher in accuracy, feedback quality, skill improvement, and teaching effectiveness. CONCLUSIONS RDTES and VSS significantly enhance students' mastery of molar tooth preparation, with RDTES providing more precise guidance on tooth preparation volume. These systems show broad application prospects and development potential in dental education.
Affiliation(s)
- Tianyu Tang
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Xingxing Li
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Yunhong Lin
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Caojie Liu
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Yunnan Key Laboratory of Stomatology, The Affiliated Stomatology Hospital of Kunming Medical University, Chenggong District, 1168 West Chunrong Road, Yuhua Avenue, Kunming, Yunnan, 650500, People's Republic of China
19. Kunze KN, Bepple J, Bedi A, Ramkumar PN, Pean CA. Commercial Products Using Generative Artificial Intelligence Include Ambient Scribes, Automated Documentation and Scheduling, Revenue Cycle Management, Patient Engagement and Education, and Prior Authorization Platforms. Arthroscopy 2025:S0749-8063(25)00397-4. [PMID: 40419172] [DOI: 10.1016/j.arthro.2025.05.021]
Abstract
The integration of artificial intelligence (AI) into clinical practice is rapidly transforming healthcare workflows. At the forefront are large language models (LLMs), embedded within commercial and enterprise platforms to optimize documentation, streamline administration, and personalize patient engagement. The evolution of LLMs in healthcare has been driven by rapid advancements in natural language processing (NLP) and deep learning. Emerging commercial products include Ambient Scribes, Automated Documentation and Scheduling, Revenue Cycle Management, Patient Engagement and Education Assistants, and Prior Authorization Platforms. Ambient Scribes remain the leading commercial generative AI product, with approximately 90 platforms in existence to date. Emerging applications may improve provider efficiency and payer-provider alignment by automating the prior authorization process, reducing the manual labor burden placed on clinicians and staff. Current limitations include (1) lack of regulatory oversight, (2) existing biases, (3) inconsistent interoperability with electronic health records (EHRs), and (4) lack of physician and stakeholder buy-in owing to limited confidence in LLM outputs. Moving forward requires discussion of ethical, clinical, and operational considerations.
Collapse
Affiliation(s)
- Kyle N Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY, USA.
| | | | - Asheesh Bedi
- Department of Orthopaedic Surgery, University of Michigan, Ann Arbor, MI, USA
| | | | - Christian A Pean
- Department of Orthopaedic Surgery, Duke University School of Medicine, Durham, NC, USA
| |
Collapse
|
20
|
Yang Q, Zuo H, Su R, Su H, Zeng T, Zhou H, Wang R, Chen J, Lin Y, Chen Z, Tan T. Dual retrieving and ranking medical large language model with retrieval augmented generation. Sci Rep 2025; 15:18062. [PMID: 40413225 PMCID: PMC12103550 DOI: 10.1038/s41598-025-00724-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2024] [Accepted: 04/30/2025] [Indexed: 05/27/2025] Open
Abstract
Recent advancements in large language models (LLMs) have significantly enhanced text generation across various sectors; however, their medical application faces critical challenges regarding both accuracy and real-time responsiveness. To address these dual challenges, we propose a novel two-step retrieve-and-rank retrieval-augmented generation (RAG) framework that combines embedding search with Elasticsearch technology. Built upon a dynamically updated medical knowledge base incorporating expert-reviewed documents from leading healthcare institutions, our hybrid architecture employs ColBERTv2 for context-aware result ranking while maintaining computational efficiency. Experimental results show a 10% improvement in accuracy for complex medical queries compared to standalone LLM and single-search RAG variants. Latency remains a challenge in our experimental setting for emergency situations requiring sub-second responses, although real-time performance could be achieved with more powerful hardware in real-world deployments. This work establishes a new paradigm for reliable medical AI assistants that balances accuracy with practical deployment considerations.
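To make the two-step pipeline concrete, below is a minimal Python sketch of the retrieve-then-rerank pattern: a cheap hybrid pass (keyword plus embedding scores) over the whole knowledge base, then an expensive fine-grained reranker over only the shortlisted candidates. The scoring functions are simplified stand-ins for the paper's components (Elasticsearch/BM25, embedding search, ColBERTv2), and all names and documents are hypothetical.

```python
# Illustrative retrieve-then-rerank RAG skeleton; scorers are toy stand-ins.
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str

KNOWLEDGE_BASE = [
    Doc("kb1", "Metformin is first-line therapy for type 2 diabetes."),
    Doc("kb2", "Insulin dosing must be adjusted for renal impairment."),
    Doc("kb3", "Sepsis requires prompt antibiotics and fluid resuscitation."),
]

def keyword_score(query: str, doc: Doc) -> float:
    """Stand-in for an Elasticsearch/BM25 keyword score: token overlap."""
    q, d = set(query.lower().split()), set(doc.text.lower().split())
    return len(q & d) / (len(q) or 1)

def dense_score(query: str, doc: Doc) -> float:
    """Stand-in for embedding cosine similarity: character-bigram Jaccard."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    q, d = grams(query.lower()), grams(doc.text.lower())
    return len(q & d) / (len(q | d) or 1)

def rerank_score(query: str, doc: Doc) -> float:
    """Stand-in for ColBERTv2-style late interaction: best match per query token."""
    d_tokens = doc.text.lower().split()
    return sum(max(dense_score(t, Doc("", dt)) for dt in d_tokens)
               for t in query.lower().split())

def retrieve_then_rerank(query: str, k_retrieve: int = 2, k_final: int = 1):
    # Step 1: cheap hybrid retrieval over the whole knowledge base.
    pooled = sorted(KNOWLEDGE_BASE,
                    key=lambda d: keyword_score(query, d) + dense_score(query, d),
                    reverse=True)[:k_retrieve]
    # Step 2: expensive reranking restricted to the small candidate pool,
    # which is what keeps the second stage computationally affordable.
    return sorted(pooled, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

if __name__ == "__main__":
    for doc in retrieve_then_rerank("first line drug for type 2 diabetes"):
        print(doc.doc_id, doc.text)
```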
Collapse
Affiliation(s)
- Qimin Yang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Huan Zuo
- School of Public Health, University of South China, Hengyang, China
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Runqi Su
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Hanyinghong Su
- School of Public Health, University of South China, Hengyang, China
| | - Tangyi Zeng
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Huimei Zhou
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Rongsheng Wang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Jiexin Chen
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Yijun Lin
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Zhiyi Chen
- School of Public Health, University of South China, Hengyang, China.
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China.
- Key Laboratory of Medical Imaging Precision Theranostics and Radiation Protection, College of Hunan Province, The Affiliated Changsha Central Hospital, University of South China, Changsha, China.
- Department of Medical Imaging, The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China.
| | - Tao Tan
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China.
| |
Collapse
|
21
|
McInnis MG, Coleman B, Hurwitz E, Robinson PN, Williams AE, Haendel MA, McMurry JA. Integrating Knowledge: The Power of Ontologies in Psychiatric Research and Clinical Informatics. Biol Psychiatry 2025:S0006-3223(25)01213-2. [PMID: 40414449 DOI: 10.1016/j.biopsych.2025.05.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/07/2025] [Revised: 05/07/2025] [Accepted: 05/14/2025] [Indexed: 05/27/2025]
Abstract
Ontologies are structured frameworks for representing knowledge by systematically defining concepts, categories, and their relationships. While widely adopted in biomedicine, ontologies remain largely absent in mental health research and clinical care, where the field continues to rely heavily on existing classification systems such as the DSM. Although useful for clinical communication and administrative purposes, these systems lack the semantic structure and computational reasoning properties needed to integrate diverse data sources or support artificial intelligence (AI)-enabled analysis. This reliance on classification systems limits efforts to analyze and interpret complex, heterogeneous psychiatric data. In mood disorders, particularly bipolar disorder, the lack of formalized semantic models contributes to diagnostic inconsistencies, fragmented data structures, and barriers to precision medicine. Ontologies, by contrast, provide a standardized, machine-readable foundation for linking multimodal data sources, such as electronic health records (EHRs), genetic and neuroimaging data, and social determinants of health, while enabling secure, de-identified computation. This review surveys the current landscape of mental health ontologies and highlights the Human Phenotype Ontology (HPO) as a promising framework for bridging psychiatric and medical phenotypes. We describe ongoing efforts to enhance HPO through curated psychiatric terms, refined definitions, and structured mappings of observed phenomena. The Global Bipolar Cohort (GBC), an international collaboration, exemplifies this approach through the development of a consensus-driven ontology tailored to bipolar disorder. By supporting semantic interoperability, reproducible research, and individualized care, ontology-based approaches provide essential infrastructure for overcoming the limitations of classification systems and advancing data-driven precision psychiatry.
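As an illustration of the semantic structure the authors argue flat classification codes lack, the sketch below encodes a tiny is-a hierarchy and answers subsumption queries over it; this transitive-closure step is what lets a record annotated with a specific term be retrieved by queries for any broader class. The term names are hypothetical stand-ins, not real HPO identifiers.

```python
# Toy is-a ontology: child -> list of parents (a DAG, as in HPO).
IS_A = {
    "manic_episode": ["mood_disturbance"],
    "depressive_episode": ["mood_disturbance"],
    "mood_disturbance": ["behavioral_abnormality"],
    "behavioral_abnormality": ["phenotypic_abnormality"],
    "phenotypic_abnormality": [],
}

def ancestors(term: str) -> set[str]:
    """Transitive closure over is-a links (the basis of subsumption queries)."""
    seen: set[str] = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen

def is_subclass_of(term: str, candidate: str) -> bool:
    return candidate in ancestors(term)

if __name__ == "__main__":
    # A record annotated "manic_episode" is retrievable by a query for the
    # broader class "behavioral_abnormality" without any manual mapping.
    print(is_subclass_of("manic_episode", "behavioral_abnormality"))  # True
```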
Collapse
Affiliation(s)
| | - Ben Coleman
- University of Connecticut, Farmington, CT, USA
| | - Eric Hurwitz
- University of North Carolina, Chapel Hill, NC, USA
| | - Peter N Robinson
- University of Connecticut, Farmington, CT, USA; Berlin Institute of Health at Charité, Berlin, Germany
| | | | | | | |
Collapse
|
22
|
Chen YC, Lee SH, Sheu H, Lin SH, Hu CC, Fu SC, Yang CP, Lin YC. Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty. BMC Med Inform Decis Mak 2025; 25:196. [PMID: 40410756 PMCID: PMC12102839 DOI: 10.1186/s12911-025-03024-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2025] [Accepted: 05/09/2025] [Indexed: 05/25/2025] Open
Abstract
BACKGROUND The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts. METHODS Four leading LLMs (GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus) were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question-only), and one under role-playing prompting (instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons evaluated responses for accuracy and comprehensiveness on a 5-point Likert scale, along with a binary measure for acceptability. Statistical analyses (Wilcoxon rank sum and Chi-squared tests; P < 0.05) were conducted to compare model performance. RESULTS ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, and 72.5%, respectively). Google Gemini and Claude 3 Opus demonstrated lower performance across all metrics. In between-model comparisons based on zero-shot prompting, ChatGPT-4 achieved significantly higher scores for both accuracy and comprehensiveness relative to Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002), and demonstrated higher acceptability than Claude 3 Opus (P = 0.006). Within-model comparisons showed that role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude. CONCLUSIONS This study demonstrates that role-playing prompts significantly enhance the performance of LLMs, particularly ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4 with role-playing prompts showed superior performance in terms of accuracy, comprehensiveness, and acceptability. Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice. CLINICAL TRIAL NUMBER Not applicable.
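The study's two prompting conditions are simple to picture in code. The sketch below builds the zero-shot and role-playing message variants side by side; the role wording is an illustrative assumption rather than the authors' exact prompt, and ask_llm is a hypothetical stub for any chat-completion client.

```python
# Contrast of zero-shot vs role-playing prompting for a TKA patient FAQ.
ROLE_PREAMBLE = (
    "You are an experienced orthopaedic surgeon specializing in total knee "
    "arthroplasty. Answer the patient's question accurately and thoroughly."
)

def build_messages(question: str, role_playing: bool) -> list[dict]:
    messages = []
    if role_playing:
        # Role-playing condition: prepend a persona as the system message.
        messages.append({"role": "system", "content": ROLE_PREAMBLE})
    # Zero-shot condition: the user question is sent on its own.
    messages.append({"role": "user", "content": question})
    return messages

def ask_llm(messages: list[dict]) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

if __name__ == "__main__":
    q = "How long does recovery take after total knee arthroplasty?"
    print(build_messages(q, role_playing=False))  # zero-shot
    print(build_messages(q, role_playing=True))   # role-playing
```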
Collapse
Affiliation(s)
- Yi-Chen Chen
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Sheng-Hsun Lee
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Huan Sheu
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Sheng-Hsuan Lin
- Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Institute of Data Science and Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Applied Mathematics, National Dong Hwa University, Hualien, Taiwan
- Department of Biochemical and Molecular Medical Sciences, National Dong Hwa University, Hualien, Taiwan
| | - Chih-Chien Hu
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Shih-Chen Fu
- Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biochemical and Molecular Medical Sciences, National Dong Hwa University, Hualien, Taiwan
| | - Cheng-Pang Yang
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.
| | - Yu-Chih Lin
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.
| |
Collapse
|
23
|
Khan M, Ahuja K, Tsirikos AI. AI and machine learning in paediatric spine deformity surgery. Bone Jt Open 2025; 6:569-581. [PMID: 40407025 PMCID: PMC12100669 DOI: 10.1302/2633-1462.65.bjo-2024-0089.r1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 05/26/2025] Open
Abstract
Paediatric spine deformity surgery is a high-stakes procedure that demands exceptional anatomical knowledge and precise visuospatial awareness from the surgeon. Demand for precision medicine is increasing, and rapid advancements in computational technologies, including the recent explosion of AI and machine learning (ML), have made it attainable. We present the surgical and ethical applications of AI and ML in diagnosis, prognosis, image processing, and outcomes in the field of paediatric spine deformity.
Collapse
Affiliation(s)
- Mohsin Khan
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| | - Kaustubh Ahuja
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| | - Athanasios I Tsirikos
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| |
Collapse
|
24
|
Mao C, Li J, Pang PCI, Zhu Q, Chen R. Identifying Kidney Stone Risk Factors Through Patient Experiences With a Large Language Model: Text Analysis and Empirical Study. J Med Internet Res 2025; 27:e66365. [PMID: 40403294 DOI: 10.2196/66365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2024] [Revised: 12/16/2024] [Accepted: 04/10/2025] [Indexed: 05/24/2025] Open
Abstract
BACKGROUND Kidney stones, a prevalent urinary disease, pose significant health risks. Factors like insufficient water intake or a high-protein diet increase an individual's susceptibility to the disease. Social media platforms can be a valuable avenue for users to share their experiences in managing these risk factors. Analyzing such patient-reported information can provide crucial insights into risk factors, potentially leading to improved quality of life for other patients. OBJECTIVE This study aims to develop KSrisk-GPT, a model based on a large language model (LLM), to identify potential kidney stone risk factors from web-based user experiences. METHODS This study collected posts on the topic of kidney stones published on Zhihu over the past 5 years, yielding 11,819 user comments. Experts organized the most common risk factors for kidney stones into six categories. We then used least-to-most prompting, a chain-of-thought strategy, to enable GPT-4.0 to reason like an expert, asking it to identify risk factors from the comments. Metrics including accuracy, precision, recall, and F1-score were used to evaluate the model's performance. RESULTS Our proposed method outperformed other models in identifying comments containing risk factors, with 95.9% accuracy and F1-score, a precision of 95.6%, and a recall of 96.2%. Of the 863 comments identified as containing risk factors, the most frequently mentioned risk factors in Zhihu user discussions included dietary habits (high protein, high calcium intake), insufficient water intake, genetic factors, and lifestyle. In addition, GPT identified new potential risk factors, such as excessive use of supplements like vitamin C and calcium, laxatives, and hyperparathyroidism. CONCLUSIONS Comments from social media users offer a new data source for disease prevention and understanding patient journeys. Our method not only sheds light on using LLMs to efficiently summarize risk factors from social media data but also on LLMs' potential to identify new potential factors from the patient's perspective.
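Least-to-most prompting orders subquestions from easiest to hardest and feeds each answer into the next step. The sketch below shows one plausible decomposition of the comment-classification task; the step wording, category list, and ask_llm stub are illustrative assumptions, not the authors' actual prompts.

```python
# Least-to-most (easiest-first) chain-of-thought decomposition, sketched.
RISK_CATEGORIES = [
    "dietary habits", "insufficient water intake", "genetic factors",
    "lifestyle", "medication/supplement use", "underlying disease",
]

def least_to_most_prompts(comment: str) -> list[str]:
    """Subquestions ordered from easiest to hardest."""
    return [
        f"Comment: {comment!r}\nStep 1: Does this comment describe a personal "
        "experience with kidney stones? Answer yes or no.",
        "Step 2: If yes, list any behaviors or conditions the commenter links "
        "to their stones.",
        "Step 3: Map each behavior or condition to one of these categories, or "
        f"flag it as a new potential risk factor: {RISK_CATEGORIES}.",
    ]

def ask_llm(prompt: str, history: list[str]) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

def classify(comment: str) -> list[str]:
    history: list[str] = []
    for prompt in least_to_most_prompts(comment):
        history.append(ask_llm(prompt, history))  # each step sees prior answers
    return history

if __name__ == "__main__":
    for p in least_to_most_prompts("I got stones after megadosing vitamin C"):
        print(p, "\n")
```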
Collapse
Affiliation(s)
- Chao Mao
- MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao, Macao
| | - Jiaxuan Li
- MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao, Macao
| | - Patrick Cheong-Iao Pang
- MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao, Macao
| | - Quanjing Zhu
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
| | - Rong Chen
- Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-Sen University, Guangzhou, China
| |
Collapse
|
25
|
Ma J, Yu J, Xie A, Huang T, Liu W, Ma M, Tao Y, Zang F, Zheng Q, Zhu W, Chen Y, Ning M, Zhu Y. Large language model evaluation in autoimmune disease clinical questions comparing ChatGPT 4o, Claude 3.5 Sonnet and Gemini 1.5 pro. Sci Rep 2025; 15:17635. [PMID: 40399509 PMCID: PMC12095533 DOI: 10.1038/s41598-025-02601-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2024] [Accepted: 05/14/2025] [Indexed: 05/23/2025] Open
Abstract
Large language models (LLMs) have established a presence in providing medical services to patients and supporting clinical practice for doctors. To explore the ability of LLMs in answering clinical questions related to autoimmune diseases, this study was designed with 65 questions related to autoimmune diseases, covering five domains: concepts, report interpretation, diagnosis, prevention and treatment, and prognosis. Types of diseases include Sjögren's syndrome, systemic lupus erythematosus, rheumatoid arthritis, systemic sclerosis, and others. These questions were answered by three LLMs: ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The responses were then evaluated by 8 clinicians based on criteria including relevance, completeness, accuracy, safety, readability, and simplicity. We analyzed the scores of the three LLMs across five domains and six dimensions and compared their accuracy in answering the report interpretation section with that of two senior doctors and two junior doctors. The results showed that the performance of the three LLMs in the evaluation of autoimmune diseases significantly surpassed that of both junior and senior doctors. Notably, Claude 3.5 Sonnet excelled in providing comprehensive and accurate responses to clinical questions on autoimmune diseases, demonstrating the great potential of LLMs in assisting doctors with the diagnosis, treatment, and management of autoimmune diseases.
Collapse
Affiliation(s)
- Juntao Ma
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Jie Yu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Anran Xie
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Taihong Huang
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Wenjing Liu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Mengyin Ma
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Yue Tao
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Fuyu Zang
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Qisi Zheng
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Wenbo Zhu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Yuxin Chen
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
| | - Mingzhe Ning
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
- Yizheng Hospital of Nanjing Drum Tower Hospital Group, Yizheng 211900, Yangzhou, Jiangsu, China.
| | - Yijia Zhu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
| |
Collapse
|
26
|
Alter IL, Dias C, Briano J, Rameau A. Digital health technologies in swallowing care from screening to rehabilitation: A narrative review. Auris Nasus Larynx 2025; 52:319-326. [PMID: 40403345 DOI: 10.1016/j.anl.2025.05.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2025] [Revised: 05/14/2025] [Accepted: 05/16/2025] [Indexed: 05/24/2025]
Abstract
OBJECTIVES Digital health technologies (DHTs) have rapidly advanced in the past two decades, through developments in mobile and wearable devices and most recently with the explosion of artificial intelligence (AI) capabilities and subsequent extension into the health space. DHT has myriad potential applications to deglutology, many of which have undergone promising investigations and developments in recent years. We present the first literature review on applications of DHT in swallowing health, from screening to therapeutics. Public health interventions for swallowing care are increasingly needed in the setting of aging populations in the West and East Asia, and DHT may offer a scalable and low-cost solution. METHODS A narrative review was performed using PubMed and Google Scholar to identify recent research on applications of AI and digital health in swallow practice. Database searches, conducted in September 2024, included terms such as "digital," "AI," "machine learning," "tools" in combination with "deglutition," "Otolaryngology," "Head and Neck," "speech language pathology," "swallow," and "dysphagia." Primary literature pertaining to digital health in deglutology was included for review. RESULTS We review the various applications of DHT in swallowing care, including prevention, screening, diagnosis, treatment planning and rehabilitation. CONCLUSION DHT may offer innovative and scalable solutions for swallowing care as public health needs grow and in the setting of limited specialized healthcare workforce. These technological advances are also being explored as time and resource saving solutions at many points of care in swallow practice. DHT could bring affordable and accurate information for self-management of dysphagia to broader patient populations that otherwise lack access to expert providers.
Collapse
Affiliation(s)
- Isaac L Alter
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA
| | - Carla Dias
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA
| | - Jack Briano
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA
| | - Anaïs Rameau
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA.
| |
Collapse
|
27
|
Bai X, Wang S, Zhao Y, Feng M, Ma W, Liu X. Application of AI Chatbot in Responding to Asynchronous Text-Based Messages From Patients With Cancer: Comparative Study. J Med Internet Res 2025; 27:e67462. [PMID: 40397947 DOI: 10.2196/67462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2024] [Revised: 12/22/2024] [Accepted: 04/14/2025] [Indexed: 05/23/2025] Open
Abstract
BACKGROUND Telemedicine, which incorporates artificial intelligence such as chatbots, offers significant potential for enhancing health care delivery. However, the efficacy of artificial intelligence chatbots compared to human physicians in clinical settings remains underexplored, particularly in complex scenarios involving patients with cancer and asynchronous text-based interactions. OBJECTIVE This study aimed to evaluate the performance of the GPT-4 (OpenAI) chatbot in responding to asynchronous text-based medical messages from patients with cancer by comparing its responses with those of physicians across two clinical scenarios: patient education and medical decision-making. METHODS We collected 4257 deidentified asynchronous text-based medical consultation records from 17 oncologists across China between January 1, 2020, and March 31, 2024. Each record included patient questions, demographic data, and disease-related details. The records were categorized into two scenarios: patient education (eg, symptom explanations and test interpretations) and medical decision-making (eg, treatment planning). The GPT-4 chatbot was used to simulate physician responses to these records, with each session conducted in a new conversation to avoid cross-session interference. The chatbot responses, along with the original physician responses, were evaluated by a medical review panel (3 oncologists) and a patient panel (20 patients with cancer). The medical panel assessed completeness, accuracy, and safety using a 3-level scale, whereas the patient panel rated completeness, trustworthiness, and empathy on a 5-point ordinal scale. Statistical analyses included chi-square tests for categorical variables and Wilcoxon signed-rank tests for ordinal ratings. RESULTS In the patient education scenario (n=2364), the chatbot scored higher than physicians in completeness (n=2301, 97.34% vs n=2213, 93.61% for fully complete responses; P=.002), with no significant differences in accuracy or safety (P>.05). In the medical decision-making scenario (n=1893), the chatbot exhibited lower accuracy (n=1834, 96.88% vs n=1855, 97.99% for fully accurate responses; P<.001) and trustworthiness (n=860, 50.71% vs n=1766, 93.29% rated as "Moderately trustworthy" or higher; P<.001) compared with physicians. Regarding empathy, the medical review panel rated the chatbot as demonstrating higher empathy scores across both scenarios, whereas the patient review panel reached the opposite conclusion, consistently favoring physicians in empathetic communication. Errors in chatbot responses were primarily due to misinterpretations of medical terminology or the lack of updated guidelines, with 3.12% (59/1893) of its responses potentially leading to adverse outcomes, compared with 2.01% (38/1893) for physicians. CONCLUSIONS The GPT-4 chatbot performs comparably to physicians in patient education by providing comprehensive and empathetic responses. However, its reliability in medical decision-making remains limited, particularly in complex scenarios requiring nuanced clinical judgment. These findings underscore the chatbot's potential as a supplementary tool in telemedicine while highlighting the need for physician oversight to ensure patient safety and accuracy.
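For readers unfamiliar with the paired test used in this study, the snippet below applies the Wilcoxon signed-rank test to paired ordinal ratings of chatbot versus physician responses to the same questions. The scores are invented for illustration, not study data.

```python
# Paired comparison of ordinal (1-5) ratings with the Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

chatbot_scores   = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
physician_scores = [4, 4, 3, 5, 3, 3, 4, 4, 3, 4]

# Ordinal data produce many zero differences; zero_method="pratt" keeps them
# in the ranking instead of silently discarding those pairs.
stat, p = wilcoxon(chatbot_scores, physician_scores, zero_method="pratt")
print(f"W = {stat:.1f}, P = {p:.3f}")
```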
Collapse
Affiliation(s)
- Xuexue Bai
- Department of Neurosurgery, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Shiyong Wang
- Department of Neurosurgery, First Affiliated Hospital of Jinan University, Guangzhou, China
| | - Yuanli Zhao
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Ming Feng
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Wenbin Ma
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Xiaomin Liu
- Head and Neck Neuro-Oncology Center, Tianjin Huanhu Hospital, Tianjin, China
| |
Collapse
|
28
|
Andras D, Ilies RA, Esanu V, Agoston S, Marginean Jumate TF, Dindelegan GC. Artificial Intelligence as a Potential Tool for Predicting Surgical Margin Status in Early Breast Cancer Using Mammographic Specimen Images. Diagnostics (Basel) 2025; 15:1276. [PMID: 40428269 DOI: 10.3390/diagnostics15101276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2025] [Revised: 05/10/2025] [Accepted: 05/13/2025] [Indexed: 05/29/2025] Open
Abstract
Background/Objectives: Breast cancer is the most common malignancy among women globally, with an increasing incidence, particularly in younger populations. Achieving complete surgical excision is essential to reduce recurrence. Artificial intelligence (AI), including large language models like ChatGPT, has potential for supporting diagnostic tasks, though its role in surgical oncology remains limited. Methods: This retrospective study evaluated ChatGPT's performance (ChatGPT-4, OpenAI, March 2025) in predicting surgical margin status (R0 or R1) based on intraoperative mammograms of lumpectomy specimens. AI-generated responses were compared with histopathological findings. Performance was evaluated using sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), F1 score, and Cohen's kappa coefficient. Results: In a total of 100 patients, ChatGPT achieved an accuracy of 84.0% in predicting surgical margin status. Sensitivity for identifying R1 cases (incomplete excision) was 60.0%, while specificity for R0 (complete excision) was 86.7%. The PPV was 33.3% and the NPV was 95.1%. The F1 score for R1 classification was 0.43, and Cohen's kappa coefficient was 0.34, indicating moderate agreement with histopathological findings. Conclusions: ChatGPT demonstrated moderate accuracy in confirming complete excision but showed limited reliability in identifying incomplete margins. While promising, these findings emphasize the need for domain-specific training and further validation before such models can be implemented in clinical breast cancer workflows.
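The reported metrics are mutually consistent with a single 2x2 confusion matrix (TP=6, FP=12, FN=4, TN=78, taking R1 as the positive class). The paper does not print this table; the counts below are reconstructed from the published metrics purely to show how each statistic is derived.

```python
# Reconstructed 2x2 table (R1 = positive class) and the derived metrics.
tp, fp, fn, tn = 6, 12, 4, 78
n = tp + fp + fn + tn  # 100 patients

sensitivity = tp / (tp + fn)               # 0.600 -> 60.0%
specificity = tn / (tn + fp)               # 0.867 -> 86.7%
ppv         = tp / (tp + fp)               # 0.333 -> 33.3%
npv         = tn / (tn + fn)               # 0.951 -> 95.1%
accuracy    = (tp + tn) / n                # 0.840 -> 84.0%
f1          = 2 * tp / (2 * tp + fp + fn)  # 0.429 -> ~0.43

# Cohen's kappa: observed agreement corrected for chance agreement.
p_o = accuracy
p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (p_o - p_e) / (1 - p_e)            # 0.344 -> ~0.34

print(f"sens={sensitivity:.3f} spec={specificity:.3f} ppv={ppv:.3f} "
      f"npv={npv:.3f} acc={accuracy:.3f} f1={f1:.2f} kappa={kappa:.2f}")
```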
Collapse
Affiliation(s)
- David Andras
- Department of General Surgery, Iuliu Hatieganu University of Medicine and Pharmacy, 400006 Cluj-Napoca, Romania
- First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
| | - Radu Alexandru Ilies
- Faculty of Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 400012 Cluj-Napoca, Romania
| | - Victor Esanu
- First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
| | - Stefan Agoston
- Faculty of Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 400012 Cluj-Napoca, Romania
| | | | - George Calin Dindelegan
- Department of General Surgery, Iuliu Hatieganu University of Medicine and Pharmacy, 400006 Cluj-Napoca, Romania
- First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
| |
Collapse
|
29
|
Shashikumar SP, Mohammadi S, Krishnamoorthy R, Patel A, Wardi G, Ahn JC, Singh K, Aronoff-Spencer E, Nemati S. Development and prospective implementation of a large language model based system for early sepsis prediction. NPJ Digit Med 2025; 8:290. [PMID: 40379845 PMCID: PMC12084535 DOI: 10.1038/s41746-025-01689-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2024] [Accepted: 04/27/2025] [Indexed: 05/19/2025] Open
Abstract
Sepsis is a dysregulated host response to infection with high mortality and morbidity. Early detection and intervention have been shown to improve patient outcomes, but existing computational models relying on structured electronic health record data often miss contextual information from unstructured clinical notes. This study introduces COMPOSER-LLM, an open-source large language model (LLM) integrated with the COMPOSER model to enhance early sepsis prediction. For high-uncertainty predictions, the LLM extracts additional context to assess sepsis-mimics, improving accuracy. Evaluated on 2500 patient encounters, COMPOSER-LLM achieved a sensitivity of 72.1%, positive predictive value of 52.9%, F-1 score of 61.0%, and 0.0087 false alarms per patient hour, outperforming the standalone COMPOSER model. Prospective validation yielded similar results. Manual chart review found 62% of false positives had bacterial infections, demonstrating potential clinical utility. Our findings suggest that integrating LLMs with traditional models can enhance predictive performance by leveraging unstructured data, representing a significant advance in healthcare analytics.
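The uncertainty-gated design described here fits in a few lines: the structured-data model scores every encounter, and only borderline scores trigger an LLM pass over the notes to screen for sepsis-mimics before an alarm fires. The thresholds and function bodies below are illustrative assumptions, not the published COMPOSER-LLM implementation.

```python
# Sketch of an uncertainty-gated hybrid: structured model + LLM fallback.
def composer_score(structured_features: dict) -> float:
    """Stand-in for the COMPOSER model's sepsis risk score in [0, 1]."""
    raise NotImplementedError

def llm_mimic_check(clinical_notes: str) -> bool:
    """Stand-in for the LLM: True if the notes suggest a sepsis-mimic."""
    raise NotImplementedError

def composer_llm(structured_features: dict, clinical_notes: str,
                 alert_at: float = 0.6, uncertain_band: float = 0.15) -> bool:
    risk = composer_score(structured_features)
    if abs(risk - alert_at) <= uncertain_band:
        # High-uncertainty zone: consult the unstructured notes before alerting.
        if llm_mimic_check(clinical_notes):
            return False  # suppress the alarm; presentation resembles a mimic
    return risk >= alert_at
```

The point of the gate is cost and precision: the LLM runs only on the small fraction of encounters where the structured model is least certain, which is where the abstract reports the added context paying off in fewer false alarms.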
Collapse
Affiliation(s)
| | - Sina Mohammadi
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
| | | | - Avi Patel
- Department of Emergency Medicine, UC San Diego, San Diego, CA, USA
| | - Gabriel Wardi
- Department of Emergency Medicine, UC San Diego, San Diego, CA, USA
- Division of Pulmonary, Critical Care and Sleep Medicine, UC San Diego, San Diego, CA, USA
| | - Joseph C Ahn
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
- Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA
| | - Karandeep Singh
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
- Jacobs Center for Health Innovation, UC San Diego Health, San Diego, CA, USA
| | - Eliah Aronoff-Spencer
- Division of Infectious Diseases and Global Public Health, UC San Diego, San Diego, CA, USA
| | - Shamim Nemati
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA.
| |
Collapse
|
30
|
Kanani MM, Monawer A, Brown L, King WE, Miller ZD, Venugopal N, Heagerty PJ, Jarvik JG, Cohen T, Cross NM. High-Performance Prompting for LLM Extraction of Compression Fracture Findings from Radiology Reports. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2025:10.1007/s10278-025-01530-6. [PMID: 40379860 DOI: 10.1007/s10278-025-01530-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/17/2025] [Revised: 04/20/2025] [Accepted: 04/28/2025] [Indexed: 05/19/2025]
Abstract
Extracting information from radiology reports can provide critical data to empower many radiology workflows. For spinal compression fractures, these data can facilitate evidence-based care for at-risk populations. Manual extraction from free-text reports is laborious and error-prone. Large language models (LLMs) have shown promise; however, fine-tuning strategies to optimize performance on specific tasks can be resource intensive. A variety of prompting strategies have achieved similar results with fewer demands. Our study pioneers the use of Meta's Llama 3.1, together with prompt-based strategies, for automated extraction of compression fractures from free-text radiology reports, outputting structured data without model training. We tested performance on a time-based sample of CT exams covering the spine from 2/20/2024 to 2/22/2024 acquired across our healthcare enterprise (637 anonymized reports; age 18-102; 47% female). Ground truth annotations were manually generated and compared against the performance of three models (Llama 3.1 70B, Llama 3.1 8B, and Vicuna 13B) with nine different prompting configurations, for a total of 27 model/prompt experiments. The highest F1 score (0.91) was achieved by the 70B Llama 3.1 model when provided with a radiologist-written background, with similar results when the background was written by a separate LLM (0.86). The addition of few-shot examples to these prompts had variable impact on F1 measurements (0.89 and 0.84, respectively). Comparable ROC-AUC and PR-AUC performance was observed. Our work demonstrates that an open-weights LLM can excel at extracting compression fracture findings from free-text radiology reports using prompt-based techniques, without requiring extensive manually labeled examples for model training.
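A prompt-based extraction pipeline of this kind has two moving parts: a prompt that prepends background context to the report and requests structured output, and a parser that tolerates stray prose around the model's JSON. The background text and output schema below are illustrative assumptions, not the authors' prompts.

```python
# Sketch of a background-plus-report extraction prompt with JSON output.
import json

BACKGROUND = (
    "Background: Vertebral compression fractures are height losses of a "
    "vertebral body, often described by level and severity; findings may be "
    "stated explicitly or implied."
)

def build_prompt(report_text: str) -> str:
    return (
        f"{BACKGROUND}\n\n"
        f"Radiology report:\n{report_text}\n\n"
        'Respond with JSON only: {"compression_fracture": true|false, '
        '"levels": ["T12", ...]}'
    )

def parse_response(raw: str) -> dict:
    """Tolerate leading/trailing prose around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    return json.loads(raw[start:end])

if __name__ == "__main__":
    print(build_prompt("Mild anterior wedging of L1 vertebral body."))
    print(parse_response('Sure! {"compression_fracture": true, "levels": ["L1"]}'))
```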
Collapse
Affiliation(s)
| | - Arezu Monawer
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Lauryn Brown
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - William E King
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Zachary D Miller
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Nitin Venugopal
- Department of Radiology, University of Washington, Seattle, WA, USA
| | | | - Jeffrey G Jarvik
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Trevor Cohen
- Department of Biomedical Informatics, University of Washington, Seattle, WA, USA
| | - Nathan M Cross
- Department of Radiology, University of Washington, Seattle, WA, USA
| |
Collapse
|
31
|
Omar M, Agbareia R, Glicksberg BS, Nadkarni GN, Klang E. Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study. JMIR Med Inform 2025; 13:e66917. [PMID: 40378406 PMCID: PMC12101789 DOI: 10.2196/66917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2024] [Revised: 01/31/2025] [Accepted: 01/31/2025] [Indexed: 05/18/2025] Open
Abstract
Background The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored. Objective This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses. Methods We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests. Results The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model-GPT-4o-had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model-Qwen2-7B-showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003). Conclusions Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.
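The study's two headline analyses are straightforward to reproduce on any model's outputs: correlate per-model accuracy with mean confidence on correct answers across models, and compare confidence on correct versus incorrect answers within a model. The numbers below are invented for illustration, not the study's measurements.

```python
# Confidence-calibration checks sketched on invented data.
from scipy.stats import pearsonr, ttest_ind

# (1) Across models: one (accuracy, mean confidence on correct answers) pair each.
accuracy        = [0.74, 0.69, 0.61, 0.55, 0.46]
mean_confidence = [0.63, 0.66, 0.71, 0.74, 0.76]
r, p = pearsonr(accuracy, mean_confidence)
print(f"r = {r:.2f}, P = {p:.3f}")  # negative r: weaker models are more confident

# (2) Within one model: confidence gap between correct and incorrect answers,
#     tested with a 2-sample, 2-tailed t test as in the study.
conf_correct   = [0.82, 0.75, 0.70, 0.88, 0.79]
conf_incorrect = [0.77, 0.74, 0.69, 0.80, 0.78]
t, p2 = ttest_ind(conf_correct, conf_incorrect)
gap = sum(conf_correct) / 5 - sum(conf_incorrect) / 5
print(f"mean gap = {gap:.3f}, P = {p2:.3f}")
```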
Collapse
Affiliation(s)
- Mahmud Omar
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| | - Reem Agbareia
- Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
| | - Benjamin S Glicksberg
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| | - Girish N Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| | - Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| |
Collapse
|
32
|
Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, Bensahla A, Bjelogrlic M, Lovis C. Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. J Med Internet Res 2025; 27:e68998. [PMID: 40371947 DOI: 10.2196/68998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2024] [Revised: 02/21/2025] [Accepted: 03/12/2025] [Indexed: 05/16/2025] Open
Abstract
BACKGROUND Information overload in electronic health records requires effective solutions to alleviate clinicians' administrative tasks. Automatically summarizing clinical text has gained significant attention with the rise of large language models. While individual studies show optimism, a structured overview of the research landscape is lacking. OBJECTIVE This study aims to present the current state of the art on clinical text summarization using large language models, evaluate the level of evidence in existing research and assess the applicability of performance findings in clinical settings. METHODS This scoping review complied with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Literature published between January 1, 2019, and June 18, 2024, was identified from 5 databases: PubMed, Embase, Web of Science, IEEE Xplore, and ACM Digital Library. Studies were excluded if they did not describe transformer-based models, did not focus on clinical text summarization, did not engage with free-text data, were not original research, were nonretrievable, were not peer-reviewed, or were not in English, French, Spanish, or German. Data related to study context and characteristics, scope of research, and evaluation methodologies were systematically collected and analyzed by 3 authors independently. RESULTS A total of 30 original studies were included in the analysis. All used observational retrospective designs, mainly using real patient data (n=28, 93%). The research landscape demonstrated a narrow research focus, often centered on summarizing radiology reports (n=17, 57%), primarily involving data from the intensive care unit (n=15, 50%) of US-based institutions (n=19, 73%), in English (n=26, 87%). This focus aligned with the frequent reliance on the open-source Medical Information Mart for Intensive Care dataset (n=15, 50%). Summarization methodologies predominantly involved abstractive approaches (n=17, 57%) on single-document inputs (n=4, 13%) with unstructured data (n=13, 43%), yet reporting on methodological details remained inconsistent across studies. Model selection involved both open-source models (n=26, 87%) and proprietary models (n=7, 23%). Evaluation frameworks were highly heterogeneous. All studies conducted internal validation, but external validation (n=2, 7%), failure analysis (n=6, 20%), and patient safety risks analysis (n=1, 3%) were infrequent, and none reported bias assessment. Most studies used both automated metrics and human evaluation (n=16, 53%), while 10 (33%) used only automated metrics, and 4 (13%) only human evaluation. CONCLUSIONS Key barriers hinder the translation of current research into trustworthy, clinically valid applications. Current research remains exploratory and limited in scope, with many applications yet to be explored. Performance assessments often lack reliability, and clinical impact evaluations are insufficient, raising concerns about model utility, safety, fairness, and data privacy. Advancing the field requires more robust evaluation frameworks, a broader research scope, and a stronger focus on real-world applicability.
Collapse
Affiliation(s)
- Lydie Bednarczyk
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
| | - Daniel Reichenpfader
- Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | | | - Amon Kenna Ette
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Jamil Zaghir
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Yuanyuan Zheng
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Adel Bensahla
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Mina Bjelogrlic
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Christian Lovis
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| |
Collapse
|
33
|
Wang C, Wang F, Li S, Ren QW, Tan X, Fu Y, Liu D, Qian G, Cao Y, Yin R, Li K. Patient Triage and Guidance in Emergency Departments Using Large Language Models: Multimetric Study. J Med Internet Res 2025; 27:e71613. [PMID: 40374171 DOI: 10.2196/71613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2025] [Revised: 03/13/2025] [Accepted: 05/01/2025] [Indexed: 05/17/2025] Open
Abstract
BACKGROUND Emergency departments (EDs) face significant challenges due to overcrowding, prolonged waiting times, and staff shortages, leading to increased strain on health care systems. Efficient triage systems and accurate departmental guidance are critical for alleviating these pressures. Recent advancements in large language models (LLMs), such as ChatGPT, offer potential solutions for improving patient triage and outpatient department selection in emergency settings. OBJECTIVE The study aimed to assess the accuracy, consistency, and feasibility of GPT-4-based ChatGPT models (GPT-4o and GPT-4-Turbo) for patient triage using the Modified Early Warning Score (MEWS) and evaluate GPT-4o's ability to provide accurate outpatient department guidance based on simulated patient scenarios. METHODS A 2-phase experimental study was conducted. In the first phase, 2 ChatGPT models (GPT-4o and GPT-4-Turbo) were evaluated for MEWS-based patient triage accuracy using 1854 simulated patient scenarios. Accuracy and consistency were assessed before and after prompt engineering. In the second phase, GPT-4o was tested for outpatient department selection accuracy using 264 scenarios sourced from the Chinese Medical Case Repository. Each scenario was independently evaluated by GPT-4o thrice. Data analyses included Wilcoxon tests, Kendall correlation coefficients, and logistic regression analyses. RESULTS In the first phase, ChatGPT's triage accuracy, based on MEWS, improved following prompt engineering. Interestingly, GPT-4-Turbo outperformed GPT-4o. GPT-4-Turbo achieved an accuracy of 100% compared to GPT-4o's accuracy of 96.2%, despite GPT-4o initially showing better performance prior to prompt engineering. This finding suggests that GPT-4-Turbo may be more adaptable to prompt optimization. In the second phase, GPT-4o, with superior performance on emotional responsiveness compared to GPT-4-Turbo, demonstrated an overall guidance accuracy of 92.63% (95% CI 90.34%-94.93%), with the highest accuracy in internal medicine (93.51%, 95% CI 90.85%-96.17%) and the lowest in general surgery (91.46%, 95% CI 86.50%-96.43%). CONCLUSIONS ChatGPT demonstrated promising capability for supporting patient triage and outpatient guidance in EDs. GPT-4-Turbo showed greater adaptability to prompt engineering, whereas GPT-4o exhibited superior responsiveness and emotional interaction, which are essential for patient-facing tasks. Future studies should explore real-world implementation and address the identified limitations to enhance ChatGPT's clinical integration.
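For reference, the snippet below implements one commonly published MEWS scheme of the kind the models were asked to apply: points for systolic blood pressure, heart rate, respiratory rate, temperature, and AVPU level, summed into a single score. Threshold bands vary between institutions, so these values are illustrative rather than the study's exact rubric.

```python
# One commonly published MEWS band set; institutional variants differ.
def mews(sys_bp: float, heart_rate: float, resp_rate: float,
         temp_c: float, avpu: str) -> int:
    score = 0
    # Systolic blood pressure (mmHg)
    if sys_bp <= 70: score += 3
    elif sys_bp <= 80: score += 2
    elif sys_bp <= 100: score += 1
    elif sys_bp >= 200: score += 2   # 101-199 scores 0
    # Heart rate (beats/min)
    if heart_rate < 40: score += 2
    elif heart_rate <= 50: score += 1
    elif heart_rate <= 100: pass
    elif heart_rate <= 110: score += 1
    elif heart_rate < 130: score += 2
    else: score += 3
    # Respiratory rate (breaths/min)
    if resp_rate < 9: score += 2
    elif resp_rate <= 14: pass
    elif resp_rate <= 20: score += 1
    elif resp_rate <= 29: score += 2
    else: score += 3
    # Temperature (degrees C)
    if temp_c < 35.0 or temp_c >= 38.5: score += 2
    # Neurological response (AVPU)
    score += {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}[avpu]
    return score

if __name__ == "__main__":
    # 85 mmHg (+1), 115 bpm (+2), 24/min (+2), 38.7 C (+2), responds to voice (+1)
    print(mews(sys_bp=85, heart_rate=115, resp_rate=24, temp_c=38.7,
               avpu="voice"))  # 8
```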
Collapse
Affiliation(s)
- Chenxu Wang
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Industrial Engineering, Sichuan University, Chengdu, China
| | - Fei Wang
- Department of Nursing, West China School of Medicine, Sichuan University, Chengdu, China
| | - Shuhan Li
- Department of Industrial Engineering, Sichuan University, Chengdu, China
| | - Qing-Wen Ren
- Department of Medicine, Queen Mary Hospital, University of Hong Kong, Hong Kong, China
| | - Xiaomei Tan
- Department of Industrial Engineering, Sichuan University, Chengdu, China
| | - Yaoyu Fu
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
| | - Di Liu
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Industrial Engineering, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| | - Guangwu Qian
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Computer Science, Sichuan University, Chengdu, China
| | - Yu Cao
- Department of Emergency Medicine, West China Hospital of Sichuan University, Chengdu, China
| | - Rong Yin
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Industrial Engineering, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| | - Kang Li
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| |
Collapse
|
34
|
Jiao C, Rosas E, Asadigandomani H, Delsoz M, Madadi Y, Raja H, Munir WM, Tamm B, Mehravaran S, Djalilian AR, Yousefi S, Soleimani M. Diagnostic Performance of Publicly Available Large Language Models in Corneal Diseases: A Comparison with Human Specialists. Diagnostics (Basel) 2025; 15:1221. [PMID: 40428214 DOI: 10.3390/diagnostics15101221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2025] [Revised: 05/10/2025] [Accepted: 05/10/2025] [Indexed: 05/29/2025] Open
Abstract
Background/Objectives: This study evaluated the diagnostic accuracy of seven publicly available large language models (LLMs)-GPT-3.5, GPT-4.o Mini, GPT-4.o, Gemini 1.5 Flash, Claude 3.5 Sonnet, Grok3, and DeepSeek R1-in diagnosing corneal diseases, comparing their performance to human specialists. Methods: Twenty corneal disease cases from the University of Iowa's EyeRounds were presented to each LLM. Diagnostic accuracy was determined by comparing LLM-generated diagnoses to the confirmed case diagnoses. Four human cornea specialists evaluated the same cases to establish a benchmark and assess interobserver agreement. Results: Diagnostic accuracy varied significantly among LLMs (p = 0.001). GPT-4.o achieved the highest accuracy (80.0%), followed by Claude 3.5 Sonnet and Grok3 (70.0%), DeepSeek R1 (65.0%), GPT-3.5 (60.0%), GPT-4.o Mini (55.0%), and Gemini 1.5 Flash (30.0%). Human experts averaged 92.5% accuracy, outperforming all LLMs (p < 0.001, Cohen's d = -1.314). GPT-4.o showed no significant difference from human consensus (p = 0.250, κ = 0.348), while Claude and Grok3 showed fair agreement (κ = 0.219). DeepSeek R1 also performed reasonably (κ = 0.178), although not significantly. Conclusions: Among the evaluated LLMs, GPT-4.o, Claude 3.5 Sonnet, Grok3, and DeepSeek R1 demonstrated promising diagnostic accuracy, with GPT-4.o most closely matching human performance. However, performance remained inconsistent, especially in complex cases. LLMs may offer value as diagnostic support tools, but human expertise remains indispensable for clinical decision-making.
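Cohen's kappa, the agreement statistic used throughout this comparison, corrects raw diagnostic agreement for the agreement expected by chance. The sketch below computes it for categorical diagnoses; the label sequences are invented stand-ins, not the 20 EyeRounds cases.

```python
# Cohen's kappa for categorical diagnostic agreement.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of the product of marginal frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

llm       = ["keratoconus", "hsv keratitis", "fuchs", "keratoconus", "ulcer"]
consensus = ["keratoconus", "hsv keratitis", "ulcer", "keratoconus", "ulcer"]
print(f"kappa = {cohens_kappa(llm, consensus):.2f}")  # 0.72 on this toy data
```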
Collapse
Affiliation(s)
- Cheng Jiao
- Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Erik Rosas
- Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Hassan Asadigandomani
- Department of Ophthalmology, University of California San Francisco, San Francisco, CA 94143, USA
| | - Mohammad Delsoz
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
| | - Yeganeh Madadi
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
| | - Hina Raja
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
| | - Wuqaas M Munir
- Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Brendan Tamm
- Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Shiva Mehravaran
- Department of Biology, School of Computer, Mathematical, and Natural Sciences, Morgan State University, Baltimore, MD 21251, USA
| | - Ali R Djalilian
- Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL 60612, USA
| | - Siamak Yousefi
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN 38136, USA
| | - Mohammad Soleimani
- Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| |
Collapse
|
35
|
Chen D, Chauhan K, Parsa R, Liu ZA, Liu FF, Mak E, Eng L, Hannon BL, Croke J, Hope A, Fallah-Rad N, Wong P, Raman S. Patient perceptions of empathy in physician and artificial intelligence chatbot responses to patient questions about cancer. NPJ Digit Med 2025; 8:275. [PMID: 40360673 PMCID: PMC12075825 DOI: 10.1038/s41746-025-01671-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2024] [Accepted: 04/24/2025] [Indexed: 05/15/2025] Open
Abstract
Artificial intelligence chatbots can draft empathetic responses to cancer questions, but how patients perceive chatbot empathy remains unclear. Here, we found that people with cancer rated chatbot responses as more empathetic than physician responses. However, differences between patient and physician perceptions of empathy highlight the need for further research to tailor clinical messaging to better meet patient needs. Chatbots may be effective in generating empathetic template responses to patient questions under clinician oversight.
Collapse
Affiliation(s)
- David Chen
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
| | - Kabir Chauhan
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
| | - Rod Parsa
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
| | - Zhihui Amy Liu
- Department of Biostatistics, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
| | - Fei-Fei Liu
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Ernie Mak
- Department of Supportive Care, University Health Network, Toronto, ON, Canada
- Department of Family & Community Medicine, University of Toronto, Toronto, ON, Canada
| | - Lawson Eng
- Division of Medical Oncology and Hematology, Department of Medicine, Princess Margaret Cancer Centre/University Health Network Toronto, Toronto, ON, Canada
- Division of Medical Oncology, Department of Medicine, University of Toronto, Toronto, ON, Canada
| | - Breffni Louise Hannon
- Department of Supportive Care, University Health Network, Toronto, ON, Canada
- Department of Medicine, University of Toronto, Toronto, ON, Canada
| | - Jennifer Croke
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Andrew Hope
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Nazanin Fallah-Rad
- Division of Medical Oncology and Hematology, Department of Medicine, Princess Margaret Cancer Centre/University Health Network Toronto, Toronto, ON, Canada
| | - Phillip Wong
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Srinivas Raman
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada.
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada.
| |
Collapse
|
36
|
Shi B, Chen L, Pang S, Wang Y, Wang S, Li F, Zhao W, Guo P, Zhang L, Fan C, Zou Y, Wu X. Large Language Models and Artificial Neural Networks for Assessing 1-Year Mortality in Patients With Myocardial Infarction: Analysis From the Medical Information Mart for Intensive Care IV (MIMIC-IV) Database. J Med Internet Res 2025; 27:e67253. [PMID: 40354652 DOI: 10.2196/67253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2024] [Revised: 04/01/2025] [Accepted: 04/17/2025] [Indexed: 05/14/2025] Open
Abstract
BACKGROUND Accurate mortality risk prediction is crucial for effective cardiovascular risk management. Recent advancements in artificial intelligence (AI) have demonstrated potential in this field. Qwen-2 and Llama-3 are high-performance, open-source large language models (LLMs) available online. An artificial neural network (ANN) algorithm derived from the SWEDEHEART (Swedish Web System for Enhancement and Development of Evidence-Based Care in Heart Disease Evaluated According to Recommended Therapies) registry, termed SWEDEHEART-AI, can predict patient prognosis following acute myocardial infarction (AMI). OBJECTIVE This study aims to evaluate the 3 models mentioned above in predicting 1-year all-cause mortality in critically ill patients with AMI. METHODS The Medical Information Mart for Intensive Care IV (MIMIC-IV) database is a publicly available data set in critical care medicine. We included 2758 patients who were first admitted for AMI and discharged alive. SWEDEHEART-AI calculated the mortality rate based on each patient's 21 clinical variables. Qwen-2 and Llama-3 analyzed the content of patients' discharge records and directly provided a value between 0 and 1 (to 1 decimal place) representing the 1-year death risk probability. The patients' actual mortality was verified using follow-up data. The predictive performance of the 3 models was assessed and compared using the Harrell C-statistic (C-index), the area under the receiver operating characteristic curve (AUROC), calibration plots, Kaplan-Meier curves, and decision curve analysis. RESULTS SWEDEHEART-AI demonstrated strong discrimination in predicting 1-year all-cause mortality in patients with AMI, with a higher C-index than Qwen-2 and Llama-3 (C-index 0.72, 95% CI 0.69-0.74 vs C-index 0.65, 95% CI 0.62-0.67 vs C-index 0.56, 95% CI 0.53-0.58, respectively; P<.001 for both comparisons). SWEDEHEART-AI also showed high and consistent AUROC in the time-dependent ROC curve. The death rates calculated by SWEDEHEART-AI were positively correlated with actual mortality, and the 3 risk classes derived from this model showed clear differentiation in the Kaplan-Meier curve (P<.001). Calibration plots indicated that SWEDEHEART-AI tended to overestimate mortality risk, with an observed-to-expected ratio of 0.478. Compared with the LLMs, SWEDEHEART-AI demonstrated positive and greater net benefits at risk thresholds below 19%. CONCLUSIONS SWEDEHEART-AI, a trained ANN model, demonstrated the best performance, with strong discrimination and clinical utility in predicting 1-year all-cause mortality in patients with AMI from an intensive care cohort. Among the LLMs, Qwen-2 outperformed Llama-3 and showed moderate predictive value. Qwen-2 and SWEDEHEART-AI exhibited comparable classification effectiveness. The future integration of LLMs into clinical decision support systems holds promise for accurate risk stratification in patients with AMI; however, further research is needed to optimize LLM performance and address calibration issues across diverse patient populations.
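For the discrimination metric named above, a minimal sketch of computing Harrell's C-index with the lifelines package follows; the follow-up times and predicted risks are hypothetical stand-ins, not the MIMIC-IV cohort or the models' outputs.

```python
# Minimal sketch: Harrell's C-index for a risk model on censored survival data.
import numpy as np
from lifelines.utils import concordance_index

follow_up_days = np.array([365, 120, 365, 40, 365, 300])  # time to death or censoring
died = np.array([0, 1, 0, 1, 0, 1])                # 1 = death observed within 1 year
risk = np.array([0.1, 0.8, 0.2, 0.9, 0.3, 0.6])    # model-predicted death probability

# concordance_index treats higher scores as predicting longer survival,
# so negate the predicted risk before passing it in.
c_index = concordance_index(follow_up_days, -risk, event_observed=died)
print(f"C-index: {c_index:.2f}")
```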
Collapse
Affiliation(s)
- Boqun Shi
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Liangguo Chen
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Shuo Pang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Yue Wang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Shen Wang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Fadong Li
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Wenxin Zhao
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Pengrong Guo
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Leli Zhang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Chu Fan
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Yi Zou
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Xiaofan Wu
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| |
Collapse
|
37
|
Luo Y, Jiao M, Fotedar N, Ding JE, Karakis I, Rao VR, Asmar M, Xian X, Aboud O, Wen Y, Lin JJ, Hung FM, Sun H, Rosenow F, Liu F. Clinical Value of ChatGPT for Epilepsy Presurgical Decision-Making: Systematic Evaluation of Seizure Semiology Interpretation. J Med Internet Res 2025; 27:e69173. [PMID: 40354107 DOI: 10.2196/69173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2024] [Revised: 02/03/2025] [Accepted: 03/10/2025] [Indexed: 05/14/2025] Open
Abstract
BACKGROUND For patients with drug-resistant focal epilepsy, surgical resection of the epileptogenic zone (EZ) is an effective treatment to control seizures. Accurate localization of the EZ is crucial and is typically achieved through comprehensive presurgical approaches such as seizure semiology interpretation, electroencephalography (EEG), magnetic resonance imaging (MRI), and intracranial EEG (iEEG). However, interpreting seizure semiology is challenging because it relies heavily on expert knowledge. The semiologies are often inconsistent and incoherent, leading to variability and potential limitations in presurgical evaluation. To overcome these challenges, advanced technologies such as large language models (LLMs), with ChatGPT as a notable example, offer valuable tools for analyzing complex textual information, making them well suited to interpret detailed seizure semiology descriptions and accurately localize the EZ. OBJECTIVE This study evaluates the clinical value of ChatGPT for interpreting seizure semiology to localize EZs in presurgical assessments for patients with focal epilepsy and compares its performance with that of epileptologists. METHODS We compiled 2 data cohorts: a publicly sourced cohort of 852 semiology-EZ pairs from 193 peer-reviewed journal publications and a private cohort of 184 semiology-EZ pairs collected from Far Eastern Memorial Hospital (FEMH) in Taiwan. ChatGPT was evaluated to predict the most likely EZ locations using 2 prompt methods: zero-shot prompting (ZSP) and few-shot prompting (FSP). To compare the performance of ChatGPT, 8 epileptologists were recruited to participate in an online survey to interpret 100 randomly selected semiology records. The responses from ChatGPT and the epileptologists were compared using 3 metrics: regional sensitivity (RSens), weighted sensitivity (WSens), and net positive inference rate (NPIR). RESULTS In the publicly sourced cohort, ChatGPT achieved RSens of 80% to 90% for the frontal and temporal lobes; 20% to 40% for the parietal lobe, occipital lobe, and insular cortex; and only 3% for the cingulate cortex. The WSens, which accounts for biased data distribution, consistently exceeded 67%, while the mean NPIR remained around 0. Evaluation results on the private FEMH cohort were consistent with those from the publicly sourced cohort. A group t test with 1000 bootstrap samples revealed that ChatGPT-4 significantly outperformed the epileptologists in RSens for the most frequently implicated EZs, such as the frontal and temporal lobes (P<.001). Additionally, ChatGPT-4 demonstrated superior overall performance in WSens (P<.001). However, no significant differences were observed between ChatGPT and the epileptologists in NPIR, highlighting comparable performance in this metric. CONCLUSIONS ChatGPT demonstrated clinical value as a tool to assist decision-making during epilepsy preoperative workups. With ongoing advancements in LLMs, their reliability and accuracy are anticipated to improve.
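Under one plausible reading of the metrics named above (RSens as per-region recall, WSens as a frequency-weighted average of regional recall), a minimal sketch follows; the region labels and predictions are hypothetical, and the exact definitions may differ from the paper's.

```python
# Minimal sketch: per-region sensitivity (RSens) and a frequency-weighted
# aggregate (WSens) over hypothetical ground-truth and predicted EZ regions.
from collections import Counter

truth = ["frontal", "temporal", "temporal", "parietal", "frontal", "insular"]
predicted = ["frontal", "temporal", "occipital", "parietal", "temporal", "insular"]

region_counts = Counter(truth)
hits = Counter(t for t, p in zip(truth, predicted) if t == p)

rsens = {region: hits[region] / n for region, n in region_counts.items()}
wsens = sum(rsens[r] * n for r, n in region_counts.items()) / len(truth)

print(rsens)               # per-region sensitivity (RSens)
print(f"WSens = {wsens:.2f}")
```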
Collapse
Affiliation(s)
- Yaxi Luo
- Department of Computer Science, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Meng Jiao
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Neel Fotedar
- School of Medicine, Case Western Reserve University, Cleveland, OH, United States
- Department of Neurology, University Hospitals Cleveland Medical Center, Cleveland, OH, United States
| | - Jun-En Ding
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Ioannis Karakis
- Department of Neurology, School of Medicine, Emory University, Atlanta, GA, United States
- Department of Neurology, School of Medicine, University of Crete, Heraklion, Greece
| | - Vikram R Rao
- Department of Neurology and Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
| | - Melissa Asmar
- Department of Neurology, University of California, Davis, Davis, CA, United States
| | - Xiaochen Xian
- H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Orwa Aboud
- Department of Neurology and Neurological Surgery, University of California, Davis, Davis, CA, United States
| | - Yuxin Wen
- Fowler School of Engineering, Chapman University, Orange, CA, United States
| | - Jack J Lin
- Department of Neurology, University of California, Davis, Davis, CA, United States
| | - Fang-Ming Hung
- Center of Artificial Intelligence, Far Eastern Memorial Hospital, New Taipei City, Taiwan
- Surgical Trauma Intensive Care Unit, Far Eastern Memorial Hospital, New Taipei City, Taiwan
| | - Hai Sun
- Department of Neurosurgery, Rutgers Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, United States
| | - Felix Rosenow
- Department of Neurology, Epilepsy Center Frankfurt Rhine-Main, Goethe University Frankfurt, Frankfurt am Main, Germany
| | - Feng Liu
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
- Semcer Center for Healthcare Innovation, Stevens Institute of Technology, Hoboken, NJ, United States
| |
Collapse
|
38
|
Wei W, Shao J, Lyu RQ, Hemono R, Ma X, Giorgio J, Zheng Z, Ji F, Zhang X, Katabaro E, Mlowe M, Sabasaba A, Lister C, Shabani S, Njau P, McCoy SI, Wang J. Enhanced Language Models for Predicting and Understanding HIV Care Disengagement: A Case Study in Tanzania. RESEARCH SQUARE 2025:rs.3.rs-6608559. [PMID: 40386417 PMCID: PMC12083686 DOI: 10.21203/rs.3.rs-6608559/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/26/2025]
Abstract
Summary Sustained engagement in HIV care and adherence to antiretroviral therapy (ART) are essential for achieving the UNAIDS "95-95-95" targets. Despite increased ART access, disengagement from care remains a significant issue, particularly in sub-Saharan Africa. Traditional machine learning (ML) models have shown moderate success in predicting care disengagement, which would enable early intervention. We developed an enhanced large language model (LLM) fine-tuned with electronic medical records (EMRs) to predict people at risk of disengaging from HIV care in Tanzania and to provide interpretative insights into modifiable risk factors. Methods We developed a novel AI model by enhancing a pre-trained LLM (LLaMA 3.1, an open-source LLM released by Meta) using routinely collected EMRs from Tanzania's National HIV Care and Treatment Program from January 1, 2018, to June 30, 2023 (4,809,765 records for 261,192 people) to identify people at risk of disengaging from HIV care or developing adverse outcomes. Outcomes included risk of ART non-adherence, non-suppressed viral load, and loss to follow-up. Models were evaluated internally (Kagera region) and externally (Geita region), with performance compared against state-of-the-art ML models and zero-shot LLMs. Additionally, a team of HIV physicians in Tanzania assessed the LLM's predictions, along with the LLM-provided justifications, for a subset of patient records to evaluate their clinical relevance and reasoning. Findings The enhanced LLMs consistently outperformed the supervised ML model and zero-shot LLMs across all outcomes in both internal and external validation datasets. When focusing on the 25% of people living with HIV (PLHIV) predicted as most likely to be lost to follow-up (LTFU), the model correctly identified 78% (2,515 of 3,224) of PLHIV genuinely at risk in internal validation and 73% (7,105 of 9,733) in external validation. Attention score analysis indicated that the enhanced LLM focused on keywords such as gaps in follow-up care and ART adherence. The human expert evaluation showed alignment between clinician assessments and the LLM's predictions in 65% of cases, with experts finding the model's justifications reasonable and clinically relevant in 92.3% of aligned cases. Interpretation If implemented in HIV clinics, this LLM-based AI model could help allocate resources efficiently and deliver targeted interventions, improving retention in care and advancing the UNAIDS "95-95-95" targets. By functioning like a clinician (analyzing patient summaries, predicting risks, and offering reasoning), the enhanced LLM could be integrated into clinical workflows to complement human expertise, facilitating timely interventions and informed decision-making. If implemented widely, this human-AI collaboration has the potential to improve health outcomes for people living with HIV and reduce onward transmission. Funding The study was supported by a grant from the US National Institutes of Health (NIH): NIH NIMH 1R01MH125746.
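The "top 25% predicted risk" evaluation described above can be sketched as follows: take the top quartile of model-predicted risk and measure what share of patients truly lost to follow-up it captures. All numbers below are synthetic toy data, not the study's cohort.

```python
# Minimal sketch: capture rate of true LTFU cases within the top-25% of
# model-predicted risk scores (synthetic data for illustration only).
import numpy as np

rng = np.random.default_rng(0)
risk_scores = rng.random(1000)                           # model-predicted LTFU risk
true_ltfu = rng.random(1000) < 0.2 * (1 + risk_scores)   # toy ground truth

cutoff = np.quantile(risk_scores, 0.75)                  # top-25% threshold
flagged = risk_scores >= cutoff
capture_rate = true_ltfu[flagged].sum() / true_ltfu.sum()
print(f"Share of true LTFU captured in top quartile: {capture_rate:.1%}")
```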
Collapse
Affiliation(s)
| | | | | | | | | | | | | | | | | | | | - Matilda Mlowe
- Health for a Prosperous Nation, Dar es Salaam, Tanzania
| | - Amon Sabasaba
- Health for a Prosperous Nation, Dar es Salaam, Tanzania
| | | | | | | | | | | |
Collapse
|
39
|
Deng L, Wu Y, Ren Y, Lu H. Autonomous Self-Evolving Research on Biomedical Data: The DREAM Paradigm. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025:e2417066. [PMID: 40344513 DOI: 10.1002/advs.202417066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/17/2024] [Revised: 04/12/2025] [Indexed: 05/11/2025]
Abstract
In contemporary biomedical research, the efficiency of data-driven methodologies is constrained by large data volumes, the complexity of tool selection, and limited human resources. To address these challenges, a Data-dRiven self-Evolving Autonomous systeM (DREAM) is developed as the first fully autonomous biomedical research system capable of independently conducting scientific investigations without human intervention. DREAM autonomously formulates and evolves scientific questions, configures computational environments, and performs result evaluation and validation. Unlike existing semi-autonomous systems, DREAM operates without manual intervention and is validated in real-world biomedical scenarios. It exceeds the average performance of top scientists in question generation, achieves a higher success rate in environment configuration than experienced human researchers, and uncovers novel scientific findings. In the context of the Framingham Heart Study, it demonstrated an efficiency that is over 10,000 times greater than that of average scientists. As a fully autonomous, self-evolving system, DREAM offers a robust and efficient solution for accelerating biomedical discovery and advancing other data-driven scientific disciplines.
Collapse
Affiliation(s)
- Luojia Deng
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yijie Wu
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yongyong Ren
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Hui Lu
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| |
Collapse
|
40
|
Tomita K, Nishida T, Kitaguchi Y, Kitazawa K, Miyake M. Image Recognition Performance of GPT-4V(ision) and GPT-4o in Ophthalmology: Use of Images in Clinical Questions. Clin Ophthalmol 2025; 19:1557-1564. [PMID: 40357454 PMCID: PMC12068282 DOI: 10.2147/opth.s494480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2024] [Accepted: 04/09/2025] [Indexed: 05/15/2025] Open
Abstract
Purpose To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4, GPT-4 with Vision (GPT-4V), and GPT-4o for clinical questions in ophthalmology. Patients and Methods The questions were collected from the "Diagnose This" section on the American Academy of Ophthalmology website. We tested 580 questions and presented ChatGPT with the same questions under two conditions: 1) a multimodal model, incorporating both the question text and associated images, and 2) a text-only model. We then used McNemar tests to compare accuracy among the multimodal (GPT-4o and GPT-4V) and text-only (GPT-4V) models. The percentage of correct answers among the website's general respondents was also collected. Results Multimodal GPT-4o achieved the highest accuracy (77.1%), followed by multimodal GPT-4V (71.0%) and text-only GPT-4V (68.7%) (P values <0.001, 0.012, and 0.001 for the pairwise comparisons, respectively). All GPT-4 models showed higher accuracy than the website's general respondents (64.6%). Conclusion The addition of information from images enhances the performance of GPT-4V in diagnosing clinical questions in ophthalmology. This suggests that integrating multimodal data could be crucial in developing more effective and reliable diagnostic tools in medical fields.
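Because the same 580 questions were answered under both conditions, a paired McNemar test is the natural comparison. A minimal sketch follows; the 2x2 counts are hypothetical (they sum to 580 but are not the study's data).

```python
# Minimal sketch: McNemar's test on paired correct/incorrect outcomes from
# two model configurations answering the same question set.
from statsmodels.stats.contingency_tables import mcnemar

# rows: multimodal correct / incorrect; columns: text-only correct / incorrect
table = [[370, 77],   # both correct | only multimodal correct
         [29, 104]]   # only text-only correct | both incorrect
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```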
Collapse
Affiliation(s)
- Kosei Tomita
- Department of Ophthalmology, Kawasaki Medical School, Okayama, Japan
| | - Takashi Nishida
- Hamilton Glaucoma Center, Shiley Eye Institute, Viterbi Family Department of Ophthalmology, University of California, San Diego, La Jolla, CA, USA
| | - Yoshiyuki Kitaguchi
- Department of Ophthalmology, Osaka University Graduate School of Medicine, Osaka, Japan
| | - Koji Kitazawa
- Department of Ophthalmology, Kyoto Prefectural University of Medicine, Kyoto, Japan
| | - Masahiro Miyake
- Department of Ophthalmology and Visual Sciences, Kyoto University Graduate School of Medicine, Kyoto, Japan
| |
Collapse
|
41
|
Liu C, Zhang H, Zheng Z, Liu W, Gu C, Lan Q, Zhang W, Yang J. ChatOCT: Embedded Clinical Decision Support Systems for Optical Coherence Tomography in Offline and Resource-Limited Settings. J Med Syst 2025; 49:59. [PMID: 40332685 DOI: 10.1007/s10916-025-02188-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2025] [Accepted: 04/23/2025] [Indexed: 05/08/2025]
Abstract
Optical Coherence Tomography (OCT) is a critical imaging modality for diagnosing ocular and systemic conditions, yet its accessibility is hindered by the need for specialized expertise and high computational demands. To address these challenges, we introduce ChatOCT, an offline-capable, domain-adaptive clinical decision support system (CDSS) that integrates structured expert Q&A generation, OCT-specific knowledge injection, and activation-aware model compression. Unlike existing systems, ChatOCT functions without internet access, making it suitable for low-resource environments. ChatOCT is built upon LLaMA-2-7B, incorporating domain-specific knowledge from PubMed and OCT News through a two-stage training process: (1) knowledge injection for OCT-specific expertise and (2) Q&A instruction tuning for structured, interactive diagnostic reasoning. To ensure feasibility in offline environments, we apply activation-aware weight quantization, reducing GPU memory usage to ~4.74 GB, enabling deployment on standard OCT hardware. A novel expert answer generation framework mitigates hallucinations by structuring responses in a multi-step process, ensuring accuracy and interpretability. ChatOCT outperforms state-of-the-art baselines such as LLaMA-2, PMC-LLaMA-13B, and ChatDoctor by 10-15 points in coherence, relevance, and clinical utility, while reducing GPU memory requirements by 79% and maintaining real-time responsiveness (~20 ms inference time). Expert ophthalmologists rated ChatOCT's outputs as clinically actionable and aligned with real-world decision-making needs, confirming its potential to assist frontline healthcare providers. ChatOCT is an offline clinical decision support system for OCT that runs entirely on local embedded hardware, enabling real-time analysis in resource-limited settings without internet connectivity. By offering a scalable, generalizable pipeline that integrates knowledge injection, instruction tuning, and model compression, ChatOCT provides a blueprint for next-generation, resource-efficient clinical AI solutions across multiple medical domains.
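The ~4.74 GB figure can be sanity-checked with simple arithmetic: a 7B-class model at 4 bits per weight, plus runtime overhead for activations and the KV cache. The parameter count and the overhead allowance below are rough assumptions, not values from the paper.

```python
# Back-of-the-envelope memory estimate for a 4-bit quantized 7B-class model.
params = 6.74e9            # approximate LLaMA-2-7B parameter count (assumption)
bits_per_weight = 4        # activation-aware 4-bit quantization
weight_gb = params * bits_per_weight / 8 / 1024**3
print(f"Quantized weights: ~{weight_gb:.2f} GB")                     # ~3.1 GB
print(f"With ~1.5 GB runtime overhead: ~{weight_gb + 1.5:.2f} GB")   # near the reported ~4.74 GB
```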
Collapse
Affiliation(s)
- Chang Liu
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Haoran Zhang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Zheng Zheng
- Department of Ophthalmology, Shanghai General Hospital, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
| | - Wenjia Liu
- Department of Ophthalmology, Shanghai General Hospital, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
| | - Chengfu Gu
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Qi Lan
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Weiyi Zhang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Jianlong Yang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China.
| |
Collapse
|
42
|
Wan F, Wang T, Wang K, Si Y, Fondrevelle J, Du S, Duclos A. Surgery scheduling based on large language models. Artif Intell Med 2025; 166:103151. [PMID: 40349664 DOI: 10.1016/j.artmed.2025.103151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2024] [Revised: 01/13/2025] [Accepted: 05/01/2025] [Indexed: 05/14/2025]
Abstract
Large Language Models (LLMs) have shown remarkable potential in various fields. This study explores their application to a multi-objective combinatorial optimization problem: surgery scheduling. Traditional multi-objective optimization algorithms, such as the Non-dominated Sorting Genetic Algorithm II (NSGA-II), often require domain expertise for designing precise operators. Here, we propose LLM-NSGA, where LLMs act as evolutionary optimizers, performing selection, crossover, and mutation operations. Results show that for 40 cases, LLMs can independently generate high-quality solutions from prompts. As problem size increases, LLM-NSGA outperformed traditional approaches such as NSGA-II and MOEA/D, achieving average improvements of 5.39%, 80%, and 0.42% across the three objectives. While LLM-NSGA provided results similar to EoH, another LLM-based method, it outperformed EoH in overall resource allocation. Additionally, we applied LLMs to hyperparameter optimization, comparing them with Bayesian Optimization and Ant Colony Optimization (ACO). LLMs reduced runtime by an average of 23.68%, and their generated parameters, validated with NSGA-II, produced better surgery scheduling solutions. This demonstrates that LLMs can not only help traditional algorithms find better solutions but also optimize their parameters efficiently.
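The core idea, an LLM standing in for hand-designed evolutionary operators, can be sketched as below. The llm_propose() function is a hypothetical stand-in: in LLM-NSGA the prompt would carry the parent schedules and objectives and the reply would be parsed into a child; here it is faked with a permutation-preserving crossover plus a swap mutation so the loop runs end to end.

```python
# Minimal sketch: an evolutionary loop where an (here, mocked) LLM call
# proposes offspring schedules instead of a hand-designed operator.
import random

def llm_propose(parent_a, parent_b):
    """Hypothetical LLM call. A real implementation would prompt the model
    with both parents and the objectives, then parse the reply. This mock
    does a one-point order crossover plus a swap mutation instead."""
    cut = random.randrange(1, len(parent_a))
    prefix = parent_a[:cut]
    child = prefix + [s for s in parent_b if s not in prefix]
    i, j = random.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def fitness(schedule):
    # Placeholder single objective: prefer low-index surgeries earlier.
    return -sum(pos * s for pos, s in enumerate(schedule))

population = [random.sample(range(10), 10) for _ in range(20)]  # 10 surgeries
for _ in range(50):
    a, b = random.sample(population, 2)          # parent selection
    population.append(llm_propose(a, b))         # LLM-proposed offspring
    population.sort(key=fitness, reverse=True)
    population = population[:20]                 # elitist survivor selection
print(population[0])
```

A real LLM-NSGA would of course keep NSGA-II's non-dominated sorting over all objectives rather than this single scalar fitness.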
Collapse
Affiliation(s)
- Fang Wan
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France.
| | - Tao Wang
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France
| | - Kezhi Wang
- Department of Computer Science, Brunel University London, UK
| | | | - Julien Fondrevelle
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France
| | - Shuimiao Du
- Sino-European School of Shanghai University, China
| | - Antoine Duclos
- Research on Healthcare Performance RESHAPE, Université Claude Bernard, Lyon 1, France
| |
Collapse
|
43
|
Rosenthal JT, Beecy A, Sabuncu MR. Rethinking clinical trials for medical AI with dynamic deployments of adaptive systems. NPJ Digit Med 2025; 8:252. [PMID: 40328886 PMCID: PMC12056174 DOI: 10.1038/s41746-025-01674-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2024] [Accepted: 04/24/2025] [Indexed: 05/08/2025] Open
Abstract
There is a growing recognition of the need for clinical trials to safely and effectively deploy artificial intelligence (AI) in clinical settings. We introduce dynamic deployment as a framework for AI clinical trials tailored to the dynamic nature of large language models, making possible complex medical AI systems that continuously learn and adapt in situ from new data and interactions with users while enabling continuous real-time monitoring and clinical validation.
Collapse
Affiliation(s)
- Jacob T Rosenthal
- Tri-Institutional MD-PhD program of Weill Cornell/Rockefeller/Sloan Kettering, New York, NY, USA.
- Department of Radiology, Weill Cornell Medicine, New York, NY, USA.
| | - Ashley Beecy
- Division of Cardiology, Department of Medicine, Weill Cornell Medicine and NewYork-Presbyterian, New York, NY, USA
| | - Mert R Sabuncu
- Department of Radiology, Weill Cornell Medicine, New York, NY, USA
- School of Electrical and Computer Engineering, Cornell Tech and Cornell University, New York, NY, USA
| |
Collapse
|
44
|
Othman AA, Flaharty KA, Ledgister Hanchard SE, Hu P, Duong D, Waikel RL, Solomon BD. Assessing large language model performance related to aging in genetic conditions. NPJ AGING 2025; 11:33. [PMID: 40319013 PMCID: PMC12049513 DOI: 10.1038/s41514-025-00226-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 04/22/2025] [Indexed: 05/07/2025]
Abstract
Most genetic conditions are described in pediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues, and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorized based on age-related characteristics). Results showed that LLMs provided appropriate age-based responses in both child and adult outputs based on Correctness and Completeness scores graded by clinicians. Sub-analysis of metabolic conditions, including those that typically present neonatally with metabolic crisis, also showed age-appropriate LLM responses. However, 70b and GPT obtained lower Correctness and Completeness scores when producing management plans (55-66% for 70b and a wider range, 50-90%, for GPT). This suggests that LLMs still have some limitations in clinical applications.
Collapse
Affiliation(s)
- Amna A Othman
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| | - Kendall A Flaharty
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Suzanna E Ledgister Hanchard
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Ping Hu
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Dat Duong
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Rebekah L Waikel
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Benjamin D Solomon
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| |
Collapse
|
45
|
Bitterman J, D'Angelo A, Holachek A, Eubanks JE. Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions. PM R 2025. [PMID: 40318209 DOI: 10.1002/pmrj.13386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2024] [Revised: 01/16/2025] [Accepted: 02/17/2025] [Indexed: 05/07/2025]
Abstract
BACKGROUND There have been significant advances in machine learning and artificial intelligence technology over the past few years, leading to the release of large language models (LLMs) such as ChatGPT. There are many potential applications for LLMs in health care, but it is critical to first determine how accurate LLMs are before putting them into practice. No studies have evaluated the accuracy and precision of LLMs in responding to questions related to the field of physical medicine and rehabilitation (PM&R). OBJECTIVE To determine the accuracy and precision of two OpenAI LLMs (GPT-3.5, released in November 2022, and GPT-4o, released in May 2024) in answering questions related to PM&R knowledge. DESIGN Cross-sectional study. Both LLMs were tested on the same 744 PM&R knowledge questions that covered all aspects of the field (general rehabilitation, stroke, traumatic brain injury, spinal cord injury, musculoskeletal medicine, pain medicine, electrodiagnostic medicine, pediatric rehabilitation, prosthetics and orthotics, rheumatology, and pharmacology). Each LLM was tested three times on the same question set to assess for precision. SETTING N/A. PATIENTS N/A. INTERVENTIONS N/A. MAIN OUTCOME MEASURE Percentage of correctly answered questions. RESULTS For three runs of the 744-question set, GPT-3.5 answered 56.3%, 56.5%, and 56.9% of the questions correctly. For three runs of the same question set, GPT-4o answered 83.6%, 84%, and 84.1% of the questions correctly. GPT-4o outperformed GPT-3.5 in all subcategories of PM&R questions. CONCLUSIONS LLM technology is rapidly advancing, with the more recent GPT-4o model performing much better on PM&R knowledge questions compared to GPT-3.5. There is potential for LLMs in augmenting clinical practice, medical training, and patient education. However, the technology has limitations and physicians should remain cautious in using it in practice at this time.
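One simple way to compare the two models' accuracy on the shared 744-question set is a two-proportion z-test; the counts below are reconstructed from the run-1 percentages above, and a paired test such as McNemar's would be stricter since both models answered the same items.

```python
# Minimal sketch: two-proportion z-test on GPT-3.5 vs. GPT-4o accuracy
# (counts reconstructed from reported percentages; illustrative only).
from statsmodels.stats.proportion import proportions_ztest

correct = [round(0.563 * 744), round(0.836 * 744)]  # GPT-3.5, GPT-4o
n = [744, 744]
z, p = proportions_ztest(correct, n)
print(f"z = {z:.2f}, p = {p:.2e}")
```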
Collapse
Affiliation(s)
- Jason Bitterman
- Division of Physical Medicine and Rehabilitation, Hartford Healthcare Medical Group, Hartford, Connecticut, USA
| | - Alexander D'Angelo
- Nebraska Medicine Department of Physical Medicine and Rehabilitation, University of Nebraska Medical Center, Omaha, Nebraska, USA
| | | | - James E Eubanks
- Department of Orthopedics and Physical Medicine, Division of Physical Medicine and Rehabilitation, Medical University of South Carolina (MUSC), Charleston, South Carolina, USA
- Department of Physical Medicine and Rehabilitation, University of Pittsburgh Medical Center (UPMC), Pittsburgh, Pennsylvania, USA
| |
Collapse
|
46
|
Noda R, Tanabe K, Ichikawa D, Shibagaki Y. GPT-4's performance in supporting physician decision-making in nephrology multiple-choice questions. Sci Rep 2025; 15:15439. [PMID: 40316716 PMCID: PMC12048615 DOI: 10.1038/s41598-025-99774-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2024] [Accepted: 04/22/2025] [Indexed: 05/04/2025] Open
Abstract
Generative Pre-trained Transformer (GPT)-4, a versatile conversational artificial intelligence, has potential applications in medicine, but its ability to support physicians' decision-making remains unclear. We evaluated GPT-4's performance in assisting physicians with nephrology questions. Forty-five single-answer multiple-choice questions were extracted from the Core Curriculum in Nephrology articles published in the American Journal of Kidney Diseases from October 2021 to June 2023. Eight junior physicians without board certification and ten senior physicians with board certification answered these questions twice: first unaided, then with the opportunity to revise their answers based on GPT-4's outputs. GPT-4 correctly answered 77.8% of the questions. Before using GPT-4, junior physicians had a median (interquartile range) proportion of correct answers of 53.3% (48.3-53.3) and senior physicians 65.6% (60.6-66.7). After GPT-4 support, the median proportion of correct answers significantly increased to 72.2% (68.3-76.1) for juniors and 75.6% (73.3-80.0) for seniors (p = 0.008 and p = 0.004, respectively). The improvement was significantly greater for junior physicians (p = 0.017). However, senior physicians showed a decreased proportion of correct answers in one of the clinical categories. GPT-4 significantly improved physicians' accuracy in nephrology, especially among less experienced physicians, but may have negative impacts in specific subfields. Careful consideration is required when using GPT-4 to support physicians' decision-making.
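The before/after comparison within each physician group is paired, so a Wilcoxon signed-rank test fits the small samples reported above. A minimal sketch with hypothetical per-physician proportions (n = 10, mirroring the senior group) follows.

```python
# Minimal sketch: Wilcoxon signed-rank test on each physician's proportion
# correct, unaided vs. with GPT-4 support (hypothetical values).
from scipy.stats import wilcoxon

unaided = [0.62, 0.60, 0.67, 0.64, 0.66, 0.69, 0.58, 0.66, 0.64, 0.67]
with_gpt4 = [0.76, 0.73, 0.78, 0.73, 0.80, 0.76, 0.71, 0.78, 0.73, 0.76]

stat, p = wilcoxon(unaided, with_gpt4)
print(f"W = {stat}, p = {p:.3f}")
```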
Collapse
Affiliation(s)
- Ryunosuke Noda
- Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-ku, Kawasaki, Kanagawa, 216-8511, Japan.
| | - Kenichiro Tanabe
- Pathophysiology and Bioregulation, St. Marianna University School of Medicine, Kawasaki, Japan
| | - Daisuke Ichikawa
- Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-ku, Kawasaki, Kanagawa, 216-8511, Japan
| | - Yugo Shibagaki
- Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-ku, Kawasaki, Kanagawa, 216-8511, Japan
| |
Collapse
|
47
|
Kim H, Hwang H, Lee J, Park S, Kim D, Lee T, Yoon C, Sohn J, Park J, Reykhart O, Fetherston T, Choi D, Kwak SH, Chen Q, Kang J. Small language models learn enhanced reasoning skills from medical textbooks. NPJ Digit Med 2025; 8:240. [PMID: 40316765 PMCID: PMC12048634 DOI: 10.1038/s41746-025-01653-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2025] [Accepted: 04/19/2025] [Indexed: 05/04/2025] Open
Abstract
Small language models (SLMs) offer promise for medical applications by addressing the privacy and hardware constraints of large language models; however, their limited parameters (often fewer than ten billion) hinder multi-step reasoning for complex medical tasks. This study presents Meerkat, a new family of medical SLMs designed to be lightweight while enhancing reasoning capabilities. We begin by designing an effective and efficient training method. This involves extracting high-quality chain-of-thought reasoning paths from 18 medical textbooks, which are then combined with diverse instruction-following datasets within the medical domain, totaling 441K training examples. Fine-tuning was conducted on open-source SLMs using this curated dataset. Our Meerkat-7B and Meerkat-8B models outperformed their counterparts by 22.3% and 10.6% across six exam datasets, respectively. They also improved scores on the NEJM Case Challenge from 7 to 16 and from 13 to 20, surpassing the human score of 13.7. Additionally, they demonstrated superiority in expert evaluations, excelling in all metrics of reasoning ability: completeness, factuality, clarity, and logical consistency.
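Chain-of-thought training examples like those described above are commonly serialized as instruction/rationale/answer records for fine-tuning. The field names and the clinical example below are assumptions for illustration, not the Meerkat schema.

```python
# Minimal sketch: serializing one chain-of-thought training example as JSONL
# for instruction fine-tuning (hypothetical schema and content).
import json

example = {
    "instruction": "A 54-year-old presents with colicky flank pain radiating "
                   "to the groin and hematuria. What is the most likely diagnosis?",
    "chain_of_thought": "Flank pain with hematuria suggests a urinary tract "
                        "stone; colicky pain radiating to the groin further "
                        "supports nephrolithiasis.",
    "answer": "Nephrolithiasis",
}

with open("cot_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```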
Collapse
Grants
- HR20C0021(3) Ministry of Health & Welfare, Republic of Korea
- IITP-2024-2020-0-0181 Ministry of Science and ICT, South Korea
- NRF-2023R1A2C3004176 National Research Foundation of Korea
Collapse
Affiliation(s)
| | | | - Jiwoo Lee
- Korea University, Seoul, Republic of Korea
| | | | - Dain Kim
- Korea University, Seoul, Republic of Korea
| | | | | | | | | | | | | | | | - Soo Heon Kwak
- Seoul National University Hospital, Seoul, Republic of Korea
| | | | - Jaewoo Kang
- Korea University, Seoul, Republic of Korea.
- AIGEN Sciences, Seoul, Republic of Korea.
| |
Collapse
|
48
|
Ngoc Nguyen O, Amin D, Bennett J, Hetlevik Ø, Malik S, Tout A, Vornhagen H, Vellinga A. GP or ChatGPT? Ability of large language models (LLMs) to support general practitioners when prescribing antibiotics. J Antimicrob Chemother 2025; 80:1324-1330. [PMID: 40079276 PMCID: PMC12046391 DOI: 10.1093/jac/dkaf077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2024] [Accepted: 02/28/2025] [Indexed: 03/15/2025] Open
Abstract
INTRODUCTION Large language models (LLMs) are becoming ubiquitous and widely implemented. LLMs could also be used for diagnosis and treatment. National antibiotic prescribing guidelines are customized and informed by local laboratory data on antimicrobial resistance. METHODS Based on 24 vignettes with information on type of infection, gender, age group and comorbidities, GPs and LLMs were prompted to provide a treatment. Four countries (Ireland, the UK, the USA, and Norway) were included; a GP from each country and six LLMs (ChatGPT, Gemini, Copilot, Mistral AI, Claude and Llama 3.1) were provided with the vignettes, including their location (country). Responses were compared with each country's national prescribing guidelines. In addition, limitations of LLMs such as hallucination, toxicity and data leakage were assessed. RESULTS GPs' answers to the vignettes showed high accuracy in relation to diagnosis (96%-100%) and yes/no antibiotic prescribing (83%-92%). GPs referenced (100%) and prescribed (58%-92%) according to national guidelines, but dose/duration of treatment was less accurate (50%-75%). Overall, the GPs' accuracy had a mean of 74%. LLMs scored high in relation to diagnosis (92%-100%), antibiotic prescribing (88%-100%) and the choice of antibiotic (59%-100%), but correct referencing often failed (38%-96%), in particular for the Norwegian guidelines (0%-13%). Data leakage was shown to be an issue, as personal information was repeated in the models' responses to the vignettes. CONCLUSIONS LLMs may be safe to guide antibiotic prescribing in general practice. However, to interpret vignettes, apply national guidelines, and prescribe the right dose and duration, GPs remain best placed.
Collapse
Affiliation(s)
- Oanh Ngoc Nguyen
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| | - Doaa Amin
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| | - James Bennett
- NIHR In Practice Fellow, Hull York Medical School, University of Hull, Hull HU6 7RX, UK
| | - Øystein Hetlevik
- Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
| | - Sara Malik
- Midleton Medi Center, Midleton, Co Cork, Ireland
| | - Andrew Tout
- Division of General Internal Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
| | - Heike Vornhagen
- CARA Network, Insight Centre for Data Analytics, University of Galway, Galway, Ireland
| | - Akke Vellinga
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| |
Collapse
|
49
|
Hou Z, Liu H, Bian J, He X, Zhuang Y. Enhancing medical coding efficiency through domain-specific fine-tuned large language models. NPJ HEALTH SYSTEMS 2025; 2:14. [PMID: 40321467 PMCID: PMC12045799 DOI: 10.1038/s44401-025-00018-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/01/2025] [Accepted: 04/11/2025] [Indexed: 05/08/2025]
Abstract
Medical coding is essential for healthcare operations yet remains predominantly manual, error-prone (up to 20%), and costly (up to $18.2 billion annually). Although large language models (LLMs) have shown promise in natural language processing, their application to medical coding has produced limited accuracy. In this study, we evaluated whether fine-tuning LLMs with specialized ICD-10 knowledge can automate code generation across clinical documentation. We adopted a two-phase approach: initial fine-tuning using 74,260 ICD-10 code-description pairs, followed by enhanced training to address linguistic and lexical variations. Evaluations using a proprietary model (GPT-4o mini) on a cloud platform and an open-source model (Llama) on local GPUs demonstrated that initial fine-tuning increased exact matching from <1% to 97%, while enhanced fine-tuning further improved performance in complex scenarios, with real-world clinical notes achieving 69.20% exact match and 87.16% category match. These findings indicate that domain-specific fine-tuned LLMs can reduce manual burdens and improve reliability.
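The two metrics quoted above can be computed directly: exact match compares the full ICD-10 code, while category match compares the three-character category (E11.9 and E11.65 share category E11). The codes below are hypothetical examples.

```python
# Minimal sketch: exact-match vs. category-match accuracy for ICD-10 codes.
def category(code: str) -> str:
    # The ICD-10 category is the first three characters of the code.
    return code.replace(".", "")[:3]

gold = ["E11.9", "I10", "J45.909", "N18.3"]
pred = ["E11.65", "I10", "J45.909", "N18.4"]

exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
cat = sum(category(g) == category(p) for g, p in zip(gold, pred)) / len(gold)
print(f"Exact match: {exact:.0%}, category match: {cat:.0%}")
```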
Collapse
Affiliation(s)
- Zhen Hou
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
| | - Hao Liu
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
- School of Computing, College of Science and Mathematics, Montclair State University, Montclair, NJ USA
| | - Jiang Bian
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
- Department of Biostatistics and Health Data Science, School of Medicine, Indiana University, Indianapolis, IN USA
- Regenstrief Institute, Indiana University, Indianapolis, IN USA
- Indiana University Health, Indianapolis, IN USA
| | - Xing He
- Department of Biostatistics and Health Data Science, School of Medicine, Indiana University, Indianapolis, IN USA
- Regenstrief Institute, Indiana University, Indianapolis, IN USA
| | - Yan Zhuang
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
| |
Collapse
|
50
|
Song ES, Lee S. Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions. Int J Dent Hyg 2025; 23:267-276. [PMID: 39415339 PMCID: PMC11982589 DOI: 10.1111/idh.12848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2024] [Revised: 08/29/2024] [Accepted: 09/26/2024] [Indexed: 10/18/2024]
Abstract
INTRODUCTION Large language models such as Gemini, GPT-3.5, and GPT-4 have demonstrated significant potential in the medical field. Their performance in medical licensing examinations globally has highlighted their capabilities in understanding and processing specialized medical knowledge. This study aimed to evaluate and compare the performance of Gemini, GPT-3.5, and GPT-4 in the Korean National Dental Hygienist Examination. Accuracy in answering the examination questions in both Korean and English was assessed. METHODS This study used a dataset comprising questions from the Korean National Dental Hygienist Examination over 5 years (2019-2023). A two-way analysis of variance (ANOVA) was employed to investigate the impact of model type and language on response accuracy. Questions were input into each model under standardized conditions, and responses were classified as correct or incorrect based on predefined criteria. RESULTS GPT-4 consistently outperformed the other models, achieving the highest accuracy rates across both language versions annually. In particular, it showed superior performance in English, suggesting advancements in its training algorithms for language processing. However, all models demonstrated variable accuracies in subjects with localized characteristics, such as health and medical law. CONCLUSIONS These findings indicate that GPT-4 holds significant promise for application in medical education and standardized testing, especially in English. However, the variability in performance across different subjects and languages underscores the need for ongoing improvements and the inclusion of more diverse and localized training datasets to enhance the models' effectiveness in multilingual and multicultural contexts.
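A two-way ANOVA with model type and question language as factors can be sketched as below; the accuracy values are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: two-way ANOVA (model x language) on accuracy values.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "model": ["Gemini", "GPT-3.5", "GPT-4"] * 4,
    "language": ["Korean"] * 6 + ["English"] * 6,
    "accuracy": [0.55, 0.58, 0.75, 0.53, 0.60, 0.78,
                 0.60, 0.63, 0.82, 0.58, 0.65, 0.85],
})

fit = ols("accuracy ~ C(model) + C(language) + C(model):C(language)", df).fit()
print(sm.stats.anova_lm(fit, typ=2))
```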
Collapse
Grants
- 1711196792, RS-2023-00253380 The Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety, KOREA
Collapse
Affiliation(s)
- Eun Sun Song
- Department of Oral Anatomy, Dental Research Institute, School of Dentistry, Seoul National University, Seoul, South Korea
| | - Seung‐Pyo Lee
- Department of Oral Anatomy, Dental Research Institute, School of Dentistry, Seoul National University, Seoul, South Korea
| |
Collapse
|