1
Chew BH, Lai PSM, Sivaratnam DA, Basri NI, Appannah G, Mohd Yusof BN, Thambiah SC, Nor Hanipah Z, Wong PF, Chang LC. Efficient and Effective Diabetes Care in the Era of Digitalization and Hypercompetitive Research Culture: A Focused Review in the Western Pacific Region with Malaysia as a Case Study. Health Syst Reform 2025; 11:2417788. PMID: 39761168. DOI: 10.1080/23288604.2024.2417788.
Abstract
Approximately 220 million adults (a regional prevalence of about 12%) live with diabetes mellitus (DM) and its related complications and morbidity, recognized or not, in the Western Pacific Region (WP). The estimated healthcare cost was 240 billion USD in the WP in 2021 and 1.0 billion USD in Malaysia in 2017, alongside unmeasurable suffering and losses in quality of life and economic productivity. This urgently calls for concerted, prevention-oriented efforts from all stakeholders to invest in transforming healthcare professionals and reforming a healthcare system that prioritizes the primary care setting, empowers allied health professionals, improves health organization for healthcare providers, and strengthens health facilities and non-medical support for people with DM. This article outlines challenges in optimal diabetes care and proposes evidence-based initiatives in a detailed 5-year roadmap toward dynamic, efficient healthcare services that are effective in managing people with DM, using Malaysia as a case study for other countries with similar backgrounds and issues. It also scans the landscape of clinical research in DM, covering the dimensions and spectrum of research misconduct, common biases across the research process, and key preventive strategies, their implementation, and their limitations on the path to high-quality research. Lastly, digital medicine, the potential contributions of artificial intelligence to diabetes care, and open science practices in research are discussed.
Affiliation(s)
- Boon-How Chew
- Department of Family Medicine, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Family Medicine Specialist Clinic, Hospital Sultan Abdul Aziz Shah (HSAAS Teaching Hospital), Persiaran MARDI - UPM, Serdang, Selangor, Malaysia
- Pauline Siew Mei Lai
- Department of Primary Care Medicine, Faculty of Medicine, Universiti Malaya; School of Medical and Life Sciences, Sunway University, Kuala Lumpur, Selangor, Malaysia
- Dhashani A/P Sivaratnam
- Department of Ophthalmology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Nurul Iftida Basri
- Department of Obstetrics and Gynaecology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Geeta Appannah
- Department of Nutrition, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Barakatun Nisak Mohd Yusof
- Department of Dietetics, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Subashini C Thambiah
- Department of Pathology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Zubaidah Nor Hanipah
- Department of Surgery, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Li-Cheng Chang
- Kuang Health Clinic, Pekan Kuang, Gombak, Selangor, Malaysia
2
Baker HP, Aggarwal S, Kalidoss S, Hess M, Haydon R, Strelzow JA. Diagnostic accuracy of ChatGPT-4 in orthopedic oncology: a comparative study with residents. Knee 2025; 55:153-160. PMID: 40311171. DOI: 10.1016/j.knee.2025.04.004.
Abstract
BACKGROUND Artificial intelligence (AI) is increasingly being explored for its potential role in medical diagnostics. ChatGPT-4, a large language model (LLM) with image analysis capabilities, may assist in histopathological interpretation, but its accuracy in musculoskeletal oncology remains untested. This study evaluates ChatGPT-4's diagnostic accuracy in identifying musculoskeletal tumors from histology slides compared to that of orthopedic surgery residents. METHODS A comparative study was conducted using 24 histology slides randomly selected from an orthopedic oncology registry. Five teams of orthopedic surgery residents (PGY-1 to PGY-5) participated in a diagnostic competition, providing their best diagnosis for each slide. ChatGPT-4 was tested separately using identical histology images and clinical vignettes, with two independent attempts. Statistical analyses, including one-way ANOVA and independent t-tests, were performed to compare diagnostic accuracy. RESULTS Orthopedic residents significantly outperformed ChatGPT-4 in diagnosing musculoskeletal tumors. The mean diagnostic accuracy among resident teams was 55%, while ChatGPT-4 achieved 25% on its first attempt and 33% on its second attempt. One-way ANOVA revealed a significant difference in accuracy across groups (F = 8.51, p = 0.033). Independent t-tests confirmed that residents performed significantly better than ChatGPT-4 (t = 5.80, p = 0.0004 for the first attempt; t = 4.25, p = 0.0028 for the second attempt). Both residents and ChatGPT-4 struggled with specific cases, particularly soft tissue sarcomas. CONCLUSIONS ChatGPT-4 demonstrated limited accuracy in interpreting histopathological slides compared to orthopedic residents. While AI holds promise for medical diagnostics, its current capabilities in musculoskeletal oncology remain insufficient for independent clinical use. These findings should be viewed as exploratory rather than confirmatory, and further research with larger, more diverse datasets is needed to assess AI's role in histopathology. Future studies should investigate AI-assisted workflows, refine prompt engineering, and explore AI models specifically trained for histopathological diagnosis.
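As an illustration of the reported analysis, the sketch below runs a one-way ANOVA and independent t-tests with SciPy; the per-slide scores are hypothetical placeholders, not the study's data.

```python
# Hedged sketch of the reported comparison; the per-slide scores below
# (1 = correct diagnosis) are hypothetical placeholders, not study data.
from scipy import stats

residents = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # pooled resident answers
gpt4_try1 = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # ChatGPT-4, attempt 1
gpt4_try2 = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]  # ChatGPT-4, attempt 2

# One-way ANOVA across the three groups, then pairwise independent t-tests.
f_stat, p_anova = stats.f_oneway(residents, gpt4_try1, gpt4_try2)
t1, p1 = stats.ttest_ind(residents, gpt4_try1)
t2, p2 = stats.ttest_ind(residents, gpt4_try2)

print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.3f}")
print(f"Residents vs attempt 1: t={t1:.2f}, p={p1:.4f}")
print(f"Residents vs attempt 2: t={t2:.2f}, p={p2:.4f}")
```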
Affiliation(s)
- Hayden P Baker
- The University of Chicago Department of Orthopaedic Surgery, Chicago, IL 60637, United States
- Sarthak Aggarwal
- The University of Chicago Department of Orthopaedic Surgery, Chicago, IL 60637, United States
- Senthooran Kalidoss
- The University of Chicago Department of Orthopaedic Surgery, Chicago, IL 60637, United States
- Matthew Hess
- The University of Chicago Department of Orthopaedic Surgery, Chicago, IL 60637, United States
- Rex Haydon
- The University of Chicago Department of Orthopaedic Surgery, Chicago, IL 60637, United States
- Jason A Strelzow
- The University of Chicago Department of Orthopaedic Surgery, Chicago, IL 60637, United States
3
Wan F, Wang T, Wang K, Si Y, Fondrevelle J, Du S, Duclos A. Surgery scheduling based on large language models. Artif Intell Med 2025; 166:103151. PMID: 40349664. DOI: 10.1016/j.artmed.2025.103151.
Abstract
Large Language Models (LLMs) have shown remarkable potential in various fields. This study explores their application to a multi-objective combinatorial optimization problem: surgery scheduling. Traditional multi-objective optimization algorithms, such as the Non-dominated Sorting Genetic Algorithm II (NSGA-II), often require domain expertise to design precise operators. Here, we propose LLM-NSGA, where LLMs act as evolutionary optimizers, performing selection, crossover, and mutation operations. Results show that for 40-case instances, LLMs can independently generate high-quality solutions directly from prompts. As problem size increased, LLM-NSGA outperformed traditional approaches such as NSGA-II and MOEA/D, achieving average improvements of 5.39%, 80%, and 0.42% across the three objectives. While LLM-NSGA provided results similar to EoH, another LLM-based method, it outperformed EoH in overall resource allocation. Additionally, we applied LLMs to hyperparameter optimization, comparing them with Bayesian Optimization and Ant Colony Optimization (ACO). LLMs reduced runtime by an average of 23.68%, and their generated parameters, validated with NSGA-II, produced better surgery scheduling solutions. This demonstrates that LLMs can not only help traditional algorithms find better solutions but also optimize their parameters efficiently.
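For readers unfamiliar with the idea, the minimal sketch below shows what "LLM as evolutionary operator" can look like in code. It is our reading of the abstract, not the authors' implementation: the `llm` argument stands in for any chat-completion call, and the stub merely swap-mutates a parent so the example runs offline.

```python
# Hedged sketch of the LLM-as-variation-operator idea (our reading of the
# abstract, not the authors' code). `llm` is any text-in/text-out
# chat-completion function; `stub_llm` swap-mutates a parent so the
# example runs without an API key.
import json
import random

def llm_variation(parents, objectives, llm):
    prompt = (
        "You act as selection, crossover, and mutation for surgery scheduling.\n"
        f"Objectives (minimize): {objectives}\n"
        f"Parents: {json.dumps(parents)}\n"
        "Return ONE child schedule as a JSON list."
    )
    return json.loads(llm(prompt))

def stub_llm(prompt):
    # Stand-in for a real LLM call: parse the parents back out of the
    # prompt and apply a simple swap mutation to the first one.
    parents = json.loads(prompt.split("Parents: ")[1].split("\n")[0])
    child = list(parents[0])
    i, j = random.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return json.dumps(child)

child = llm_variation([[1, 2, 3, 4], [2, 1, 4, 3]],
                      ["makespan", "overtime", "idle time"], stub_llm)
print(child)  # e.g. [1, 4, 3, 2]; would then enter NSGA-II's sorting step
```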
Affiliation(s)
- Fang Wan
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France
- Tao Wang
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France
- Kezhi Wang
- Department of Computer Science, Brunel University London, UK
- Julien Fondrevelle
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France
- Shuimiao Du
- Sino-European School of Shanghai University, China
- Antoine Duclos
- Research on Healthcare Performance RESHAPE, Université Claude Bernard, Lyon 1, France
4
Pushpanathan K, Zou M, Srinivasan S, Wong WM, Mangunkusumo EA, Thomas GN, Lai Y, Sun CH, Lam JSH, Tan MCJ, Lin HAH, Ma W, Koh VTC, Chen DZ, Tham YC. Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries? Ophthalmology Science 2025; 5:100745. PMID: 40291392. PMCID: PMC12022690. DOI: 10.1016/j.xops.2025.100745.
Abstract
OBJECTIVE The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher-quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability. DESIGN Cross-sectional study. SUBJECTS Sixteen queries, previously identified as suboptimally answered by ChatGPT-4 in prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions). METHODS For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses based on correctness, completeness, and readability (on a 5-point scale for each metric). MAIN OUTCOME MEASURES Mean summed scores of each model for correctness, completeness, and readability, rated on a 5-point scale (maximum score: 15). RESULTS o1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopic, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15. CONCLUSIONS While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from that of its predecessor, ChatGPT-4o. Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability. FINANCIAL DISCLOSURES Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
Affiliation(s)
- Krithi Pushpanathan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Minjie Zou
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Sahana Srinivasan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Wendy Meihua Wong
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Erlangga Ariadarma Mangunkusumo
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- George Naveen Thomas
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Yien Lai
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Chen-Hsin Sun
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Janice Sing Harn Lam
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Marcus Chun Jin Tan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Hazel Anne Hui'En Lin
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Weizhi Ma
- Institute for AI Industry Research, Tsinghua University, Beijing, China
- Victor Teck Chang Koh
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- David Ziyou Chen
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Yih-Chung Tham
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
- Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore
5
Zaman A, Yassin MM, Mehmud I, Cao A, Lu J, Hassan H, Kang Y. Challenges, optimization strategies, and future horizons of advanced deep learning approaches for brain lesion segmentation. Methods 2025; 239:140-168. PMID: 40306473. DOI: 10.1016/j.ymeth.2025.04.016.
Abstract
Brain lesion segmentation is a challenging task in medical image analysis that aims to delineate lesion regions precisely. Deep learning (DL) techniques have recently demonstrated promising results across various computer vision tasks, including semantic segmentation, object detection, and image classification. This paper offers an overview of recent DL algorithms for brain tumor and stroke segmentation, drawing on literature from 2021 to 2024. It highlights the strengths, limitations, current research challenges, and unexplored areas in imaging-based brain lesion classification, based on insights from over 250 recent review papers. Techniques addressing difficulties such as class imbalance and multimodal data are presented. Optimization methods for improving performance with respect to computational and structural complexity and processing speed are discussed; these include lightweight neural networks, multilayer architectures, and computationally efficient, highly accurate network designs. The paper also reviews generic and recent frameworks for different brain lesion detection techniques and highlights publicly available benchmark datasets and their issues. Furthermore, open research areas, application prospects, and future directions for DL-based brain lesion classification are discussed. Future directions include integrating neural architecture search methods with domain knowledge, predicting patient survival levels, and learning to separate brain lesions using patient statistics. To ensure patient privacy, future research is anticipated to explore privacy-preserving learning frameworks. Overall, the suggestions presented serve as a guideline for researchers and system designers involved in brain lesion detection and stroke segmentation tasks.
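As background for the class-imbalance techniques mentioned above, the snippet below shows the widely used Dice overlap loss in NumPy; it is a generic illustration, not code from any reviewed paper.

```python
# Generic illustration of the Dice overlap loss commonly used against
# class imbalance in lesion segmentation (not from any reviewed paper).
import numpy as np

def dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """pred: foreground probabilities in [0, 1]; target: binary mask."""
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

pred = np.array([[0.9, 0.1], [0.8, 0.2]])
target = np.array([[1, 0], [1, 0]])
print(f"Dice loss: {dice_loss(pred, target):.4f}")  # near 0 = good overlap
```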
Affiliation(s)
- Asim Zaman
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Medical School, Shenzhen University, Shenzhen 518060, China
- Mazen M Yassin
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China
- Irfan Mehmud
- Department of Urology, The Third Affiliated Hospital of Shenzhen University (Luohu Hospital Group), Shenzhen University, Shenzhen 518000, China; Institute of Urology, South China Hospital, Medicine School, Shenzhen University, Shenzhen 518000, China
- Anbo Cao
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China
- Jiaxi Lu
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China
- Haseeb Hassan
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China
- Yan Kang
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Medical School, Shenzhen University, Shenzhen 518060, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110169, China
6
Solomon TPJ, Laye MJ. The sports nutrition knowledge of large language model (LLM) artificial intelligence (AI) chatbots: An assessment of accuracy, completeness, clarity, quality of evidence, and test-retest reliability. PLoS One 2025; 20:e0325982. PMID: 40512755. PMCID: PMC12165421. DOI: 10.1371/journal.pone.0325982.
Abstract
BACKGROUND Generative artificial intelligence (AI) chatbots are increasingly utilised in various domains, including sports nutrition. Despite their growing popularity, there is limited evidence on the accuracy, completeness, clarity, evidence quality, and test-retest reliability of AI-generated sports nutrition advice. This study evaluates the performance of ChatGPT, Gemini, and Claude's basic and advanced models across these metrics to determine their utility in providing sports nutrition information. MATERIALS AND METHODS Two experiments were conducted. In Experiment 1, chatbots were tested with simple and detailed prompts in two domains: Sports nutrition for training and Sports nutrition for racing. Intraclass correlation coefficient (ICC) was used to assess interrater agreement and chatbot performance was assessed by measuring accuracy, completeness, clarity, evidence quality, and test-retest reliability. In Experiment 2, chatbot performance was evaluated by measuring the accuracy and test-retest reliability of chatbots' answers to multiple-choice questions based on a sports nutrition certification exam. ANOVAs and logistic mixed models were used to analyse chatbot performance. RESULTS In Experiment 1, interrater agreement was good (ICC = 0.893) and accuracy varied from 74% (Gemini1.5pro) to 31% (ClaudePro). Detailed prompts improved Claude's accuracy but had little impact on ChatGPT or Gemini. Completeness scores were highest for ChatGPT-4o compared to other chatbots, which scored low to moderate. The quality of cited evidence was low for all chatbots when simple prompts were used but improved with detailed prompts. In Experiment 2, accuracy ranged from 89% (Claude3.5Sonnet) to 61% (ClaudePro). Test-retest reliability was acceptable across all metrics in both experiments. CONCLUSIONS While generative AI chatbots demonstrate potential in providing sports nutrition guidance, their accuracy is moderate at best and inconsistent between models. Until significant advancements are made, athletes and coaches should consult registered dietitians for tailored nutrition advice.
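For context, an interrater-agreement check like the reported ICC can be computed with the pingouin package, as in this sketch; the ratings below are hypothetical, not the study's data.

```python
# Hedged sketch of an interrater-agreement check with the pingouin
# package; the ratings below are hypothetical, not the study's data.
import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    "answer": list(range(1, 7)) * 2,   # 6 chatbot answers
    "rater":  ["A"] * 6 + ["B"] * 6,   # 2 raters score each answer
    "score":  [4, 3, 5, 2, 4, 5,       # rater A
               5, 3, 4, 2, 4, 4],      # rater B
})

icc = pg.intraclass_corr(data=df, targets="answer",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])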
Affiliation(s)
- Matthew J. Laye
- Idaho College of Osteopathic Medicine, Meridian, Idaho, United States of America
7
Forero DA, Abreu SE, Tovar BE, Oermann MH. Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR). J Med Syst 2025; 49:80. PMID: 40504403. PMCID: PMC12162794. DOI: 10.1007/s10916-025-02212-0.
Abstract
In the context of Evidence-Based Practice (EBP), Systematic Reviews (SRs), Meta-Analyses (MAs), and overviews of reviews have become cornerstones of the synthesis of research findings. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 and Preferred Reporting Items for Overviews of Reviews (PRIOR) statements have become the major reporting guidelines for SRs/MAs and for overviews of reviews, respectively. In recent years, advances in Generative Artificial Intelligence (genAI) have been proposed as a potential major paradigm shift in scientific research. The main aim of this research was to examine the performance of four Large Language Models (LLMs) in the analysis of adherence to PRISMA 2020 and PRIOR in a sample of 20 SRs and 20 overviews of reviews. We tested the free versions of four commonly used LLMs: ChatGPT (GPT-4o), DeepSeek (V3), Gemini (2.0 Flash), and Qwen (2.5 Max). Adherence to PRISMA 2020 and PRIOR was compared with scores defined previously by human experts, using several statistical tests. All four LLMs showed low performance in the analysis of adherence to PRISMA 2020, overestimating the percentage of adherence (from 23% to 30%). For PRIOR, the LLMs showed smaller differences in the estimation of adherence (from 6% to 14%), and ChatGPT showed performance similar to that of human experts. This is the first report of the performance of four commonly used LLMs in the analysis of adherence to PRISMA 2020 and PRIOR. Future studies of adherence to other reporting guidelines will be helpful in health sciences research.
Affiliation(s)
- Diego A Forero
- School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia
- Sandra E Abreu
- Psychology Program, Fundación Universitaria del Área Andina, Medellín, Colombia
- Blanca E Tovar
- Nursing Program, School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia
8
Linardon J, Messer M, Anderson C, Liu C, McClure Z, Jarman HK, Goldberg SB, Torous J. Role of large language models in mental health research: an international survey of researchers' practices and perspectives. BMJ Mental Health 2025; 28:e301787. PMID: 40514050. PMCID: PMC12164621. DOI: 10.1136/bmjment-2025-301787.
Abstract
BACKGROUND Large language models (LLMs) offer significant potential to streamline research workflows and enhance productivity. However, limited data exist on the extent of their adoption within the mental health research community. OBJECTIVE We examined how LLMs are being used in mental health research, the types of tasks they support, barriers to their adoption, and broader attitudes towards their integration. METHODS 714 mental health researchers from 42 countries and various career stages (from PhD student to early career researcher to professor) completed a survey assessing LLM-related practices and perspectives. FINDINGS 496 (69.5%) reported using LLMs to assist with research, with 94% indicating use of ChatGPT. The most common applications were proofreading written work (69%) and refining or generating code (49%). LLM use was more prevalent among early career researchers. Common challenges reported by users included inaccurate responses (78%), ethical concerns (48%), and biased outputs (27%). However, many users indicated that LLMs improved efficiency (73%) and output quality (44%). Reasons for non-use were concerns about ethical issues (53%) and the accuracy of outputs (50%). Most agreed that they wanted more training on responsible use (77%), that researchers should be required to disclose use of LLMs in manuscripts (79%), and that they were concerned about LLMs affecting how their work is evaluated (60%). CONCLUSION While LLM use is widespread in mental health research, key barriers and implementation challenges remain. CLINICAL IMPLICATIONS LLMs may streamline mental health research processes, but clear guidelines are needed to support their ethical and transparent use across the research lifecycle.
Affiliation(s)
- Jake Linardon
- SEED Lifespan Strategic Research Centre, School of Psychology, Faculty of Health, Deakin University, Geelong, Victoria, Australia
- Mariel Messer
- SEED Lifespan Strategic Research Centre, School of Psychology, Faculty of Health, Deakin University, Geelong, Victoria, Australia
- Cleo Anderson
- SEED Lifespan Strategic Research Centre, School of Psychology, Faculty of Health, Deakin University, Geelong, Victoria, Australia
- Claudia Liu
- SEED Lifespan Strategic Research Centre, School of Psychology, Faculty of Health, Deakin University, Geelong, Victoria, Australia
- Zoe McClure
- SEED Lifespan Strategic Research Centre, School of Psychology, Faculty of Health, Deakin University, Geelong, Victoria, Australia
- Hannah K Jarman
- SEED Lifespan Strategic Research Centre, School of Psychology, Faculty of Health, Deakin University, Geelong, Victoria, Australia
- Simon B Goldberg
- Department of Counselling Psychology, University of Wisconsin, Madison, Wisconsin, USA
- Center for Healthy Minds, University of Wisconsin, Madison, Wisconsin, USA
- John Torous
- Division of Digital Psychiatry, Department of Psychiatry, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
9
Su H, Sun Y, Li R, Zhang A, Yang Y, Xiao F, Duan Z, Chen J, Hu Q, Yang T, Xu B, Zhang Q, Zhao J, Li Y, Li H. Large Language Models in Medical Diagnostics: Scoping Review With Bibliometric Analysis. J Med Internet Res 2025; 27:e72062. PMID: 40489764. DOI: 10.2196/72062.
Abstract
BACKGROUND The integration of large language models (LLMs) into medical diagnostics has garnered substantial attention due to their potential to enhance diagnostic accuracy, streamline clinical workflows, and address health care disparities. However, the rapid evolution of LLM research necessitates a comprehensive synthesis of their applications, challenges, and future directions. OBJECTIVE This scoping review aimed to provide an overview of the current state of research regarding the use of LLMs in medical diagnostics. The study sought to answer four primary subquestions, as follows: (1) Which LLMs are commonly used? (2) How are LLMs assessed in diagnosis? (3) What is the current performance of LLMs in diagnosing diseases? (4) Which medical domains are investigating the application of LLMs? METHODS This scoping review was conducted according to the Joanna Briggs Institute Manual for Evidence Synthesis and adheres to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). Relevant literature was searched from the Web of Science, PubMed, Embase, IEEE Xplore, and ACM Digital Library databases from 2022 to 2025. Articles were screened and selected based on predefined inclusion and exclusion criteria. Bibliometric analysis was performed using VOSviewer to identify major research clusters and trends. Data extraction included details on LLM types, application domains, and performance metrics. RESULTS The field is rapidly expanding, with a surge in publications after 2023. GPT-4 and its variants dominated research (70/95, 74% of studies), followed by GPT-3.5 (34/95, 36%). Key applications included disease classification (text or image-based), medical question answering, and diagnostic content generation. LLMs demonstrated high accuracy in specialties like radiology, psychiatry, and neurology but exhibited biases in race, gender, and cost predictions. Ethical concerns, including privacy risks and model hallucination, alongside regulatory fragmentation, were critical barriers to clinical adoption. CONCLUSIONS LLMs hold transformative potential for medical diagnostics but require rigorous validation, bias mitigation, and multimodal integration to address real-world complexities. Future research should prioritize explainable artificial intelligence frameworks, specialty-specific optimization, and international regulatory harmonization to ensure equitable and safe clinical deployment.
Affiliation(s)
- Hankun Su
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Xiangya School of Medicine, Central South University, Changsha, China
- Yuanyuan Sun
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Ruiting Li
- School of Biomedical Sciences and Engineering, South China University of Technology, Guangzhou, China
- Aozhe Zhang
- Xiangya School of Medicine, Central South University, Changsha, China
- Yuemeng Yang
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Xiangya School of Medicine, Central South University, Changsha, China
- Fen Xiao
- Department of Metabolism and Endocrinology, Second Xiangya Hospital of Central South University, Changsha, China
- Zhiying Duan
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Jingjing Chen
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Qin Hu
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Tianli Yang
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Bin Xu
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Qiong Zhang
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Jing Zhao
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Yanping Li
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Hui Li
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
10
Stephan D, Bertsch AS, Schumacher S, Puladi B, Burwinkel M, Al-Nawas B, Kämmerer PW, Thiem DG. Improving Patient Communication by Simplifying AI-Generated Dental Radiology Reports With ChatGPT: Comparative Study. J Med Internet Res 2025; 27:e73337. PMID: 40489773. DOI: 10.2196/73337.
Abstract
BACKGROUND Medical reports, particularly radiology findings, are often written for professional communication, making them difficult for patients to understand. This communication barrier can reduce patient engagement and lead to misinterpretation. Artificial intelligence (AI), especially large language models such as ChatGPT, offers new opportunities for simplifying medical documentation to improve patient comprehension. OBJECTIVE We aimed to evaluate whether AI-generated radiology reports simplified by ChatGPT improve patient understanding, readability, and communication quality compared to original AI-generated reports. METHODS In total, 3 versions of radiology reports were created using ChatGPT: an original AI-generated version (text 1), a patient-friendly, simplified version (text 2), and a further simplified and accessibility-optimized version (text 3). A total of 300 patients (n=100, 33.3% per group), excluding patients with medical education, were randomly assigned to review one text version and complete a standardized questionnaire. Readability was assessed using the Flesch Reading Ease (FRE) score and LIX indices. RESULTS Both simplified texts showed significantly higher readability scores (text 1: FRE score=51.1; text 2: FRE score=55.0; and text 3: FRE score=56.4; P<.001) and lower LIX scores, indicating enhanced clarity. Text 3 had the shortest sentences, had the fewest long words, and scored best on all patient-rated dimensions. Questionnaire results revealed significantly higher ratings for texts 2 and 3 across clarity (P<.001), tone (P<.001), structure, and patient engagement. For example, patients rated the ability to understand findings without help highest for text 3 (mean 1.5, SD 0.7) and lowest for text 1 (mean 3.1, SD 1.4). Both simplified texts significantly improved patients' ability to prepare for clinical conversations and promoted shared decision-making. CONCLUSIONS AI-generated simplification of radiology reports significantly enhances patient comprehension and engagement. These findings highlight the potential of ChatGPT as a tool to improve patient-centered communication. While promising, future research should focus on ensuring clinical accuracy and exploring applications across diverse patient populations to support equitable and effective integration of AI in health care communication.
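Both readability indices follow simple formulas. The sketch below implements the standard (English-language) Flesch Reading Ease and LIX definitions; note that German-language studies often use the Amstad variant of FRE with different constants, and the syllable counter here is only a rough heuristic.

```python
# Sketch of the standard (English) Flesch Reading Ease and LIX formulas.
# German studies often use the Amstad FRE variant (different constants),
# and the syllable counter here is only a rough heuristic.
import re

def _words(text: str) -> list[str]:
    return re.findall(r"[A-Za-zÄÖÜäöüß]+", text)

def _sentences(text: str) -> int:
    return max(1, len(re.findall(r"[.!?]+", text)))

def _syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouyäöü]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    words = _words(text)
    n = max(1, len(words))
    syl = sum(_syllables(w) for w in words)
    return 206.835 - 1.015 * n / _sentences(text) - 84.6 * syl / n

def lix(text: str) -> float:
    words = _words(text)
    n = max(1, len(words))
    long_words = sum(1 for w in words if len(w) > 6)
    return n / _sentences(text) + 100 * long_words / n

sample = "The X-ray shows a small shadow. It is most likely harmless."
print(f"FRE: {flesch_reading_ease(sample):.1f}, LIX: {lix(sample):.1f}")
```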
Affiliation(s)
- Daniel Stephan
- Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre, Johannes Gutenberg-University Mainz, Mainz, Germany
- Annika S Bertsch
- Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre, Johannes Gutenberg-University Mainz, Mainz, Germany
- Sophia Schumacher
- Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre, Johannes Gutenberg-University Mainz, Mainz, Germany
- Behrus Puladi
- Department of Oral and Maxillofacial Surgery, University Hospital Rheinisch-Westfälische Technische Hochschule Aachen, Aachen, Germany
- Matthias Burwinkel
- Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre, Johannes Gutenberg-University Mainz, Mainz, Germany
- Bilal Al-Nawas
- Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre, Johannes Gutenberg-University Mainz, Mainz, Germany
- Peer W Kämmerer
- Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre, Johannes Gutenberg-University Mainz, Mainz, Germany
- Daniel Ge Thiem
- Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre, Johannes Gutenberg-University Mainz, Mainz, Germany
11
Hijazi W, Builoff V, Killekar A, Shanbhag A, Miller RJ, Dey D, Liang JX, Flood K, Berman D, Bourque JM, Phillips LM, Chareonthaitawee P, Slomka PJ. Priming with specific context improves large language model performance on nuclear cardiology board preparation test. J Nucl Cardiol 2025:102269. PMID: 40490095. DOI: 10.1016/j.nuclcard.2025.102269.
Affiliation(s)
- Waseem Hijazi
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Valerie Builoff
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Aditya Killekar
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Aakash Shanbhag
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Signal and Image Processing Institute, Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, USA
- Robert JH Miller
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Department of Cardiac Sciences, University of Calgary, Calgary, AB, Canada
- Damini Dey
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Joanna X Liang
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Kathleen Flood
- American Society of Nuclear Cardiology, Fairfax, Virginia, USA
- Daniel Berman
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Jamieson M Bourque
- Division of Cardiovascular Medicine and Radiology, University of Virginia Health System, Charlottesville, VA, USA
- Lawrence M Phillips
- Leon H. Charney Division of Cardiology, Department of Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Piotr J Slomka
- Departments of Medicine (Division of Artificial Intelligence in Medicine), Imaging and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA
12
Wu X, Huang Y, He Q. A large language model improves clinicians' diagnostic performance in complex critical illness cases. Crit Care 2025; 29:230. PMID: 40481529. PMCID: PMC12143052. DOI: 10.1186/s13054-025-05468-7.
Abstract
BACKGROUND Large language models (LLMs) have demonstrated potential in assisting clinical decision-making. However, studies evaluating LLMs' diagnostic performance on complex critical illness cases are lacking. We aimed to assess the diagnostic accuracy and response quality of an artificial intelligence (AI) model, and evaluate its potential benefits in assisting critical care residents with differential diagnosis of complex cases. METHODS This prospective comparative study collected challenging critical illness cases from the literature. Critical care residents from tertiary teaching hospitals were recruited and randomly assigned to non-AI-assisted physician and AI-assisted physician groups. We selected a reasoning model, DeepSeek-R1, for our study. We evaluated the model's response quality using Likert scales, and we compared the diagnostic accuracy and efficiency between groups. RESULTS A total of 48 cases were included. Thirty-two critical care residents were recruited, with 16 residents assigned to each group. Each resident handled an average of 3 cases. DeepSeek-R1's responses received median Likert grades of 4.0 (IQR 4.0-5.0; 95% CI 4.0-4.5) for completeness, 5.0 (IQR 4.0-5.0; 95% CI 4.5-5.0) for clarity, and 5.0 (IQR 4.0-5.0; 95% CI 4.0-5.0) for usefulness. The AI model's top diagnosis accuracy was 60% (29/48; 95% CI 0.456-0.729), with a median differential diagnosis quality score of 5.0 (IQR 4.0-5.0; 95% CI 4.5-5.0). Top diagnosis accuracy was 27% (13/48; 95% CI 0.146-0.396) in the non-AI-assisted physician group versus 58% (28/48; 95% CI 0.438-0.729) in the AI-assisted physician group. Median differential quality scores were 3.0 (IQR 0-5.0; 95% CI 2.0-4.0) without and 5.0 (IQR 3.0-5.0; 95% CI 3.0-5.0) with AI assistance. The AI model showed higher diagnostic accuracy than residents, and AI assistance significantly improved residents' accuracy. The residents' diagnostic time significantly decreased with AI assistance (median, 972 s; IQR 570-1320; 95% CI 675-1200) versus without (median, 1920 s; IQR 1320-2640; 95% CI 1710-2370). CONCLUSIONS For diagnostically difficult critical illness cases, DeepSeek-R1 generates high-quality information, achieves reasonable diagnostic accuracy, and significantly improves residents' diagnostic accuracy and efficiency. Reasoning models are suggested to be promising diagnostic adjuncts in intensive care units.
Affiliation(s)
- Xintong Wu
- Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
- Yu Huang
- Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
- Qing He
- Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China
13
Su Y, Babore YB, Kahn CE. A Large Language Model to Detect Negated Expressions in Radiology Reports. Journal of Imaging Informatics in Medicine 2025; 38:1297-1303. PMID: 39322813. DOI: 10.1007/s10278-024-01274-9.
Abstract
Natural language processing (NLP) is crucial to extract information accurately from unstructured text to provide insights for clinical decision-making, quality improvement, and medical research. This study compared the performance of a rule-based NLP system and a medical-domain transformer-based model to detect negated concepts in radiology reports. Using a corpus of 984 de-identified radiology reports from a large U.S.-based academic health system (1000 consecutive reports, excluding 16 duplicates), the investigators compared the rule-based medspaCy system and the Clinical Assertion and Negation Classification Bidirectional Encoder Representations from Transformers (CAN-BERT) system to detect negated expressions of terms from RadLex, the Unified Medical Language System Metathesaurus, and the Radiology Gamuts Ontology. Power analysis determined a sample size of 382 terms to achieve α = 0.05 and β = 0.8 for McNemar's test; based on an estimate of 15% negated terms, 2800 randomly selected terms were annotated manually as negated or not negated. Precision, recall, and F1 of the two models were compared using McNemar's test. Of the 2800 terms, 387 (13.8%) were negated. For negation detection, medspaCy attained a recall of 0.795, precision of 0.356, and F1 of 0.492. CAN-BERT achieved a recall of 0.785, precision of 0.768, and F1 of 0.777. Although recall was not significantly different, CAN-BERT had significantly better precision (χ2 = 304.64; p < 0.001). The transformer-based CAN-BERT model detected negated terms in radiology reports with high precision and recall; its precision significantly exceeded that of the rule-based medspaCy system. Use of this system will improve data extraction from textual reports to support information retrieval, AI model training, and discovery of causal relationships.
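The paired comparison behind such precision results is typically McNemar's test on the systems' disagreements; the sketch below shows the computation with statsmodels on hypothetical counts, not the paper's contingency table.

```python
# Hedged sketch: McNemar's test on paired correct/incorrect outcomes for
# two negation detectors. The 2x2 counts are hypothetical, not the
# paper's contingency table.
from statsmodels.stats.contingency_tables import mcnemar

table = [[2100, 60],   # medspaCy correct: (CAN-BERT correct, CAN-BERT wrong)
         [380, 260]]   # medspaCy wrong:   (CAN-BERT correct, CAN-BERT wrong)
result = mcnemar(table, exact=False, correction=True)
print(f"chi2 = {result.statistic:.2f}, p = {result.pvalue:.4g}")
```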
Affiliation(s)
- Yvonne Su
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Yonatan B Babore
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Charles E Kahn
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
14
Park SH, Dean G, Ortiz EM, Choi JI. Overview of South Korean Guidelines for Approval of Large Language or Multimodal Models as Medical Devices: Key Features and Areas for Improvement. Korean J Radiol 2025; 26:519-523. PMID: 40288893. PMCID: PMC12123075. DOI: 10.3348/kjr.2025.0257.
Affiliation(s)
- Seong Ho Park
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
- Geraldine Dean
- Telemedicine Clinic Ltd. (a Unilabs company), Barcelona, Spain
- Unilabs AI Centre of Excellence, Barcelona, Spain
- NHS Southwest London, London, United Kingdom
- Joon-Il Choi
- Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
15
Boyle A, Huo B, Sylla P, Calabrese E, Kumar S, Slater BJ, Walsh DS, Vosburg RW. Large language model-generated clinical practice guideline for appendicitis. Surg Endosc 2025; 39:3539-3551. PMID: 40251310. DOI: 10.1007/s00464-025-11723-3.
Abstract
BACKGROUND Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest SAGES guideline on the surgical management of appendicitis as a comparison. METHODS Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline. RESULTS Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk-of-bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline; in 19 of the 24 domains, the two guidelines scored within two points of each other. CONCLUSIONS LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce the time and resource burden associated with these tasks. As new models are developed, the role of LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs into each step of guideline development.
Affiliation(s)
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, ON, Canada
- Patricia Sylla
- Division of Colon and Rectal Surgery, Department of Surgery, Mount Sinai Hospital, New York, NY, USA
- Elisa Calabrese
- Department of Surgery, University of Adelaide, The Queen Elizabeth Hospital, Adelaide, SA, Australia
- Sunjay Kumar
- Department of General Surgery, Thomas Jefferson University Hospital, Philadelphia, PA, USA
- Danielle S Walsh
- Department of Surgery, University of Kentucky, Lexington, KY, USA
- R Wesley Vosburg
- Department of Surgery, Grand Strand Medical Center, Myrtle Beach, SC, USA
16
Turner KM, Ahmad SA. Large language models as clinical decision support tools: A call for careful implementation in the care of patients with pancreatic cancer. Surgery 2025; 182:109378. PMID: 40287319. DOI: 10.1016/j.surg.2025.109378.
Affiliation(s)
- Kevin M Turner
- Department of Surgery, University of Cincinnati College of Medicine, Cincinnati, OH. https://twitter.com/KevinTurnerMD
- Syed A Ahmad
- Department of Surgery, Division of Surgical Oncology, University of Cincinnati College of Medicine, Cincinnati, OH
17
Hoch CC, Funk PF, Guntinas-Lichius O, Volk GF, Lüers JC, Hussain T, Wirth M, Schmidl B, Wollenberg B, Alfertshofer M. Harnessing advanced large language models in otolaryngology board examinations: an investigation using Python and application programming interfaces. Eur Arch Otorhinolaryngol 2025; 282:3317-3328. PMID: 40281318. PMCID: PMC12122622. DOI: 10.1007/s00405-025-09404-x.
Abstract
PURPOSE This study aimed to explore the capabilities of advanced large language models (LLMs), including OpenAI's GPT-4 variants, Google's Gemini series, and Anthropic's Claude series, in addressing highly specialized otolaryngology board examination questions. Additionally, the study included a longitudinal assessment of GPT-3.5 Turbo, which had been evaluated with the same set of questions one year earlier, to identify changes in its performance over time. METHODS We utilized a question bank comprising 2,576 multiple-choice and single-choice questions from a German online education platform tailored for otolaryngology board certification preparation. The questions were submitted to 11 different LLMs, including GPT-3.5 Turbo, GPT-4 variants, Gemini models, and Claude models, through Application Programming Interfaces (APIs) using Python scripts, facilitating efficient data collection and processing. RESULTS GPT-4o demonstrated the highest accuracy among all models, particularly excelling in categories such as allergology and head and neck tumors. While the Claude models showed competitive performance, they generally lagged behind the GPT-4 variants. A comparison with GPT-3.5 Turbo's performance one year earlier revealed a significant decline in its accuracy. Newer LLMs displayed varied performance levels, with single-choice questions consistently yielding higher accuracy than multiple-choice questions across all models. CONCLUSION While newer LLMs show strong potential in addressing specialized medical content, the observed decline in GPT-3.5 Turbo's performance over time underscores the necessity of continuous evaluation. This study highlights the critical need for ongoing optimization and efficient API usage to realize the potential of LLMs for applications in medical education and certification.
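The API-driven protocol described here can be reproduced with a short loop like the one below (our reconstruction, not the authors' script). It uses the OpenAI Python SDK; the model name is a placeholder for any evaluated model, and the question shown is invented, not an item from the proprietary exam bank.

```python
# Hedged sketch of an API-based exam loop (our reconstruction, not the
# authors' script). Requires the `openai` package and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

# Placeholder item; the study's 2,576 questions are proprietary.
questions = [
    {"id": 1,
     "stem": "Which cranial nerve passes through the parotid gland?",
     "options": ["A) Facial nerve", "B) Trigeminal nerve",
                 "C) Vagus nerve", "D) Hypoglossal nerve"]},
]

for q in questions:
    prompt = (q["stem"] + "\n" + "\n".join(q["options"])
              + "\nAnswer with the letter(s) only.")
    reply = client.chat.completions.create(
        model="gpt-4o",                      # swap in any evaluated model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                       # deterministic grading runs
    )
    print(q["id"], reply.choices[0].message.content.strip())
```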
Collapse
Affiliation(s)
- Cosima C Hoch
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany.
| | - Paul F Funk
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747, Jena, Germany
| | - Orlando Guntinas-Lichius
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747, Jena, Germany
| | - Gerd Fabian Volk
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747, Jena, Germany
| | - Jan-Christoffer Lüers
- Department of Otorhinolaryngology, Head and Neck Surgery, Medical Faculty, University of Cologne, 50937, Cologne, Germany
| | - Timon Hussain
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
| | - Markus Wirth
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
| | - Benedikt Schmidl
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
| | - Barbara Wollenberg
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
| | - Michael Alfertshofer
- Department of Oral and Maxillofacial Surgery, Berlin Institute of Health, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 10117 Berlin, Germany
| |
Collapse
|
18
|
Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, Kleesiek J, Sushil M, Adams LC, Bressem KK. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J Am Med Inform Assoc 2025; 32:1015-1024. [PMID: 40190132 DOI: 10.1093/jamia/ocaf045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2024] [Accepted: 03/02/2025] [Indexed: 05/21/2025] Open
Abstract
OBJECTIVES Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. MATERIALS AND METHODS We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. RESULTS Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate. DISCUSSION Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation. CONCLUSION Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.
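The head-to-head design, scoring a biomedical and a general-purpose model on the same unseen cases and comparing accuracies, can be summarized in a short sketch. The model callables and case data below are hypothetical placeholders, not the study's harness.

```python
from collections.abc import Callable

def accuracy(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer matches the reference diagnosis."""
    hits = sum(model(vignette).strip().lower() == answer.lower()
               for vignette, answer in cases)
    return hits / len(cases)

def compare(models: dict[str, Callable[[str], str]], cases: list[tuple[str, str]]) -> None:
    # Print per-model accuracy on the shared benchmark cases.
    for name, model in models.items():
        print(f"{name}: {accuracy(model, cases):.1%}")

# Usage sketch (model callables and case lists are placeholders):
# compare({"OpenBioLLM-70B": bio_llm, "Llama-3-70B-Instruct": general_llm}, jama_cases)
```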
Collapse
Affiliation(s)
- Felix J Dorfner
- Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin 10117, Germany
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, United States
| | - Amin Dada
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
| | - Felix Busch
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
| | - Marcus R Makowski
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
| | - Tianyu Han
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
| | - Jens Kleesiek
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
- Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen, University Hospital Essen (AöR), Essen 45147, Germany
- German Cancer Consortium (DKTK, Partner Site Essen), Heidelberg, Germany
- Department of Physics, TU Dortmund, Dortmund 44227, Germany
| | - Madhumita Sushil
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
| | - Lisa C Adams
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
| | - Keno K Bressem
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- German Heart Center Munich, Technical University Munich, Munich 80636, Germany
| |
Collapse
|
19
|
Wang Z, Sun J, Liu H, Luo X, Li J, He W, Yang Z, Lv H, Chen Y, Wang Z. Development and Performance of a Large Language Model for the Quality Evaluation of Multi-Language Medical Imaging Guidelines and Consensus. J Evid Based Med 2025; 18:e70020. [PMID: 40181523 DOI: 10.1111/jebm.70020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/01/2025] [Revised: 03/11/2025] [Accepted: 03/16/2025] [Indexed: 04/05/2025]
Abstract
AIM This study aimed to develop and evaluate an automated large language model (LLM)-based system for assessing the quality of medical imaging guidelines and consensus (GACS) in different languages, focusing on enhancing evaluation efficiency, consistency, and reducing manual workload. METHOD We developed the QPC-HASE-GuidelineEval algorithm, which integrates a Four-Quadrant Questions Classification Strategy and Hybrid Search Enhancement. The model was validated on 45 medical imaging guidelines (36 in Chinese and 9 in English) published in 2021 and 2022. Key evaluation metrics included consistency with expert assessments, hybrid search paragraph matching accuracy, information completeness, comparisons of different paragraph matching approaches, and cost-time efficiency. RESULTS The algorithm demonstrated an average accuracy of 77%, excelling in simpler tasks but showing lower accuracy (29%-40%) in complex evaluations, such as explanations and visual aids. The average accuracy rates of the English and Chinese versions of the GACS were 74% and 76%, respectively (p = 0.37). Hybrid search demonstrated superior performance with paragraph matching accuracy (4.42) and information completeness (4.42), significantly outperforming keyword-based search (1.05/1.05) and sparse-dense retrieval (4.26/3.63). The algorithm significantly reduced evaluation time to 8 min and 30 s per guideline and reduced costs to approximately 0.5 USD per guideline, offering a considerable advantage over traditional manual methods. CONCLUSION The QPC-HASE-GuidelineEval algorithm, powered by LLMs, showed strong potential for improving the efficiency, scalability, and multi-language capability of guideline evaluations, though further enhancements are needed to handle more complex tasks that require deeper interpretation.
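The hybrid-search idea credited here for better paragraph matching, fusing a lexical score with a dense-embedding similarity before ranking, can be illustrated with a brief sketch. The scoring functions and the fusion weight are illustrative assumptions rather than the QPC-HASE-GuidelineEval implementation.

```python
import numpy as np

def keyword_score(query: str, paragraph: str) -> float:
    """Crude lexical overlap; a real system would use BM25 (e.g., via Elasticsearch)."""
    q, p = set(query.lower().split()), set(paragraph.lower().split())
    return len(q & p) / (len(q) or 1)

def dense_score(q_vec: np.ndarray, p_vec: np.ndarray) -> float:
    """Cosine similarity between query and paragraph embeddings."""
    return float(q_vec @ p_vec / (np.linalg.norm(q_vec) * np.linalg.norm(p_vec)))

def hybrid_rank(query, q_vec, paragraphs, p_vecs, alpha=0.5):
    """Rank paragraphs by a weighted sum of lexical and semantic similarity."""
    scores = [alpha * keyword_score(query, p) + (1 - alpha) * dense_score(q_vec, v)
              for p, v in zip(paragraphs, p_vecs)]
    return sorted(zip(scores, paragraphs), reverse=True)
```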
Collapse
Affiliation(s)
- Zhixiang Wang
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Precision and Intelligent Imaging Laboratory, Beijing Friendship Hospital, Capital Medical University, Beijing, China
| | - Jing Sun
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
| | - Hui Liu
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
| | - Xufei Luo
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
| | - Jia Li
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
| | - Wenjun He
- Dermatology Hospital, Southern Medical University, Guangzhou, China
- Acacia Laboratory for Implementation Science, School of Health Management, Southern Medical University, Guangzhou, China
| | - Zhenhua Yang
- Vincent V.C. Woo Chinese Medicine Clinical Research Institute, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
| | - Han Lv
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
| | - Yaolong Chen
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
| | - Zhenchang Wang
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
| |
Collapse
|
20
|
Guan H, Novoa-Laurentiev J, Zhou L. CD-Tron: Leveraging large clinical language model for early detection of cognitive decline from electronic health records. J Biomed Inform 2025; 166:104830. [PMID: 40320101 DOI: 10.1016/j.jbi.2025.104830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2024] [Revised: 03/28/2025] [Accepted: 04/13/2025] [Indexed: 05/08/2025]
Abstract
BACKGROUND Early detection of cognitive decline during the preclinical stage of Alzheimer's disease and related dementias (AD/ADRD) is crucial for timely intervention and treatment. Clinical notes in the electronic health record contain valuable information that can aid in the early identification of cognitive decline. In this study, we utilize advanced large clinical language models, fine-tuned on clinical notes, to improve the early detection of cognitive decline. METHODS We collected clinical notes from 2,166 patients spanning the 4 years preceding their initial mild cognitive impairment (MCI) diagnosis from the Enterprise Data Warehouse of Mass General Brigham. To train the model, we developed CD-Tron, built upon a large clinical language model that was fine-tuned using 4,949 expert-labeled note sections. For evaluation, the trained model was applied to 1,996 independent note sections to assess its performance on real-world unstructured clinical data. Additionally, we used explainable AI techniques, specifically SHAP values (SHapley Additive exPlanations), to interpret the model's predictions and provide insight into the most influential features. Error analysis was also conducted to further examine the model's predictions. RESULTS CD-Tron significantly outperforms baseline models, achieving notable improvements in precision, recall, and AUC metrics for detecting cognitive decline (CD). Tested on real-world clinical notes, CD-Tron demonstrated high sensitivity with only one false negative, which is crucial for clinical applications prioritizing early and accurate CD detection. SHAP-based interpretability analysis highlighted key textual features contributing to model predictions, supporting transparency and clinician understanding. CONCLUSION CD-Tron offers a novel approach to early cognitive decline detection by applying large clinical language models to free-text EHR data. Pretrained on real-world clinical notes, it accurately identifies early cognitive decline and integrates SHAP for interpretability, enhancing transparency in predictions.
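A minimal sketch of the SHAP step, applying a text explainer to a fine-tuned classifier via the shap and transformers libraries, is shown below. The checkpoint path and output label name are placeholders; CD-Tron itself is not publicly assumed here.

```python
import shap
from transformers import pipeline

# Hypothetical fine-tuned note-section classifier (cognitive decline vs. not);
# the checkpoint path is a placeholder, not a released CD-Tron model.
clf = pipeline("text-classification", model="path/to/finetuned-clinical-model",
               return_all_scores=True)

explainer = shap.Explainer(clf)  # wraps the pipeline with a text masker
note = "Patient's daughter reports increasing forgetfulness and missed appointments."
shap_values = explainer([note])

# Token-level contributions to the positive class ("LABEL_1" is a placeholder name):
shap.plots.text(shap_values[:, :, "LABEL_1"])
```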
Collapse
Affiliation(s)
- Hao Guan
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA.
| | - John Novoa-Laurentiev
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA
| | - Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
| |
Collapse
|
21
|
Satheakeerthy S, Jesudason D, Pietris J, Bacchi S, Chan WO. LLM-assisted medical documentation: efficacy, errors, and ethical considerations in ophthalmology. Eye (Lond) 2025; 39:1440-1442. [PMID: 40148503 PMCID: PMC12089378 DOI: 10.1038/s41433-025-03767-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2025] [Revised: 03/05/2025] [Accepted: 03/19/2025] [Indexed: 03/29/2025] Open
Affiliation(s)
- Shrirajh Satheakeerthy
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
| | - Daniel Jesudason
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia.
| | - James Pietris
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
| | - Stephen Bacchi
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
- Massachusetts General Hospital, 55 Fruit St, Boston, MA, 02114, USA
| | - Weng Onn Chan
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
- The Queen Elizabeth Hospital, 28 Woodville Rd, Woodville South, SA, 5011, Australia
| |
Collapse
|
22
|
Wu Y, Zhang Y, Wu Y, Zheng Q, Li X, Chen X. ChatIOS: Improving automatic 3-dimensional tooth segmentation via GPT-4V and multimodal pre-training. J Dent 2025; 157:105755. [PMID: 40228651 DOI: 10.1016/j.jdent.2025.105755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2024] [Revised: 03/26/2025] [Accepted: 04/10/2025] [Indexed: 04/16/2025] Open
Abstract
OBJECTIVES This study aims to propose a framework that integrates GPT-4V, a recent advanced version of ChatGPT, and multimodal pre-training techniques to enhance deep learning algorithms for 3-dimensional (3D) tooth segmentation in scans produced by intraoral scanners (IOSs). METHODS The framework was developed on 1800 intraoral scans of approximately 24,000 annotated teeth (training set: 1200 scans, 16,004 teeth; testing set: 600 scans, 7995 teeth) from the Teeth3DS dataset, gathered from 900 patients and covering both maxillary and mandibular regions. The first step of the proposed framework, ChatIOS, is to pre-process the 3D IOS data to extract 3D point clouds. Then, GPT-4V generates detailed descriptions of 2-dimensional (2D) IOS images taken from different view angles. In the multimodal pre-training, triplets, which comprise point clouds, 2D images, and text descriptions, serve as inputs. A series of ablation studies were systematically conducted to illustrate the superior design of the automatic 3D tooth segmentation system. Our quantitative evaluation criteria included segmentation quality, processing speed, and clinical applicability. RESULTS When tested on 600 scans, ChatIOS substantially outperformed existing benchmarks such as PointNet++ across all metrics, including mean intersection-over-union (mIoU, from 90.3% to 93.0% for maxillary and from 89.2% to 92.3% for mandible scans), segmentation accuracy (from 97.0% to 98.0% for maxillary and from 96.8% to 97.9% for mandible scans) and dice similarity coefficient (DSC, from 98.1% to 98.7% for maxillary and from 97.9% to 98.6% for mandible scans). Our model took only approximately 2 s to generate segmentation outputs per scan and exhibited acceptable consistency with clinical expert evaluations. CONCLUSIONS Our ChatIOS framework can increase the effectiveness and efficiency of 3D tooth segmentation that clinical procedures require, including orthodontic and prosthetic treatments. This study presents an early exploration of the applications of GPT-4V in digital dentistry and also pioneers the multimodal pre-training paradigm for 3D tooth segmentation. CLINICAL SIGNIFICANCE Accurate segmentation of teeth on 3D intraoral scans is critical for orthodontic and prosthetic treatments. ChatIOS can integrate GPT-4V with pre-trained vision-language models (VLMs) to gain an in-depth understanding of IOS data, which can contribute to more efficient and precise tooth segmentation systems.
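Triplet pre-training of this kind, aligning point-cloud, image, and text embeddings, is commonly realized with a CLIP-style contrastive objective; the sketch below shows that generic pattern under stated assumptions, not the ChatIOS architecture itself.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss pulling matched embeddings together."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def triplet_loss(pc_emb, img_emb, txt_emb):
    # Align all three modality pairs of each (point cloud, image, text) triplet.
    return (info_nce(pc_emb, img_emb)
            + info_nce(pc_emb, txt_emb)
            + info_nce(img_emb, txt_emb))

# Usage sketch with random stand-ins for encoder outputs (batch of 8, dim 256):
loss = triplet_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))  # scalar contrastive objective to be minimized during pre-training
```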
Collapse
Affiliation(s)
- Yongjia Wu
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China.
| | - Yun Zhang
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China.
| | - Yange Wu
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
| | - Qianhan Zheng
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
| | - Xiaojun Li
- Department of Periodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China.
| | - Xuepeng Chen
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China.
| |
Collapse
|
23
|
Armitage RC. How do GPs Want Large Language Models to be Applied in Primary Care, and What Are Their Concerns? A Cross-Sectional Survey. J Eval Clin Pract 2025; 31:e70129. [PMID: 40369934 PMCID: PMC12079004 DOI: 10.1111/jep.70129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/14/2024] [Revised: 03/16/2025] [Accepted: 04/14/2025] [Indexed: 05/16/2025]
Abstract
INTRODUCTION Although the potential utility of large language models (LLMs) in medicine and healthcare is substantial, no assessment has been made to date of how GPs want LLMs to be applied in primary care, or of which issues GPs are most concerned about regarding the implementation of LLMs into their clinical practice. This study's objective was to generate preliminary evidence that answers these questions, which are relevant because GPs themselves will ultimately harness the power of LLMs in primary care. METHODS Non-probability sampling was utilised: GPs practicing in the UK and who were members of one of two Facebook groups (one containing a community of UK primary care staff, the other containing a community of GMC-registered doctors in the UK) were invited to complete an online survey, which ran from 06 to 13 November 2024. RESULTS The survey received 113 responses, 107 of which were from GPs practicing in the UK. When LLM accuracy and safety were assumed to be guaranteed, broad enthusiasm for LLMs carrying out various nonclinical and clinical tasks in primary care was reported. The single nonclinical task and clinical task that respondents were most supportive of were the LLM listening to the consultation and writing notes in real-time for the GP to review, edit, and save (44.0%), and the LLM identifying outstanding clinical tasks and actioning them (51.0%), respectively. Respondents were concerned with a range of issues regarding LLMs being embedded into clinical systems, with patient safety being the most commonly reported single issue of concern (36.2%). DISCUSSION This study has generated preliminary evidence that is of potential utility to those developing LLMs for use in primary care. Further research is required to expand this evidence base to further inform the development of these technologies, and to ensure they are acceptable to the GPs who will use them.
Collapse
Affiliation(s)
- Richard C. Armitage
- Academic Unit of Population and Lifespan Sciences, School of Medicine, University of Nottingham, Nottingham, UK
| |
Collapse
|
24
|
Choi H, Lee D, Kang YK, Suh M. Empowering PET imaging reporting with retrieval-augmented large language models and reading reports database: a pilot single center study. Eur J Nucl Med Mol Imaging 2025; 52:2452-2462. [PMID: 39843863 PMCID: PMC12119711 DOI: 10.1007/s00259-025-07101-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2024] [Accepted: 01/17/2025] [Indexed: 01/24/2025]
Abstract
PURPOSE Large language models (LLMs) have the potential to enhance a variety of natural language tasks in clinical fields, including medical imaging reporting. This pilot study examines the efficacy of a retrieval-augmented generation (RAG) LLM system, which leverages the zero-shot learning capability of LLMs and is integrated with a comprehensive database of PET reading reports, in improving reference to prior reports and decision making. METHODS We developed a custom LLM framework with retrieval capabilities, leveraging a database of over 10 years of PET imaging reports from a single center. The system uses vector space embedding to facilitate similarity-based retrieval. Queries prompt the system to generate context-based answers and identify similar cases or differential diagnoses. From routine clinical PET readings, experienced nuclear medicine physicians evaluated the performance of the system in terms of the relevance of queried similar cases and the appropriateness score of suggested potential diagnoses. RESULTS The system efficiently organized embedded vectors from PET reports, showing that imaging reports were accurately clustered within the embedded vector space according to the diagnosis or PET study type. Based on this system, a proof-of-concept chatbot was developed and showed the framework's potential in referencing reports of previous similar cases and identifying exemplary cases for various purposes. From routine clinical PET readings, 84.1% of the cases retrieved relevant similar cases, as agreed upon by all three readers. Using the RAG system, the appropriateness score of the suggested potential diagnoses was significantly better than that of the LLM without RAG. Additionally, it demonstrated the capability to offer differential diagnoses, leveraging the vast database to enhance the completeness and precision of generated reports. CONCLUSION The integration of a RAG LLM with a large database of PET imaging reports suggests the potential to support the clinical practice of nuclear medicine imaging reading through various AI tasks, including finding similar cases and deriving potential diagnoses from them. This study underscores the potential of advanced AI tools in transforming medical imaging reporting practices.
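A minimal sketch of the similarity-based retrieval step, embedding prior reports and handing the nearest neighbours to an LLM as context, is shown below; the embedding model and report snippets are illustrative assumptions, as the study's components are not specified here.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
reports = ["FDG PET: hypermetabolic right upper lobe nodule ...",
           "Amyloid PET: no significant cortical tracer retention ..."]  # prior reports
report_vecs = encoder.encode(reports, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k prior reports most similar to the query in embedding space."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    order = np.argsort(report_vecs @ q)[::-1][:k]  # cosine similarity, descending
    return [reports[i] for i in order]

context = "\n---\n".join(retrieve("solitary pulmonary nodule, suspected malignancy"))
prompt = (f"Similar prior cases:\n{context}\n\n"
          "Suggest differential diagnoses for the current study.")
# `prompt` would then be sent to the LLM so its answer is grounded in prior reports.
```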
Collapse
Affiliation(s)
- Hongyoon Choi
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
- Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Portrai, Inc., Seoul, Republic of Korea.
| | | | - Yeon-Koo Kang
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
| | - Minseok Suh
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
- Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| |
Collapse
|
25
|
Deng L, Wu Y, Ren Y, Lu H. Autonomous Self-Evolving Research on Biomedical Data: The DREAM Paradigm. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025; 12:e2417066. [PMID: 40344513 PMCID: PMC12165099 DOI: 10.1002/advs.202417066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/17/2024] [Revised: 04/12/2025] [Indexed: 05/11/2025]
Abstract
In contemporary biomedical research, the efficiency of data-driven methodologies is constrained by large data volumes, the complexity of tool selection, and limited human resources. To address these challenges, a Data-dRiven self-Evolving Autonomous systeM (DREAM) is developed as the first fully autonomous biomedical research system capable of independently conducting scientific investigations without human intervention. DREAM autonomously formulates and evolves scientific questions, configures computational environments, and performs result evaluation and validation. Unlike existing semi-autonomous systems, DREAM operates without manual intervention and is validated in real-world biomedical scenarios. It exceeds the average performance of top scientists in question generation, achieves a higher success rate in environment configuration than experienced human researchers, and uncovers novel scientific findings. In the context of the Framingham Heart Study, it demonstrated an efficiency that is over 10 000 times greater than that of average scientists. As a fully autonomous, self-evolving system, DREAM offers a robust and efficient solution for accelerating biomedical discovery and advancing other data-driven scientific disciplines.
Collapse
Affiliation(s)
- Luojia Deng
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yijie Wu
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yongyong Ren
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Hui Lu
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai 200240, China
| |
Collapse
|
26
|
Lamprecht CB, Lyerly M, Lucke-Wold B. Commentary: CNS-CLIP: Transforming a Neurosurgical Journal Into a Multimodal Medical Model. Neurosurgery 2025; 96:e123-e124. [PMID: 39636115 DOI: 10.1227/neu.0000000000003298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2024] [Accepted: 10/17/2024] [Indexed: 12/07/2024] Open
Affiliation(s)
- Chris B Lamprecht
- Department of Neurosurgery, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Mac Lyerly
- Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA
| | - Brandon Lucke-Wold
- Lillian S. Wells Department of Neurosurgery, University of Florida, Gainesville, Florida, USA
| |
Collapse
|
27
|
Deng A, Chen W, Dai J, Jiang L, Chen Y, Chen Y, Jiang J, Rao M. Current application of ChatGPT in undergraduate nuclear medicine education: Taking Chongqing Medical University as an example. MEDICAL TEACHER 2025; 47:997-1003. [PMID: 39305476 DOI: 10.1080/0142159x.2024.2399673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/19/2024] [Accepted: 08/29/2024] [Indexed: 10/03/2024]
Abstract
BACKGROUND Nuclear Medicine (NM), as an inherently interdisciplinary field, integrates diverse scientific principles and advanced imaging techniques. The advent of ChatGPT, a large language model, opens new avenues for medical educational innovation. With its advanced natural language processing abilities and complex algorithms, ChatGPT holds the potential to substantially enrich medical education, particularly in NM. OBJECTIVE To investigate the current application of ChatGPT in undergraduate Nuclear Medicine Education (NME). METHODS Employing a mixed-methods sequential explanatory design, the research investigates the current status of NME, the use of ChatGPT and the attitude towards ChatGPT among teachers and students in the Second Clinical College of Chongqing Medical University. RESULTS The investigation yields several salient findings: (1) Students and educators in NM face numerous challenges in the learning process; (2) ChatGPT is found to possess significant applicability and potential benefits in NME; (3) There is a pronounced inclination among respondents to adopt ChatGPT, with a keen interest in its diverse applications within the educational sphere. CONCLUSION ChatGPT has been utilized to address the difficulties faced by undergraduates at Chongqing Medical University in NME, and has been applied in various aspects to assist learning. The findings of this survey may offer some insights into how ChatGPT can be integrated into practical medical education.
Collapse
Affiliation(s)
- Ailin Deng
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
| | - Wenyi Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
| | - Jinjie Dai
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
| | - Liu Jiang
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
| | - Yicai Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
| | - Yuhua Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
| | - Jinyan Jiang
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Maohua Rao
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
| |
Collapse
|
28
|
Daraqel B, Owayda A, Khan H, Koletsi D, Mheissen S. Artificial Intelligence as a Tool for Data Extraction Is Not Fully Reliable Compared to Manual Data Extraction. J Dent 2025:105846. [PMID: 40449825 DOI: 10.1016/j.jdent.2025.105846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2025] [Revised: 04/16/2025] [Accepted: 05/23/2025] [Indexed: 06/03/2025] Open
Abstract
INTRODUCTION Data extraction for systematic reviews is a time-consuming step and prone to errors. OBJECTIVE This study aimed to evaluate the agreement between artificial intelligence and human data extraction methods. METHODS Studies published in seven orthodontic journals between 2019 and 2024 were retrieved and included. Fifteen data sets from each study were extracted manually and using the Microsoft Bing AI-based tool by two independent reviewers. Files in Portable Document Format were uploaded to the AI-based tool, and specific data were requested through its chat feature. The association between the data extraction methods and study characteristics was examined, and agreement was evaluated using intraclass correlation and Kappa statistics. RESULTS A total of 300 orthodontic studies were included. Slight differences between human and AI-based data extraction methods for publication years and study designs were observed, though these were not statistically significant. Minor inconsistencies were also found in the extraction of the number of trial arms and the mean age of participants per group, but these were not significant. The AI-based tool was less effective in extracting variables related to the study design (P = 0.017) and the number of centers (P < 0.001). Agreement between human and AI-based extraction methods ranged from slight (0.16) for the type of study design to moderate (0.45) for study design classification, and substantial to perfect (0.65-1.00) for most other variables. CONCLUSION AI-based data extraction, while effective for straightforward variables, is not fully reliable for complex data extraction. Human input remains essential for ensuring accuracy and completeness in systematic reviews. CLINICAL SIGNIFICANCE AI-based tools can effectively extract straightforward data, potentially reducing the time and effort required for systematic reviews. This can help clinicians and researchers process large volumes of data more efficiently. However, it is important to retain human supervision to maintain the integrity and reliability of clinical evidence.
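The agreement statistics reported, Cohen's kappa for categorical fields and the intraclass correlation for continuous ones, can be computed along these lines; the example data are fabricated placeholders, not study values.

```python
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Categorical field (e.g., study design) extracted by human vs. AI:
human = ["RCT", "cohort", "RCT", "case-control"]
ai = ["RCT", "RCT", "RCT", "case-control"]
print("kappa:", cohen_kappa_score(human, ai))

# Continuous field (e.g., mean participant age) in long format for the ICC:
ages = pd.DataFrame({
    "study": [1, 2, 3, 1, 2, 3],
    "rater": ["human"] * 3 + ["ai"] * 3,
    "age": [24.1, 13.5, 30.2, 24.1, 13.9, 30.2],
})
icc = pg.intraclass_corr(data=ages, targets="study", raters="rater", ratings="age")
print(icc[["Type", "ICC"]])
```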
Collapse
Affiliation(s)
- Baraa Daraqel
- Department of Orthodontics, Oral Health Research and Promotion Unit, Al-Quds University, Jerusalem, Palestine
| | - Amer Owayda
- Private practice, Harmony medical group, Abu Dhabi, United Arab Emirates
| | - Haris Khan
- CMH Institute of Dentistry Lahore, National University of Medical Sciences, Punjab, Pakistan
| | - Despina Koletsi
- Clinic of Orthodontics and Pediatric Dentistry, Center of Dental Medicine, University of Zurich, Zurich, Switzerland; Meta-Research Innovation Center at Stanford (METRICS), Stanford University, California, USA
| | - Samer Mheissen
- Department of Orthodontics, Oral Health Research and Promotion Unit, Al-Quds University, Jerusalem, Palestine; Private practice, SC, USA.
| |
Collapse
|
29
|
Zhu X. Elevating Clinical Practice in Interventional Radiology With Strategic Prompt Engineering. AJR Am J Roentgenol 2025. [PMID: 40434170 DOI: 10.2214/ajr.25.33266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/29/2025]
Affiliation(s)
- Xiaoli Zhu
- The First Affiliated Hospital of Soochow University
| |
Collapse
|
30
|
Tang T, Li X, Lin Y, Liu C. Comparing digital real-time versus virtual simulation systems in dental education for preclinical tooth preparation of molars for metal-ceramic crowns. BMC Oral Health 2025; 25:814. [PMID: 40426144 PMCID: PMC12117957 DOI: 10.1186/s12903-025-06161-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2025] [Accepted: 05/12/2025] [Indexed: 05/29/2025] Open
Abstract
PURPOSE This study aimed to compare the effectiveness of the Real-time Dental Training and Evaluation System (RDTES) and Virtual Simulation System (VSS) with the Traditional Head-Simulator (THS) method in teaching molar preparation for metal-ceramic crowns in preclinical dental education. METHODS Undergraduate students were divided into four groups: No Additional Training (NAT), THS, RDTES, and VSS. The primary outcomes measured were artificial and machine scoring of tooth preparations, with additional anonymous surveys assessing student feedback. RESULTS Both RDTES and VSS groups demonstrated significantly higher tooth preparation scores compared to the NAT group, with RDTES showing superior performance in machine scan scoring. Linear regression analysis revealed a clear positive correlation between scoring and scoring improvement for both artificial and machine assessments. Student surveys indicated RDTES was rated higher in accuracy, feedback quality, skill improvement, and teaching effectiveness. CONCLUSIONS RDTES and VSS significantly enhance students' mastery of molar tooth preparation, with RDTES providing more precise guidance on tooth preparation volume. These systems show broad application prospects and development potential in dental education.
Collapse
Affiliation(s)
- Tianyu Tang
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
| | - Xingxing Li
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
| | - Yunhong Lin
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
| | - Caojie Liu
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China.
- Yunnan Key Laboratory of Stomatology, The Affiliated Stomatology Hospital of Kunming Medical University, Chenggong District, 1168 West Chunrong Road, Yuhua Avenue, Kunming, Yunnan, 650500, People's Republic of China.
| |
Collapse
|
31
|
Kunze KN, Bepple J, Bedi A, Ramkumar PN, Pean CA. Commercial Products Using Generative Artificial Intelligence Include Ambient Scribes, Automated Documentation and Scheduling, Revenue Cycle Management, Patient Engagement and Education, and Prior Authorization Platforms. Arthroscopy 2025:S0749-8063(25)00397-4. [PMID: 40419172 DOI: 10.1016/j.arthro.2025.05.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/06/2025] [Revised: 05/10/2025] [Accepted: 05/10/2025] [Indexed: 05/28/2025]
Abstract
The integration of artificial intelligence (AI) into clinical practice is rapidly transforming healthcare workflows. At the forefront are large language models (LLMs), embedded within commercial and enterprise platforms to optimize documentation, streamline administration, and personalize patient engagement. The evolution of LLMs in healthcare has been driven by rapid advancements in natural language processing (NLP) and deep learning. Emerging commercial products include Ambient Scribes, Automated Documentation and Scheduling, Revenue Cycle Management, Patient Engagement and Education Assistants, and Prior Authorization Platforms. Ambient Scribes remain the leading commercial generative AI product, with approximately 90 platforms in existence to date. Emerging applications may improve provider efficiency and payer-provider alignment by automating the prior authorization process to reduce the manual labor burden placed on clinicians and staff. Current limitations include (1) lack of regulatory oversight, (2) existing biases, (3) inconsistent interoperability with electronic health records (EHRs), and (4) lack of physician and stakeholder buy-in due to low confidence in LLM outputs. Looking forward, ethical, clinical, and operational considerations must be addressed.
Collapse
Affiliation(s)
- Kyle N Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY, USA.
| | | | - Asheesh Bedi
- Department of Orthopaedic Surgery, University of Michigan, Ann Arbor, MI, USA
| | | | - Christian A Pean
- Department of Orthopaedic Surgery, Duke University School of Medicine, Durham, NC, USA
| |
Collapse
|
32
|
Yang Q, Zuo H, Su R, Su H, Zeng T, Zhou H, Wang R, Chen J, Lin Y, Chen Z, Tan T. Dual retrieving and ranking medical large language model with retrieval augmented generation. Sci Rep 2025; 15:18062. [PMID: 40413225 PMCID: PMC12103550 DOI: 10.1038/s41598-025-00724-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2024] [Accepted: 04/30/2025] [Indexed: 05/27/2025] Open
Abstract
Recent advancements in large language models (LLMs) have significantly enhanced text generation across various sectors; however, their medical application faces critical challenges regarding both accuracy and real-time responsiveness. To address these dual challenges, we propose a novel two-step retrieval and ranking retrieval-augmented generation (RAG) framework that synergistically combines embedding search with Elasticsearch technology. Built upon a dynamically updated medical knowledge base incorporating expert-reviewed documents from leading healthcare institutions, our hybrid architecture employs ColBERTv2 for context-aware result ranking while maintaining computational efficiency. Experimental results show a 10% improvement in accuracy for complex medical queries compared to standalone LLM and single-search RAG variants. Latency challenges remain for emergency situations requiring sub-second responses in our experimental setting, although real-time performance could be achieved with more powerful hardware in real-world deployments. This work establishes a new paradigm for reliable medical AI assistants that successfully balances accuracy and practical deployment considerations.
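A rough sketch of the two-step design follows: lexical candidates are pulled from Elasticsearch and then reranked semantically. The index name, field names, and the cosine reranker (standing in for ColBERTv2 late interaction) are illustrative assumptions, not the paper's implementation.

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer, util

es = Elasticsearch("http://localhost:9200")
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def retrieve_candidates(query: str, k: int = 20) -> list[str]:
    """Step 1: lexical candidate retrieval from a hypothetical 'medical_kb' index."""
    hits = es.search(index="medical_kb", query={"match": {"text": query}}, size=k)
    return [h["_source"]["text"] for h in hits["hits"]["hits"]]
    # A full hybrid system would merge these with hits from a vector index.

def rerank(query: str, docs: list[str], k: int = 5) -> list[str]:
    """Step 2: semantic reranking; cosine similarity stands in for ColBERTv2 MaxSim."""
    q = encoder.encode(query, convert_to_tensor=True)
    d = encoder.encode(docs, convert_to_tensor=True)
    scores = util.cos_sim(q, d)[0]
    return [docs[int(i)] for i in scores.argsort(descending=True)[:k]]

top = rerank("first-line management of diabetic ketoacidosis",
             retrieve_candidates("diabetic ketoacidosis"))
# `top` would be injected into the LLM prompt as grounding context.
```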
Collapse
Affiliation(s)
- Qimin Yang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Huan Zuo
- School of Public Health, University of South China, Hengyang, China
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Runqi Su
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Hanyinghong Su
- School of Public Health, University of South China, Hengyang, China
| | - Tangyi Zeng
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Huimei Zhou
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Rongsheng Wang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Jiexin Chen
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Yijun Lin
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Zhiyi Chen
- School of Public Health, University of South China, Hengyang, China.
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China.
- Key Laboratory of Medical Imaging Precision Theranostics and Radiation Protection, College of Hunan Province, The Affiliated Changsha Central Hospital, University of South China, Changsha, China.
- Department of Medical Imaging, The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China.
| | - Tao Tan
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China.
| |
Collapse
|
33
|
McInnis MG, Coleman B, Hurwitz E, Robinson PN, Williams AE, Haendel MA, McMurry JA. Integrating Knowledge: The Power of Ontologies in Psychiatric Research and Clinical Informatics. Biol Psychiatry 2025:S0006-3223(25)01213-2. [PMID: 40414449 DOI: 10.1016/j.biopsych.2025.05.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/07/2025] [Revised: 05/07/2025] [Accepted: 05/14/2025] [Indexed: 05/27/2025]
Abstract
Ontologies are structured frameworks for representing knowledge by systematically defining concepts, categories, and their relationships. While widely adopted in biomedicine, ontologies remain largely absent in mental health research and clinical care, where the field continues to rely heavily on existing classification systems such as the DSM. Although useful for clinical communication and administrative purposes, these systems lack the semantic structure and the computational and reasoning properties needed to integrate diverse data sources or support artificial intelligence (AI)-enabled analysis. This reliance on classification systems limits efforts to analyze and interpret complex, heterogeneous psychiatric data. In mood disorders, particularly bipolar disorder, the lack of formalized semantic models contributes to diagnostic inconsistencies, fragmented data structures, and barriers to precision medicine. Ontologies, by contrast, provide a standardized, machine-readable foundation for linking multimodal data sources, such as electronic health records (EHRs), genetic and neuroimaging data, and social determinants of health, while enabling secure, de-identified computation. This review surveys the current landscape of mental health ontologies and highlights the Human Phenotype Ontology (HPO) as a promising framework for bridging psychiatric and medical phenotypes. We describe ongoing efforts to enhance HPO through curated psychiatric terms, refined definitions, and structured mappings of observed phenomena. The Global Bipolar Cohort (GBC), an international collaboration, exemplifies this approach through the development of a consensus-driven ontology tailored to bipolar disorder. By supporting semantic interoperability, reproducible research, and individualized care, ontology-based approaches provide essential infrastructure for overcoming the limitations of classification systems and advancing data-driven precision psychiatry.
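As a concrete illustration of what a machine-readable ontology affords, the sketch below loads the HPO with the pronto library and walks one term's ancestors; the traversal is a toy example, not part of the review.

```python
import pronto

# Downloads the current HPO release from the OBO library.
hpo = pronto.Ontology.from_obo_library("hp.obo")

term = hpo["HP:0000708"]  # "Behavioral abnormality", a psychiatric-relevant branch
print(term.id, term.name)

# Walk the is-a hierarchy upward; this structure is what enables computational
# reasoning across phenotypes, unlike a flat classification code.
for ancestor in term.superclasses(with_self=False):
    print(" ->", ancestor.id, ancestor.name)
```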
Collapse
Affiliation(s)
| | - Ben Coleman
- University of Connecticut, Farmington, CT, USA
| | - Eric Hurwitz
- University of North Carolina, Chapel Hill, NC, USA
| | - Peter N Robinson
- University of Connecticut, Farmington, CT, USA; Berlin Institute of Health at Charité, Berlin, Germany
| | | | | | | |
Collapse
|
34
|
Chen YC, Lee SH, Sheu H, Lin SH, Hu CC, Fu SC, Yang CP, Lin YC. Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty. BMC Med Inform Decis Mak 2025; 25:196. [PMID: 40410756 PMCID: PMC12102839 DOI: 10.1186/s12911-025-03024-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2025] [Accepted: 05/09/2025] [Indexed: 05/25/2025] Open
Abstract
BACKGROUND The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts. METHODS Four leading LLMs-GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus-were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question-only), and one under role-playing prompting (instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons evaluated responses for accuracy and comprehensiveness on a 5-point Likert scale, along with a binary measure for acceptability. Statistical analyses (Wilcoxon rank sum and Chi-squared tests; P < 0.05) were conducted to compare model performance. RESULTS ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, 72.5%, respectively). Google Gemini and Claude 3 Opus demonstrated lower performance across all metrics. In between-model comparisons based on zero-shot prompting, ChatGPT-4 achieved significantly higher scores of both accuracy and comprehensiveness relative to Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002), and demonstrated higher acceptability than Claude 3 Opus (P = 0.006). Within-model comparisons showed role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude. CONCLUSIONS This study demonstrates that role-playing prompts significantly enhance the performance of LLMs, particularly for ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4, with role-playing prompts, showed superior performance in terms of accuracy, comprehensiveness, and acceptability. Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice. CLINICAL TRIAL NUMBER Not applicable.
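The two prompting conditions compared here differ only in the presence of a role-playing system message, along these lines; the role wording is an illustrative assumption, not the study's exact script.

```python
QUESTION = "How long will I need physical therapy after total knee arthroplasty?"

# Zero-shot condition: the question is sent on its own.
zero_shot = [
    {"role": "user", "content": QUESTION},
]

# Role-playing condition: a system message assigns the surgeon persona first.
role_playing = [
    {"role": "system", "content": (
        "You are an experienced orthopaedic surgeon specializing in total "
        "knee arthroplasty. Answer patient questions accurately and fully."
    )},
    {"role": "user", "content": QUESTION},
]
# Each message list is sent to the same model, so only the prompt differs.
```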
Collapse
Affiliation(s)
- Yi-Chen Chen
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Sheng-Hsun Lee
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Huan Sheu
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Sheng-Hsuan Lin
- Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Institute of Data Science and Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Applied Mathematics, National Dong Hwa University, Hualien, Taiwan
- Department of Biochemical and Molecular Medical Sciences, National Dong Hwa University, Hualien, Taiwan
| | - Chih-Chien Hu
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Shih-Chen Fu
- Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biochemical and Molecular Medical Sciences, National Dong Hwa University, Hualien, Taiwan
| | - Cheng-Pang Yang
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.
| | - Yu-Chih Lin
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.
| |
Collapse
|
35
|
Khan M, Ahuja K, Tsirikos AI. AI and machine learning in paediatric spine deformity surgery. Bone Jt Open 2025; 6:569-581. [PMID: 40407025 PMCID: PMC12100669 DOI: 10.1302/2633-1462.65.bjo-2024-0089.r1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 05/26/2025] Open
Abstract
Paediatric spine deformity surgery is a high-stakes procedure that demands exceptional anatomical knowledge and precise visuospatial awareness from the surgeon. Demand for precision medicine is increasing, and rapid advancements in computational technology, most recently the explosion of AI and machine learning (ML), have made it attainable. We present the surgical and ethical applications of AI and ML in diagnosis, prognosis, image processing, and outcomes in the field of paediatric spine deformity.
Collapse
Affiliation(s)
- Mohsin Khan
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| | - Kaustubh Ahuja
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| | - Athanasios I Tsirikos
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| |
Collapse
|
36
|
Mao C, Li J, Pang PCI, Zhu Q, Chen R. Identifying Kidney Stone Risk Factors Through Patient Experiences With a Large Language Model: Text Analysis and Empirical Study. J Med Internet Res 2025; 27:e66365. [PMID: 40403294 DOI: 10.2196/66365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2024] [Revised: 12/16/2024] [Accepted: 04/10/2025] [Indexed: 05/24/2025] Open
Abstract
BACKGROUND Kidney stones, a prevalent urinary disease, pose significant health risks. Factors like insufficient water intake or a high-protein diet increase an individual's susceptibility to the disease. Social media platforms can be a valuable avenue for users to share their experiences in managing these risk factors. Analyzing such patient-reported information can provide crucial insights into risk factors, potentially leading to improved quality of life for other patients. OBJECTIVE This study aims to develop KSrisk-GPT, a model based on a large language model (LLM), to identify potential kidney stone risk factors from web-based user experiences. METHODS This study collected data on the topic of kidney stones posted on Zhihu over the past 5 years, yielding 11,819 user comments. Experts organized the most common risk factors for kidney stones into six categories. We then used least-to-most prompting, a form of chain-of-thought prompting, to enable GPT-4.0 to reason like an expert, and asked it to identify risk factors from the comments. Metrics including accuracy, precision, recall, and F1-score were used to evaluate the performance of the model. RESULTS Our proposed method outperforms other models in identifying comments containing risk factors, with an accuracy and F1-score of 95.9%, a precision of 95.6%, and a recall of 96.2%. Among the 863 comments identified as containing risk factors, the most frequently mentioned risk factors for kidney stones in Zhihu user discussions included dietary habits (high protein and high calcium intake), insufficient water intake, genetic factors, and lifestyle. In addition, GPT discovered new potential risk factors, such as excessive use of supplements like vitamin C and calcium, laxatives, and hyperparathyroidism. CONCLUSIONS Comments from social media users offer a new data source for disease prevention and understanding patient journeys. Our method sheds light not only on using LLMs to efficiently summarize risk factors from social media data but also on LLMs' potential to identify new factors from the patient's perspective.
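Least-to-most prompting decomposes the extraction into progressively harder subquestions whose answers feed the next step; the sketch below illustrates that pattern with hypothetical subquestion wording and category labels.

```python
CATEGORIES = ["dietary habits", "water intake", "genetics", "lifestyle",
              "medication/supplements", "comorbidities"]

def build_prompts(comment: str) -> list[str]:
    """Least-to-most decomposition: answer each step before asking the next."""
    step1 = ("Does the following comment describe a personal experience "
             f"with kidney stones?\n{comment}")
    step2 = ("Given that it does, list any behaviours or conditions the "
             f"author links to their stones.\n{comment}")
    step3 = ("Map each behaviour you listed to one of these risk-factor "
             f"categories, or say 'new factor': {', '.join(CATEGORIES)}")
    return [step1, step2, step3]  # sent to the LLM in order, appending each answer

for p in build_prompts("I barely drink water at work and got my second stone this year."):
    print(p, end="\n\n")
```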
Affiliation(s)
- Chao Mao: MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao
- Jiaxuan Li: MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao
- Patrick Cheong-Iao Pang: MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao
- Quanjing Zhu: Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Rong Chen: Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-Sen University, Guangzhou, China
37
Ma J, Yu J, Xie A, Huang T, Liu W, Ma M, Tao Y, Zang F, Zheng Q, Zhu W, Chen Y, Ning M, Zhu Y. Large language model evaluation in autoimmune disease clinical questions comparing ChatGPT 4o, Claude 3.5 Sonnet and Gemini 1.5 pro. Sci Rep 2025; 15:17635. [PMID: 40399509 PMCID: PMC12095533 DOI: 10.1038/s41598-025-02601-y]
Abstract
Large language models (LLMs) have established a presence in providing medical services to patients and supporting clinical practice for doctors. To explore the ability of LLMs to answer clinical questions related to autoimmune diseases, this study used 65 such questions covering five domains: concepts, report interpretation, diagnosis, prevention and treatment, and prognosis. The diseases covered include Sjögren's syndrome, systemic lupus erythematosus, rheumatoid arthritis, systemic sclerosis, and others. These questions were answered by three LLMs: ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The responses were then evaluated by 8 clinicians on criteria including relevance, completeness, accuracy, safety, readability, and simplicity. We analyzed the scores of the three LLMs across the five domains and six dimensions and compared their accuracy on the report interpretation section with that of two senior doctors and two junior doctors. The results showed that the performance of the three LLMs in the evaluation of autoimmune diseases significantly surpassed that of both junior and senior doctors. Notably, Claude 3.5 Sonnet excelled in providing comprehensive and accurate responses to clinical questions on autoimmune diseases, demonstrating the great potential of LLMs to assist doctors in the diagnosis, treatment, and management of autoimmune diseases.
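A study of this shape (multiple raters scoring several models on ordinal dimensions) reduces to a small aggregation-and-testing workflow. The sketch below uses invented toy scores, and the Friedman test is our choice of a standard omnibus test for three models rated on the same items; the paper does not state which test it used.

```python
# Toy sketch: aggregate rater scores and run an omnibus test.
import pandas as pd
from scipy.stats import friedmanchisquare

# One row per (question, model) mean rating on one dimension (toy values).
ratings = pd.DataFrame({
    "model": ["ChatGPT 4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"] * 4,
    "question": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "score": [4, 5, 3, 5, 5, 4, 4, 4, 3, 5, 5, 4],
})

# Mean score per model (a single dimension here; add a 'dimension'
# column and group by both keys for the full 5-domain x 6-dimension view).
print(ratings.groupby("model")["score"].mean())

# Friedman test: do the three models differ on the same questions?
wide = ratings.pivot(index="question", columns="model", values="score")
stat, p = friedmanchisquare(wide["ChatGPT 4o"],
                            wide["Claude 3.5 Sonnet"],
                            wide["Gemini 1.5 Pro"])
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")
```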
Affiliation(s)
- All authors: Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
- Mingzhe Ning (additionally): Yizheng Hospital of Nanjing Drum Tower Hospital Group, Yizheng, Jiangsu, China
38
Alter IL, Dias C, Briano J, Rameau A. Digital health technologies in swallowing care from screening to rehabilitation: A narrative review. Auris Nasus Larynx 2025; 52:319-326. [PMID: 40403345 DOI: 10.1016/j.anl.2025.05.002]
Abstract
OBJECTIVES Digital health technologies (DHTs) have advanced rapidly in the past two decades through developments in mobile and wearable devices and, most recently, with the explosion of artificial intelligence (AI) capabilities and their extension into the health space. DHTs have myriad potential applications in deglutology, many of which have undergone promising investigation and development in recent years. We present the first literature review on applications of DHT in swallowing health, from screening to therapeutics. Public health interventions for swallowing care are increasingly needed as populations age in Western countries and East Asia, and DHT may offer a scalable, low-cost solution. METHODS A narrative review was performed using PubMed and Google Scholar to identify recent research on applications of AI and digital health in swallowing practice. Database searches, conducted in September 2024, included terms such as "digital," "AI," "machine learning," and "tools" in combination with "deglutition," "Otolaryngology," "Head and Neck," "speech language pathology," "swallow," and "dysphagia." Primary literature pertaining to digital health in deglutology was included for review. RESULTS We review the various applications of DHT in swallowing care, including prevention, screening, diagnosis, treatment planning, and rehabilitation. CONCLUSION DHT may offer innovative and scalable solutions for swallowing care as public health needs grow and the specialized healthcare workforce remains limited. These technological advances are also being explored as time- and resource-saving solutions at many points of care in swallowing practice. DHT could bring affordable and accurate information for self-management of dysphagia to broader patient populations that otherwise lack access to expert providers.
Affiliation(s)
- All authors: Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA
39
Bai X, Wang S, Zhao Y, Feng M, Ma W, Liu X. Application of AI Chatbot in Responding to Asynchronous Text-Based Messages From Patients With Cancer: Comparative Study. J Med Internet Res 2025; 27:e67462. [PMID: 40397947 DOI: 10.2196/67462]
Abstract
BACKGROUND Telemedicine, which incorporates artificial intelligence such as chatbots, offers significant potential for enhancing health care delivery. However, the efficacy of artificial intelligence chatbots compared to human physicians in clinical settings remains underexplored, particularly in complex scenarios involving patients with cancer and asynchronous text-based interactions. OBJECTIVE This study aimed to evaluate the performance of the GPT-4 (OpenAI) chatbot in responding to asynchronous text-based medical messages from patients with cancer by comparing its responses with those of physicians across two clinical scenarios: patient education and medical decision-making. METHODS We collected 4257 deidentified asynchronous text-based medical consultation records from 17 oncologists across China between January 1, 2020, and March 31, 2024. Each record included patient questions, demographic data, and disease-related details. The records were categorized into two scenarios: patient education (eg, symptom explanations and test interpretations) and medical decision-making (eg, treatment planning). The GPT-4 chatbot was used to simulate physician responses to these records, with each session conducted in a new conversation to avoid cross-session interference. The chatbot responses, along with the original physician responses, were evaluated by a medical review panel (3 oncologists) and a patient panel (20 patients with cancer). The medical panel assessed completeness, accuracy, and safety using a 3-level scale, whereas the patient panel rated completeness, trustworthiness, and empathy on a 5-point ordinal scale. Statistical analyses included chi-square tests for categorical variables and Wilcoxon signed-rank tests for ordinal ratings. RESULTS In the patient education scenario (n=2364), the chatbot scored higher than physicians in completeness (n=2301, 97.34% vs n=2213, 93.61% for fully complete responses; P=.002), with no significant differences in accuracy or safety (P>.05). In the medical decision-making scenario (n=1893), the chatbot exhibited lower accuracy (n=1834, 96.88% vs n=1855, 97.99% for fully accurate responses; P<.001) and trustworthiness (n=860, 50.71% vs n=1766, 93.29% rated as "Moderately trustworthy" or higher; P<.001) compared with physicians. Regarding empathy, the medical review panel rated the chatbot as demonstrating higher empathy scores across both scenarios, whereas the patient review panel reached the opposite conclusion, consistently favoring physicians in empathetic communication. Errors in chatbot responses were primarily due to misinterpretations of medical terminology or the lack of updated guidelines, with 3.12% (59/1893) of its responses potentially leading to adverse outcomes, compared with 2.01% (38/1893) for physicians. CONCLUSIONS The GPT-4 chatbot performs comparably to physicians in patient education by providing comprehensive and empathetic responses. However, its reliability in medical decision-making remains limited, particularly in complex scenarios requiring nuanced clinical judgment. These findings underscore the chatbot's potential as a supplementary tool in telemedicine while highlighting the need for physician oversight to ensure patient safety and accuracy.
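The two test families named in the methods are easy to reproduce on data of this shape. In the sketch below, the chi-square example reuses the completeness counts reported above, while the Wilcoxon example uses invented paired ordinal ratings; it illustrates the mechanics rather than reproducing the paper's exact P values.

```python
# Toy sketch of the two named test families.
import numpy as np
from scipy.stats import chi2_contingency, wilcoxon

# Chi-square on a 2x2 table of fully-complete vs not-complete responses,
# using the patient-education counts reported above.
table = np.array([[2301, 2364 - 2301],    # chatbot
                  [2213, 2364 - 2213]])   # physicians
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2g}")

# Wilcoxon signed-rank on paired ordinal ratings (1-5) of the same
# records by the two sources (invented values).
chatbot   = np.array([4, 5, 3, 4, 4, 5, 2, 4, 3, 5])
physician = np.array([5, 5, 4, 4, 5, 4, 3, 5, 4, 5])
stat, p = wilcoxon(chatbot, physician)
print(f"Wilcoxon W={stat:.1f}, p={p:.3f}")
```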
Affiliation(s)
- Xuexue Bai: Department of Neurosurgery, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China; Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
- Shiyong Wang: Department of Neurosurgery, First Affiliated Hospital of Jinan University, Guangzhou, China
- Yuanli Zhao: Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
- Ming Feng: Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
- Wenbin Ma: Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
- Xiaomin Liu: Head and Neck Neuro-Oncology Center, Tianjin Huanhu Hospital, Tianjin, China
40
Andras D, Ilies RA, Esanu V, Agoston S, Marginean Jumate TF, Dindelegan GC. Artificial Intelligence as a Potential Tool for Predicting Surgical Margin Status in Early Breast Cancer Using Mammographic Specimen Images. Diagnostics (Basel) 2025; 15:1276. [PMID: 40428269 PMCID: PMC12109882 DOI: 10.3390/diagnostics15101276]
Abstract
Background/Objectives: Breast cancer is the most common malignancy among women globally, with an increasing incidence, particularly in younger populations. Achieving complete surgical excision is essential to reduce recurrence. Artificial intelligence (AI), including large language models like ChatGPT, has potential for supporting diagnostic tasks, though its role in surgical oncology remains limited. Methods: This retrospective study evaluated ChatGPT's performance (ChatGPT-4, OpenAI, March 2025) in predicting surgical margin status (R0 or R1) based on intraoperative mammograms of lumpectomy specimens. AI-generated responses were compared with histopathological findings. Performance was evaluated using sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), F1 score, and Cohen's kappa coefficient. Results: Out of a total of 100 patients, ChatGPT achieved an accuracy of 84.0% in predicting surgical margin status. Sensitivity for identifying R1 cases (incomplete excision) was 60.0%, while specificity for R0 (complete excision) was 86.7%. The positive predictive value (PPV) was 33.3%, and the negative predictive value (NPV) was 95.1%. The F1 score for R1 classification was 0.43, and Cohen's kappa coefficient was 0.34, indicating moderate agreement with histopathological findings. Conclusions: ChatGPT demonstrated moderate accuracy in confirming complete excision but showed limited reliability in identifying incomplete margins. While promising, these findings emphasize the need for domain-specific training and further validation before such models can be implemented in clinical breast cancer workflows.
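All of the reported metrics follow from a single 2x2 confusion matrix. The sketch below reconstructs one matrix that is consistent with the reported figures (it is not taken from the paper) and recomputes each metric, including Cohen's kappa, from first principles.

```python
# A 2x2 confusion matrix consistent with the reported figures
# (n=100; reconstructed for illustration, not taken from the paper).
tp, fp, fn, tn = 6, 12, 4, 78          # positive class = R1 (incomplete excision)

sensitivity = tp / (tp + fn)           # 0.600
specificity = tn / (tn + fp)           # 0.867
ppv = tp / (tp + fp)                   # 0.333
npv = tn / (tn + fn)                   # 0.951
accuracy = (tp + tn) / 100             # 0.840
f1 = 2 * tp / (2 * tp + fp + fn)       # 0.429

# Cohen's kappa: observed agreement corrected for chance agreement.
po = accuracy
pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / 100**2   # 0.756
kappa = (po - pe) / (1 - pe)           # 0.344, i.e., moderate agreement
print(sensitivity, specificity, ppv, npv, accuracy, f1, round(kappa, 2))
```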
Affiliation(s)
- David Andras: Department of General Surgery, Iuliu Hatieganu University of Medicine and Pharmacy, 400006 Cluj-Napoca, Romania; First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
- Radu Alexandru Ilies: Faculty of Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 400012 Cluj-Napoca, Romania
- Victor Esanu: First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
- Stefan Agoston: Faculty of Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 400012 Cluj-Napoca, Romania
- Tudor Florin Marginean Jumate: Faculty of Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 400012 Cluj-Napoca, Romania
- George Calin Dindelegan: Department of General Surgery, Iuliu Hatieganu University of Medicine and Pharmacy, 400006 Cluj-Napoca, Romania; First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
41
Shashikumar SP, Mohammadi S, Krishnamoorthy R, Patel A, Wardi G, Ahn JC, Singh K, Aronoff-Spencer E, Nemati S. Development and prospective implementation of a large language model based system for early sepsis prediction. NPJ Digit Med 2025; 8:290. [PMID: 40379845 PMCID: PMC12084535 DOI: 10.1038/s41746-025-01689-w]
Abstract
Sepsis is a dysregulated host response to infection with high mortality and morbidity. Early detection and intervention have been shown to improve patient outcomes, but existing computational models relying on structured electronic health record data often miss contextual information from unstructured clinical notes. This study introduces COMPOSER-LLM, an open-source large language model (LLM) integrated with the COMPOSER model to enhance early sepsis prediction. For high-uncertainty predictions, the LLM extracts additional context to assess sepsis-mimics, improving accuracy. Evaluated on 2500 patient encounters, COMPOSER-LLM achieved a sensitivity of 72.1%, positive predictive value of 52.9%, F1 score of 61.0%, and 0.0087 false alarms per patient hour, outperforming the standalone COMPOSER model. Prospective validation yielded similar results. Manual chart review found that 62% of false positives had bacterial infections, demonstrating potential clinical utility. Our findings suggest that integrating LLMs with traditional models can enhance predictive performance by leveraging unstructured data, representing a significant advance in healthcare analytics.
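The core design, consulting the LLM only when the structured-data model is uncertain, can be captured in a few lines of control flow. In this sketch every component function and the uncertainty band are hypothetical placeholders; the paper's actual thresholds and interfaces are not given in the abstract.

```python
# Control-flow sketch of an uncertainty-gated hybrid predictor in the
# spirit of COMPOSER-LLM. All components are hypothetical placeholders.

LOW, HIGH = 0.4, 0.6   # illustrative uncertainty band, not the paper's

def composer_score(structured_features) -> float:
    """Sepsis risk from the structured-data model (placeholder)."""
    raise NotImplementedError

def llm_assess_notes(clinical_notes: str) -> bool:
    """Ask the LLM whether the notes suggest sepsis or a sepsis-mimic
    (placeholder for the unstructured-context extraction step)."""
    raise NotImplementedError

def predict_sepsis(structured_features, clinical_notes: str) -> bool:
    score = composer_score(structured_features)
    if score >= HIGH:      # confident positive: alarm
        return True
    if score <= LOW:       # confident negative: no alarm
        return False
    # Uncertain band: pull unstructured context before alarming,
    # which is how false alarms per patient-hour are reduced.
    return llm_assess_notes(clinical_notes)
```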
Affiliation(s)
- Sina Mohammadi: Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
- Avi Patel: Department of Emergency Medicine, UC San Diego, San Diego, CA, USA
- Gabriel Wardi: Department of Emergency Medicine, UC San Diego, San Diego, CA, USA; Division of Pulmonary, Critical Care and Sleep Medicine, UC San Diego, San Diego, CA, USA
- Joseph C Ahn: Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA; Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA
- Karandeep Singh: Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA; Jacobs Center for Health Innovation, UC San Diego Health, San Diego, CA, USA
- Eliah Aronoff-Spencer: Division of Infectious Diseases and Global Public Health, UC San Diego, San Diego, CA, USA
- Shamim Nemati: Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
42
Kanani MM, Monawer A, Brown L, King WE, Miller ZD, Venugopal N, Heagerty PJ, Jarvik JG, Cohen T, Cross NM. High-Performance Prompting for LLM Extraction of Compression Fracture Findings from Radiology Reports. J Imaging Inform Med 2025. [PMID: 40379860 DOI: 10.1007/s10278-025-01530-6]
Abstract
Extracting information from radiology reports can provide critical data to empower many radiology workflows. For spinal compression fractures, these data can facilitate evidence-based care for at-risk populations. Manual extraction from free-text reports is laborious and error-prone. Large language models (LLMs) have shown promise; however, fine-tuning strategies to optimize performance on specific tasks can be resource intensive, and a variety of prompting strategies have achieved similar results with fewer demands. Our study pioneers the use of Meta's Llama 3.1, together with prompt-based strategies, for automated extraction of compression fracture findings from free-text radiology reports, outputting structured data without model training. We tested performance on a time-based sample of CT exams covering the spine, acquired across our healthcare enterprise from 2/20/2024 to 2/22/2024 (637 anonymized reports; ages 18-102; 47% female). Ground truth annotations were manually generated and compared against the performance of three models (Llama 3.1 70B, Llama 3.1 8B, and Vicuna 13B) with nine different prompting configurations, for a total of 27 model/prompt experiments. The highest F1 score (0.91) was achieved by the 70B Llama 3.1 model when provided with a radiologist-written background, with similar results when the background was written by a separate LLM (0.86). The addition of few-shot examples to these prompts had a variable impact on F1 measurements (0.89 and 0.84, respectively). Comparable ROC-AUC and PR-AUC performance was observed. Our work demonstrated that an open-weights LLM excelled at extracting compression fracture findings from free-text radiology reports using prompt-based techniques, without requiring extensive manually labeled examples for model training.
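The prompting grid described here (background source crossed with few-shot examples) is straightforward to express in code. The sketch below is illustrative only: the task instruction, background texts, and example are invented, and it covers 6 of the 9 reported configurations.

```python
# Illustrative prompt grid: background source x few-shot examples.
# All texts are invented placeholders, not the study's prompts.

TASK = ("From the CT report below, state whether any vertebral "
        "compression fracture is described. Answer as JSON: "
        '{"fracture": true/false, "levels": [...]}')

BACKGROUNDS = {
    "none": "",
    "radiologist_written": "Background: compression fractures appear as "
        "loss of vertebral body height; reports may say 'wedge deformity' "
        "or 'height loss'.",
    "llm_written": "Background (drafted by a separate LLM): ...",
}

FEW_SHOT = ('Example report: "...mild wedge deformity of L1..."\n'
            'Example answer: {"fracture": true, "levels": ["L1"]}')

def build_prompt(report: str, background: str, few_shot: bool) -> str:
    parts = [BACKGROUNDS[background]]
    if few_shot:
        parts.append(FEW_SHOT)
    parts += [TASK, f"Report:\n{report}"]
    return "\n\n".join(p for p in parts if p)

# 3 backgrounds x 2 shot settings = 6 of the 9 reported configurations.
print(build_prompt("No acute fracture.", "radiologist_written", True))
```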
Affiliation(s)
- Arezu Monawer: Department of Radiology, University of Washington, Seattle, WA, USA
- Lauryn Brown: Department of Radiology, University of Washington, Seattle, WA, USA
- William E King: Department of Radiology, University of Washington, Seattle, WA, USA
- Zachary D Miller: Department of Radiology, University of Washington, Seattle, WA, USA
- Nitin Venugopal: Department of Radiology, University of Washington, Seattle, WA, USA
- Jeffrey G Jarvik: Department of Radiology, University of Washington, Seattle, WA, USA
- Trevor Cohen: Department of Biomedical Informatics, University of Washington, Seattle, WA, USA
- Nathan M Cross: Department of Radiology, University of Washington, Seattle, WA, USA
43
Omar M, Agbareia R, Glicksberg BS, Nadkarni GN, Klang E. Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study. JMIR Med Inform 2025; 13:e66917. [PMID: 40378406 PMCID: PMC12101789 DOI: 10.2196/66917]
Abstract
Background The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored. Objective This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses. Methods We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests. Results The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model-GPT-4o-had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model-Qwen2-7B-showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003). Conclusions Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.
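The study's two analyses, a correlation between per-model confidence and accuracy and a comparison of confidence on correct versus incorrect answers, look roughly like the sketch below. All numbers are toy values chosen to mimic the reported inverse relationship, not the study's data.

```python
# Toy sketch of both analyses; values are invented.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

# One (accuracy, mean confidence on correct answers) point per model.
accuracy   = np.array([0.74, 0.62, 0.55, 0.46])
confidence = np.array([0.63, 0.70, 0.72, 0.76])
r, p = pearsonr(confidence, accuracy)
print(f"r={r:.2f}, p={p:.3f}")        # negative r: weaker models more confident

# Within one model: does confidence separate correct from incorrect?
rng = np.random.default_rng(0)
conf_correct   = rng.normal(0.68, 0.10, 500).clip(0, 1)
conf_incorrect = rng.normal(0.63, 0.10, 300).clip(0, 1)
t, p = ttest_ind(conf_correct, conf_incorrect)
print(f"gap={conf_correct.mean() - conf_incorrect.mean():.3f}, p={p:.2g}")
```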
Affiliation(s)
- Mahmud Omar, Benjamin S Glicksberg, Girish N Nadkarni, and Eyal Klang: Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, Gustave L. Levy Place, New York, NY 10029, United States
- Reem Agbareia: Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
44
Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, Bensahla A, Bjelogrlic M, Lovis C. Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. J Med Internet Res 2025; 27:e68998. [PMID: 40371947 PMCID: PMC12123242 DOI: 10.2196/68998]
Abstract
BACKGROUND Information overload in electronic health records requires effective solutions to alleviate clinicians' administrative tasks. Automatically summarizing clinical text has gained significant attention with the rise of large language models. While individual studies show optimism, a structured overview of the research landscape is lacking. OBJECTIVE This study aims to present the current state of the art on clinical text summarization using large language models, evaluate the level of evidence in existing research, and assess the applicability of performance findings in clinical settings. METHODS This scoping review complied with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Literature published between January 1, 2019, and June 18, 2024, was identified from 5 databases: PubMed, Embase, Web of Science, IEEE Xplore, and ACM Digital Library. Studies were excluded if they did not describe transformer-based models, did not focus on clinical text summarization, did not engage with free-text data, were not original research, were nonretrievable, were not peer-reviewed, or were not in English, French, Spanish, or German. Data related to study context and characteristics, scope of research, and evaluation methodologies were systematically collected and analyzed by 3 authors independently. RESULTS A total of 30 original studies were included in the analysis. All used observational retrospective designs, mainly using real patient data (n=28, 93%). The research landscape demonstrated a narrow research focus, often centered on summarizing radiology reports (n=17, 57%), primarily involving data from the intensive care unit (n=15, 50%) of US-based institutions (n=19, 73%), in English (n=26, 87%). This focus aligned with the frequent reliance on the open-source Medical Information Mart for Intensive Care dataset (n=15, 50%). Summarization methodologies predominantly involved abstractive approaches (n=17, 57%) on single-document inputs (n=4, 13%) with unstructured data (n=13, 43%), yet reporting on methodological details remained inconsistent across studies. Model selection involved both open-source models (n=26, 87%) and proprietary models (n=7, 23%). Evaluation frameworks were highly heterogeneous. All studies conducted internal validation, but external validation (n=2, 7%), failure analysis (n=6, 20%), and patient safety risks analysis (n=1, 3%) were infrequent, and none reported bias assessment. Most studies used both automated metrics and human evaluation (n=16, 53%), while 10 (33%) used only automated metrics and 4 (13%) only human evaluation. CONCLUSIONS Key barriers hinder the translation of current research into trustworthy, clinically valid applications. Current research remains exploratory and limited in scope, with many applications yet to be explored. Performance assessments often lack reliability, and clinical impact evaluations are insufficient, raising concerns about model utility, safety, fairness, and data privacy. Advancing the field requires more robust evaluation frameworks, a broader research scope, and a stronger focus on real-world applicability.
Affiliation(s)
- Lydie Bednarczyk: Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Daniel Reichenpfader: Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Amon Kenna Ette: Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Jamil Zaghir: Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Yuanyuan Zheng: Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Adel Bensahla: Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Mina Bjelogrlic: Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christian Lovis: Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
45
Wang C, Wang F, Li S, Ren QW, Tan X, Fu Y, Liu D, Qian G, Cao Y, Yin R, Li K. Patient Triage and Guidance in Emergency Departments Using Large Language Models: Multimetric Study. J Med Internet Res 2025; 27:e71613. [PMID: 40374171 PMCID: PMC12123234 DOI: 10.2196/71613]
Abstract
BACKGROUND Emergency departments (EDs) face significant challenges due to overcrowding, prolonged waiting times, and staff shortages, leading to increased strain on health care systems. Efficient triage systems and accurate departmental guidance are critical for alleviating these pressures. Recent advancements in large language models (LLMs), such as ChatGPT, offer potential solutions for improving patient triage and outpatient department selection in emergency settings. OBJECTIVE The study aimed to assess the accuracy, consistency, and feasibility of GPT-4-based ChatGPT models (GPT-4o and GPT-4-Turbo) for patient triage using the Modified Early Warning Score (MEWS) and evaluate GPT-4o's ability to provide accurate outpatient department guidance based on simulated patient scenarios. METHODS A 2-phase experimental study was conducted. In the first phase, 2 ChatGPT models (GPT-4o and GPT-4-Turbo) were evaluated for MEWS-based patient triage accuracy using 1854 simulated patient scenarios. Accuracy and consistency were assessed before and after prompt engineering. In the second phase, GPT-4o was tested for outpatient department selection accuracy using 264 scenarios sourced from the Chinese Medical Case Repository. Each scenario was independently evaluated by GPT-4o thrice. Data analyses included Wilcoxon tests, Kendall correlation coefficients, and logistic regression analyses. RESULTS In the first phase, ChatGPT's triage accuracy, based on MEWS, improved following prompt engineering. Interestingly, GPT-4-Turbo outperformed GPT-4o. GPT-4-Turbo achieved an accuracy of 100% compared to GPT-4o's accuracy of 96.2%, despite GPT-4o initially showing better performance prior to prompt engineering. This finding suggests that GPT-4-Turbo may be more adaptable to prompt optimization. In the second phase, GPT-4o, with superior performance on emotional responsiveness compared to GPT-4-Turbo, demonstrated an overall guidance accuracy of 92.63% (95% CI 90.34%-94.93%), with the highest accuracy in internal medicine (93.51%, 95% CI 90.85%-96.17%) and the lowest in general surgery (91.46%, 95% CI 86.50%-96.43%). CONCLUSIONS ChatGPT demonstrated promising capability for supporting patient triage and outpatient guidance in EDs. GPT-4-Turbo showed greater adaptability to prompt engineering, whereas GPT-4o exhibited superior responsiveness and emotional interaction, which are essential for patient-facing tasks. Future studies should explore real-world implementation and address the identified limitations to enhance ChatGPT's clinical integration.
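For reference, MEWS itself is a small lookup table over vital signs. The function below implements one common scheme (after Subbe et al., 2001); institutions vary the cutoffs, so the study's exact implementation may differ.

```python
# One common MEWS scheme; thresholds are illustrative and vary by site.
def mews(sbp, hr, rr, temp_c, avpu):
    score = 0
    # Systolic blood pressure (mmHg)
    if sbp <= 70: score += 3
    elif sbp <= 80: score += 2
    elif sbp <= 100: score += 1
    elif sbp >= 200: score += 2
    # Heart rate (bpm)
    if hr < 40: score += 2
    elif hr <= 50: score += 1
    elif hr <= 100: pass
    elif hr <= 110: score += 1
    elif hr <= 129: score += 2
    else: score += 3
    # Respiratory rate (breaths/min)
    if rr < 9: score += 2
    elif rr <= 14: pass
    elif rr <= 20: score += 1
    elif rr <= 29: score += 2
    else: score += 3
    # Temperature (Celsius)
    if temp_c < 35: score += 2
    elif temp_c >= 38.5: score += 2
    # Neurological response (AVPU scale)
    score += {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}[avpu]
    return score

print(mews(sbp=95, hr=115, rr=22, temp_c=38.7, avpu="voice"))  # 1+2+2+2+1 = 8
```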
Affiliation(s)
- Chenxu Wang: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China; Department of Industrial Engineering, Sichuan University, Chengdu, China
- Fei Wang: Department of Nursing, West China School of Medicine, Sichuan University, Chengdu, China
- Shuhan Li: Department of Industrial Engineering, Sichuan University, Chengdu, China
- Qing-Wen Ren: Department of Medicine, Queen Mary Hospital, University of Hong Kong, Hong Kong, China
- Xiaomei Tan: Department of Industrial Engineering, Sichuan University, Chengdu, China
- Yaoyu Fu: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Di Liu: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China; Department of Industrial Engineering, Sichuan University, Chengdu, China; Med-X Center for Informatics, Sichuan University, Chengdu, China
- Guangwu Qian: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China; Department of Computer Science, Sichuan University, Chengdu, China
- Yu Cao: Department of Emergency Medicine, West China Hospital of Sichuan University, Chengdu, China
- Rong Yin: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China; Department of Industrial Engineering, Sichuan University, Chengdu, China; Med-X Center for Informatics, Sichuan University, Chengdu, China
- Kang Li: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China; Med-X Center for Informatics, Sichuan University, Chengdu, China
46
Chen R, Zhang S, Zheng Y, Yu Q, Wang C. Enhancing treatment decision-making for low back pain: a novel framework integrating large language models with retrieval-augmented generation technology. Front Med (Lausanne) 2025; 12:1599241. [PMID: 40438365 PMCID: PMC12116667 DOI: 10.3389/fmed.2025.1599241]
Abstract
Introduction Chronic low back pain (CLBP) is a global health problem that seriously affects patients' quality of life. The etiology of CLBP is complex, with non-specific symptoms and considerable heterogeneity, which poses a great challenge for diagnosis. In addition, uncertain treatment responses and the potential influence of psychological and social factors further complicate personalized decision-making in clinical practice. Methods This study proposed an innovative clinical decision support framework that combines large language models (LLMs) with retrieval-augmented generation (RAG) technology. Least-to-most (LtM) prompting was also introduced, aiming to simulate the decision-making process of senior experts and thereby improve personalized treatment for CLBP. A dedicated CLBP-related dataset was constructed to verify the effectiveness of the framework, and the proposed model, CLBP-GPT, was compared with GPT-4.0, ERNIE Bot, and DeepSeek on five key indicators: accuracy, relevance, clarity, benefit, and completeness. Results The results showed that CLBP-GPT scored significantly better than the comparison models in all five evaluation dimensions. Specifically, the total score of CLBP-GPT was 4.40 (SD = 0.20), substantially higher than GPT-4.0 (4.03, SD = 0.48), ERNIE Bot (3.54, SD = 0.53), and DeepSeek (3.81, SD = 0.47). In terms of accuracy, the average score of CLBP-GPT was 4.38 (SD = 0.19), while the scores of the other models were all below 4, indicating that CLBP-GPT could provide more accurate clinical decision-making recommendations. In addition, CLBP-GPT scored as high as 4.42 (SD = 0.19) on the completeness dimension, further demonstrating that its output was more comprehensive and covered more key information related to CLBP. Discussion This study provides new technical support for clinical decision-making in CLBP and introduces a powerful tool for doctors to formulate personalized and efficient treatment strategies, which is expected to improve the diagnosis and treatment of CLBP in the future.
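A minimal version of the RAG-plus-least-to-most pipeline described here fits in a short function. In this sketch `embed` and `call_llm` are hypothetical placeholders for an embedding model and a chat client, and the two-step prompt is only a schematic of the LtM idea, not the paper's prompts.

```python
# Minimal retrieval-augmented generation loop (a sketch, not the
# paper's system): embed guideline passages, retrieve by cosine
# similarity, and feed the hits into a least-to-most style prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in a sentence-embedding model")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a chat-completion client")

def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    mat = np.stack([embed(p) for p in passages])
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    return [passages[i] for i in np.argsort(sims)[::-1][:k]]

def answer_clbp_question(question: str, passages: list[str]) -> str:
    context = "\n".join(retrieve(question, passages))
    # Least-to-most: settle the easier subquestion (characterizing the
    # presentation) before the harder one (recommending treatment).
    sub = call_llm(f"Context:\n{context}\nFirst, characterize this CLBP "
                   f"presentation:\n{question}")
    return call_llm(f"Context:\n{context}\nCharacterization: {sub}\n"
                    f"Now recommend a personalized management plan for:\n{question}")
```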
Affiliation(s)
- Qiuhua Yu: Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Chuhuai Wang: Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
47
Jiao C, Rosas E, Asadigandomani H, Delsoz M, Madadi Y, Raja H, Munir WM, Tamm B, Mehravaran S, Djalilian AR, Yousefi S, Soleimani M. Diagnostic Performance of Publicly Available Large Language Models in Corneal Diseases: A Comparison with Human Specialists. Diagnostics (Basel) 2025; 15:1221. [PMID: 40428214 PMCID: PMC12110359 DOI: 10.3390/diagnostics15101221]
Abstract
Background/Objectives: This study evaluated the diagnostic accuracy of seven publicly available large language models (LLMs)-GPT-3.5, GPT-4.o Mini, GPT-4.o, Gemini 1.5 Flash, Claude 3.5 Sonnet, Grok3, and DeepSeek R1-in diagnosing corneal diseases, comparing their performance to human specialists. Methods: Twenty corneal disease cases from the University of Iowa's EyeRounds were presented to each LLM. Diagnostic accuracy was determined by comparing LLM-generated diagnoses to the confirmed case diagnoses. Four human cornea specialists evaluated the same cases to establish a benchmark and assess interobserver agreement. Results: Diagnostic accuracy varied significantly among LLMs (p = 0.001). GPT-4.o achieved the highest accuracy (80.0%), followed by Claude 3.5 Sonnet and Grok3 (70.0%), DeepSeek R1 (65.0%), GPT-3.5 (60.0%), GPT-4.o Mini (55.0%), and Gemini 1.5 Flash (30.0%). Human experts averaged 92.5% accuracy, outperforming all LLMs (p < 0.001, Cohen's d = -1.314). GPT-4.o showed no significant difference from human consensus (p = 0.250, κ = 0.348), while Claude and Grok3 showed fair agreement (κ = 0.219). DeepSeek R1 also performed reasonably (κ = 0.178), although this agreement did not reach statistical significance. Conclusions: Among the evaluated LLMs, GPT-4.o, Claude 3.5 Sonnet, Grok3, and DeepSeek R1 demonstrated promising diagnostic accuracy, with GPT-4.o most closely matching human performance. However, performance remained inconsistent, especially in complex cases. LLMs may offer value as diagnostic support tools, but human expertise remains indispensable for clinical decision-making.
Affiliation(s)
- Cheng Jiao: Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Erik Rosas: Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Hassan Asadigandomani: Department of Ophthalmology, University of California San Francisco, San Francisco, CA 94143, USA
- Mohammad Delsoz: Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
- Yeganeh Madadi: Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
- Hina Raja: Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
- Wuqaas M. Munir: Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Brendan Tamm: Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Shiva Mehravaran: Department of Biology, School of Computer, Mathematical, and Natural Sciences, Morgan State University, Baltimore, MD 21251, USA
- Ali R. Djalilian: Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL 60612, USA
- Siamak Yousefi: Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA; Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN 38136, USA
- Mohammad Soleimani: Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
48
Chen D, Chauhan K, Parsa R, Liu ZA, Liu FF, Mak E, Eng L, Hannon BL, Croke J, Hope A, Fallah-Rad N, Wong P, Raman S. Patient perceptions of empathy in physician and artificial intelligence chatbot responses to patient questions about cancer. NPJ Digit Med 2025; 8:275. [PMID: 40360673 PMCID: PMC12075825 DOI: 10.1038/s41746-025-01671-6]
Abstract
Artificial intelligence chatbots can draft empathetic responses to cancer questions, but how patients perceive chatbot empathy remains unclear. Here, we found that people with cancer rated chatbot responses as more empathetic than physician responses. However, differences between patient and physician perceptions of empathy highlight the need for further research to tailor clinical messaging to better meet patient needs. Chatbots may be effective in generating empathetic template responses to patient questions under clinician oversight.
Affiliation(s)
- David Chen: Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada; Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Kabir Chauhan: Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Rod Parsa: Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
- Zhihui Amy Liu: Department of Biostatistics, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada; Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Fei-Fei Liu: Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Ernie Mak: Department of Supportive Care, University Health Network, Toronto, ON, Canada; Department of Family & Community Medicine, University of Toronto, Toronto, ON, Canada
- Lawson Eng: Division of Medical Oncology and Hematology, Department of Medicine, Princess Margaret Cancer Centre/University Health Network, Toronto, ON, Canada; Division of Medical Oncology, Department of Medicine, University of Toronto, Toronto, ON, Canada
- Breffni Louise Hannon: Department of Supportive Care, University Health Network, Toronto, ON, Canada; Department of Medicine, University of Toronto, Toronto, ON, Canada
- Jennifer Croke: Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Andrew Hope: Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Nazanin Fallah-Rad: Division of Medical Oncology and Hematology, Department of Medicine, Princess Margaret Cancer Centre/University Health Network, Toronto, ON, Canada
- Phillip Wong: Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Srinivas Raman: Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
49
Shi B, Chen L, Pang S, Wang Y, Wang S, Li F, Zhao W, Guo P, Zhang L, Fan C, Zou Y, Wu X. Large Language Models and Artificial Neural Networks for Assessing 1-Year Mortality in Patients With Myocardial Infarction: Analysis From the Medical Information Mart for Intensive Care IV (MIMIC-IV) Database. J Med Internet Res 2025; 27:e67253. [PMID: 40354652 PMCID: PMC12107198 DOI: 10.2196/67253]
Abstract
BACKGROUND Accurate mortality risk prediction is crucial for effective cardiovascular risk management. Recent advancements in artificial intelligence (AI) have demonstrated potential in this specific medical field. Qwen-2 and Llama-3 are high-performance, open-source large language models (LLMs) available online. An artificial neural network (ANN) algorithm derived from the SWEDEHEART (Swedish Web System for Enhancement and Development of Evidence-Based Care in Heart Disease Evaluated According to Recommended Therapies) registry, termed SWEDEHEART-AI, can predict patient prognosis following acute myocardial infarction (AMI). OBJECTIVE This study aims to evaluate the 3 models mentioned above in predicting 1-year all-cause mortality in critically ill patients with AMI. METHODS The Medical Information Mart for Intensive Care IV (MIMIC-IV) database is a publicly available data set in critical care medicine. We included 2758 patients who were first admitted for AMI and discharged alive. SWEDEHEART-AI calculated the mortality rate based on each patient's 21 clinical variables. Qwen-2 and Llama-3 analyzed the content of patients' discharge records and directly provided a 1-decimal value between 0 and 1 to represent 1-year death risk probabilities. The patients' actual mortality was verified using follow-up data. The predictive performance of the 3 models was assessed and compared using the Harrell C-statistic (C-index), the area under the receiver operating characteristic curve (AUROC), calibration plots, Kaplan-Meier curves, and decision curve analysis. RESULTS SWEDEHEART-AI demonstrated strong discrimination in predicting 1-year all-cause mortality in patients with AMI, with a higher C-index than Qwen-2 and Llama-3 (C-index 0.72, 95% CI 0.69-0.74 vs C-index 0.65, 0.62-0.67 vs C-index 0.56, 95% CI 0.53-0.58, respectively; all P<.001 for both comparisons). SWEDEHEART-AI also showed high and consistent AUROC in the time-dependent ROC curve. The death rates calculated by SWEDEHEART-AI were positively correlated with actual mortality, and the 3 risk classes derived from this model showed clear differentiation in the Kaplan-Meier curve (P<.001). Calibration plots indicated that SWEDEHEART-AI tended to overestimate mortality risk, with an observed-to-expected ratio of 0.478. Compared with the LLMs, SWEDEHEART-AI demonstrated positive and greater net benefits at risk thresholds below 19%. CONCLUSIONS SWEDEHEART-AI, a trained ANN model, demonstrated the best performance, with strong discrimination and clinical utility in predicting 1-year all-cause mortality in patients with AMI from an intensive care cohort. Among the LLMs, Qwen-2 outperformed Llama-3 and showed moderate predictive value. Qwen-2 and SWEDEHEART-AI exhibited comparable classification effectiveness. The future integration of LLMs into clinical decision support systems holds promise for accurate risk stratification in patients with AMI; however, further research is needed to optimize LLM performance and address calibration issues across diverse patient populations.
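Harrell's C-index, the study's main discrimination metric, can be computed with the lifelines package as below. The toy data are invented; note that lifelines scores higher predictions as longer survival, so a death-risk score must be negated.

```python
# Toy C-index computation with the lifelines package.
from lifelines.utils import concordance_index

follow_up_days = [365, 120, 365, 40, 365, 300]   # time to death or censoring
died           = [0,   1,   0,   1,  0,   1]     # 1 = death observed
predicted_risk = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]  # model's 1-year death risk

# lifelines treats higher scores as predicting longer survival,
# so a death-risk score is negated before scoring.
c = concordance_index(follow_up_days, [-r for r in predicted_risk], died)
print(f"C-index = {c:.2f}")
```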
Affiliation(s)
- All authors: Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
50
Luo Y, Jiao M, Fotedar N, Ding JE, Karakis I, Rao VR, Asmar M, Xian X, Aboud O, Wen Y, Lin JJ, Hung FM, Sun H, Rosenow F, Liu F. Clinical Value of ChatGPT for Epilepsy Presurgical Decision-Making: Systematic Evaluation of Seizure Semiology Interpretation. J Med Internet Res 2025; 27:e69173. [PMID: 40354107 PMCID: PMC12107199 DOI: 10.2196/69173]
Abstract
BACKGROUND For patients with drug-resistant focal epilepsy, surgical resection of the epileptogenic zone (EZ) is an effective treatment to control seizures. Accurate localization of the EZ is crucial and is typically achieved through comprehensive presurgical approaches such as seizure semiology interpretation, electroencephalography (EEG), magnetic resonance imaging (MRI), and intracranial EEG (iEEG). However, interpreting seizure semiology is challenging because it relies heavily on expert knowledge, and semiology descriptions are often inconsistent and incoherent, leading to variability and potential limitations in presurgical evaluation. To overcome these challenges, advanced technologies such as large language models (LLMs), with ChatGPT being a notable example, offer valuable tools for analyzing complex textual information, making them well suited to interpreting detailed seizure semiology descriptions and accurately localizing the EZ. OBJECTIVE This study evaluates the clinical value of ChatGPT for interpreting seizure semiology to localize EZs in presurgical assessments for patients with focal epilepsy and compares its performance with that of epileptologists. METHODS We compiled 2 data cohorts: a publicly sourced cohort of 852 semiology-EZ pairs from 193 peer-reviewed journal publications and a private cohort of 184 semiology-EZ pairs collected from Far Eastern Memorial Hospital (FEMH) in Taiwan. ChatGPT was evaluated to predict the most likely EZ locations using 2 prompt methods: zero-shot prompting (ZSP) and few-shot prompting (FSP). To compare the performance of ChatGPT, 8 epileptologists were recruited to participate in an online survey to interpret 100 randomly selected semiology records. The responses from ChatGPT and the epileptologists were compared using 3 metrics: regional sensitivity (RSens), weighted sensitivity (WSens), and net positive inference rate (NPIR). RESULTS In the publicly sourced cohort, ChatGPT demonstrated high RSens reliability, achieving 80% to 90% for the frontal and temporal lobes; 20% to 40% for the parietal lobe, occipital lobe, and insular cortex; and only 3% for the cingulate cortex. The WSens, which accounts for biased data distribution, consistently exceeded 67%, while the mean NPIR remained around 0. The evaluation results based on the private FEMH cohort are consistent with those from the publicly sourced cohort. A group t test with 1000 bootstrap samples revealed that ChatGPT-4 significantly outperformed the epileptologists in RSens for the most frequently implicated EZs, such as the frontal and temporal lobes (P<.001). Additionally, ChatGPT-4 demonstrated superior overall performance in WSens (P<.001). However, no significant differences were observed between ChatGPT and the epileptologists in NPIR, highlighting comparable performance on this metric. CONCLUSIONS ChatGPT demonstrated clinical value as a tool to assist decision-making during epilepsy preoperative workups. With ongoing advancements in LLMs, their reliability and accuracy are anticipated to improve.
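The paper's custom metrics are not fully defined in the abstract, so the sketch below is only a plausible formalization: region-wise sensitivity (RSens) as the per-region hit rate and weighted sensitivity (WSens) as the frequency-weighted average across regions. All counts are toy values.

```python
# Plausible formalization of RSens/WSens (the paper's exact
# definitions, and NPIR, may differ); counts are toy values.
cases = {           # region: (correctly localized, total with EZ there)
    "frontal":   (170, 200),
    "temporal":  (260, 300),
    "parietal":  (30, 100),
    "occipital": (25, 80),
    "insular":   (20, 70),
    "cingulate": (3, 100),
}

rsens = {region: hit / tot for region, (hit, tot) in cases.items()}
n_total = sum(tot for _, tot in cases.values())
wsens = sum(rsens[region] * tot / n_total
            for region, (_, tot) in cases.items())

for region, s in rsens.items():
    print(f"RSens[{region}] = {s:.0%}")
print(f"WSens = {wsens:.0%}")   # weights regions by how often they occur
```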
Affiliation(s)
- Yaxi Luo: Department of Computer Science, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
- Meng Jiao: Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
- Neel Fotedar: School of Medicine, Case Western Reserve University, Cleveland, OH, United States; Department of Neurology, University Hospitals Cleveland Medical Center, Cleveland, OH, United States
- Jun-En Ding: Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
- Ioannis Karakis: Department of Neurology, School of Medicine, Emory University, Atlanta, GA, United States; Department of Neurology, School of Medicine, University of Crete, Heraklion, Greece
- Vikram R Rao: Department of Neurology and Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
- Melissa Asmar: Department of Neurology, University of California, Davis, Davis, CA, United States
- Xiaochen Xian: H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, United States
- Orwa Aboud: Department of Neurology and Neurological Surgery, University of California, Davis, Davis, CA, United States
- Yuxin Wen: Fowler School of Engineering, Chapman University, Orange, CA, United States
- Jack J Lin: Department of Neurology, University of California, Davis, Davis, CA, United States
- Fang-Ming Hung: Center of Artificial Intelligence, Far Eastern Memorial Hospital, New Taipei City, Taiwan; Surgical Trauma Intensive Care Unit, Far Eastern Memorial Hospital, New Taipei City, Taiwan
- Hai Sun: Department of Neurosurgery, Rutgers Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, United States
- Felix Rosenow: Department of Neurology, Epilepsy Center Frankfurt Rhine-Main, Goethe University Frankfurt, Frankfurt am Main, Germany
- Feng Liu: Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States; Semcer Center for Healthcare Innovation, Stevens Institute of Technology, Hoboken, NJ, United States