1. Chew BH, Lai PSM, Sivaratnam DA, Basri NI, Appannah G, Mohd Yusof BN, Thambiah SC, Nor Hanipah Z, Wong PF, Chang LC. Efficient and Effective Diabetes Care in the Era of Digitalization and Hypercompetitive Research Culture: A Focused Review in the Western Pacific Region with Malaysia as a Case Study. Health Syst Reform 2025; 11:2417788. [PMID: 39761168] [DOI: 10.1080/23288604.2024.2417788]
Abstract
There are approximately 220 million adults (about 12% regional prevalence) living with diabetes mellitus (DM) and its related complications and morbidity, knowingly or unknowingly, in the Western Pacific Region (WP). The estimated healthcare costs in the WP and Malaysia were 240 billion USD in 2021 and 1.0 billion USD in 2017, respectively, with unmeasurable suffering and loss of health quality and economic productivity. This urgently calls for nothing less than concerted and preventive efforts from all stakeholders to invest in transforming healthcare professionals and reforming the healthcare system: prioritizing the primary medical care setting, empowering allied health professionals, improving health organization for healthcare providers, and improving health facilities and non-medical support for people with DM. This article describes challenges in optimal diabetes care and proposes evidence-based initiatives over a 5-year period in a detailed roadmap to bring about dynamic and efficient healthcare services that are effective in managing people with DM, using Malaysia as a case study for reference by other countries with similar backgrounds and issues. This includes a scan of the landscape of clinical research in DM, the dimensions and spectrum of research misconduct, common biases along the whole research process, and key preventive strategies, implementation, and limitations toward high-quality research. Lastly, digital medicine and how artificial intelligence could contribute to diabetes care and open science practices in research are also discussed.
Affiliation(s)
- Boon-How Chew
- Department of Family Medicine, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Family Medicine Specialist Clinic, Hospital Sultan Abdul Aziz Shah (HSAAS Teaching Hospital), Persiaran MARDI - UPM, Serdang, Selangor, Malaysia
- Pauline Siew Mei Lai
- Department of Primary Care Medicine, Faculty of Medicine, Universiti Malaya, School of Medical and Life Sciences, Sunway University, Kuala Lumpur, Selangor, Malaysia
- Dhashani A/P Sivaratnam
- Department of Ophthalmology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Nurul Iftida Basri
- Department of Obstetrics and Gynaecology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Geeta Appannah
- Department of Nutrition, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Barakatun Nisak Mohd Yusof
- Department of Dietetics, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Subashini C Thambiah
- Department of Pathology, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Zubaidah Nor Hanipah
- Department of Surgery, Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Li-Cheng Chang
- Kuang Health Clinic, Pekan Kuang, Gombak, Selangor, Malaysia
2. Pushpanathan K, Zou M, Srinivasan S, Wong WM, Mangunkusumo EA, Thomas GN, Lai Y, Sun CH, Lam JSH, Tan MCJ, Lin HAH, Ma W, Koh VTC, Chen DZ, Tham YC. Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries? Ophthalmology Science 2025; 5:100745. [PMID: 40291392] [PMCID: PMC12022690] [DOI: 10.1016/j.xops.2025.100745]
Abstract
Objective The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher-quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability. Design Cross-sectional study. Subjects Sixteen queries, previously identified as suboptimally responded to by ChatGPT-4 in prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions). Methods For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses on correctness, completeness, and readability (each on a 5-point scale). Main Outcome Measures Mean summed scores of each model for correctness, completeness, and readability, each rated on a 5-point scale by 3 graders (maximum summed score: 15). Results o1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopic, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15. Conclusions While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from that of ChatGPT-4o. Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability. Financial Disclosures Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
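The scoring scheme is simple to reproduce: three masked graders each rate a response on a 5-point scale, per-question scores are summed (maximum 15), and summed scores are averaged across questions per model. A minimal sketch with invented ratings (not the study's data):

```python
# Toy illustration of the study's scoring scheme; the ratings are invented.
import numpy as np

# shape (questions, graders): correctness ratings for one model, each 1-5
ratings = np.array([
    [4, 5, 4],
    [3, 4, 4],
    [5, 5, 5],
])
summed = ratings.sum(axis=1)   # per-question summed score, out of 15
print(summed)                  # [13 11 15]
print(summed.mean())           # mean summed correctness score: 13.0
```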
Affiliation(s)
- Krithi Pushpanathan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Minjie Zou
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Sahana Srinivasan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Wendy Meihua Wong
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Erlangga Ariadarma Mangunkusumo
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- George Naveen Thomas
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Yien Lai
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Chen-Hsin Sun
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Janice Sing Harn Lam
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Marcus Chun Jin Tan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Hazel Anne Hui'En Lin
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Weizhi Ma
- Institute for AI Industry Research, Tsinghua University, Beijing, China
- Victor Teck Chang Koh
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- David Ziyou Chen
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
- Yih-Chung Tham
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
- Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore
3. Zaman A, Yassin MM, Mehmud I, Cao A, Lu J, Hassan H, Kang Y. Challenges, optimization strategies, and future horizons of advanced deep learning approaches for brain lesion segmentation. Methods 2025; 239:140-168. [PMID: 40306473] [DOI: 10.1016/j.ymeth.2025.04.016]
Abstract
Brain lesion segmentation, which aims to delineate lesion regions precisely, is a challenging task in medical image analysis. Deep learning (DL) techniques have recently demonstrated promising results across various computer vision tasks, including semantic segmentation, object detection, and image classification. This paper offers an overview of recent DL algorithms for brain tumor and stroke segmentation, drawing on literature from 2021 to 2024. It highlights the strengths, limitations, current research challenges, and unexplored areas in imaging-based brain lesion classification, based on insights from over 250 recent review papers. Techniques addressing difficulties such as class imbalance and multi-modality data are presented. Optimization methods for improving performance with respect to computational and structural complexity and processing speed are discussed, including lightweight neural networks, multilayer architectures, and computationally efficient, highly accurate network designs. The paper also reviews generic and recent frameworks for different brain lesion detection techniques and highlights publicly available benchmark datasets and their issues. Furthermore, open research areas, application prospects, and future directions for DL-based brain lesion classification are discussed, including integrating neural architecture search methods with domain knowledge, predicting patient survival, and learning to separate brain lesions using patient statistics. To ensure patient privacy, future research is anticipated to explore privacy-preserving learning frameworks. Overall, the presented suggestions serve as a guideline for researchers and system designers involved in brain lesion detection and stroke segmentation tasks.
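Among the class-imbalance remedies such reviews discuss, the soft Dice loss is one of the most common. A minimal PyTorch sketch of that single technique (an illustrative implementation, not code from any reviewed paper):

```python
# Soft Dice loss for binary lesion segmentation: directly optimizes overlap,
# so sparse lesion voxels are not swamped by the background class.
import torch

def soft_dice_loss(logits, targets, eps=1e-6):
    """logits: (N, 1, H, W) raw scores; targets: (N, 1, H, W) in {0, 1}."""
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)
    intersection = (probs * targets).sum(dims)
    denominator = probs.sum(dims) + targets.sum(dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()

logits = torch.randn(2, 1, 64, 64, requires_grad=True)
targets = (torch.rand(2, 1, 64, 64) > 0.95).float()  # sparse lesion mask
soft_dice_loss(logits, targets).backward()
```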
Affiliation(s)
- Asim Zaman
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Medical School, Shenzhen University, Shenzhen 518060, China
- Mazen M Yassin
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China
- Irfan Mehmud
- Department of Urology, The Third Affiliated Hospital of Shenzhen University (Luohu Hospital Group), Shenzhen University, Shenzhen 518000, China; Institute of Urology, South China Hospital, Medicine School, Shenzhen University, Shenzhen 518000, China
- Anbo Cao
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China
- Jiaxi Lu
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China
- Haseeb Hassan
- College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China
- Yan Kang
- School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen 518055, China; College of Health Science and Environmental Engineering, Shenzhen Technology University, Shenzhen 518118, China; School of Applied Technology, Shenzhen University, Shenzhen 518055, China; Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Medical School, Shenzhen University, Shenzhen 518060, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110169, China
4. Su Y, Babore YB, Kahn CE. A Large Language Model to Detect Negated Expressions in Radiology Reports. Journal of Imaging Informatics in Medicine 2025; 38:1297-1303. [PMID: 39322813] [DOI: 10.1007/s10278-024-01274-9]
Abstract
Natural language processing (NLP) is crucial for accurately extracting information from unstructured text to provide insights for clinical decision-making, quality improvement, and medical research. This study compared the performance of a rule-based NLP system and a medical-domain transformer-based model in detecting negated concepts in radiology reports. Using a corpus of 984 de-identified radiology reports from a large U.S.-based academic health system (1,000 consecutive reports, excluding 16 duplicates), the investigators compared the rule-based medspaCy system and the Clinical Assertion and Negation Classification Bidirectional Encoder Representations from Transformers (CAN-BERT) system in detecting negated expressions of terms from RadLex, the Unified Medical Language System Metathesaurus, and the Radiology Gamuts Ontology. Power analysis determined a sample size of 382 terms to achieve α = 0.05 and power of 0.8 for McNemar's test; based on an estimate of 15% negated terms, 2,800 randomly selected terms were annotated manually as negated or not negated. Precision, recall, and F1 of the two models were compared using McNemar's test. Of the 2,800 terms, 387 (13.8%) were negated. For negation detection, medspaCy attained a recall of 0.795, precision of 0.356, and F1 of 0.492. CAN-BERT achieved a recall of 0.785, precision of 0.768, and F1 of 0.777. Although recall was not significantly different, CAN-BERT had significantly better precision (χ2 = 304.64; p < 0.001). The transformer-based CAN-BERT model detected negated terms in radiology reports with high precision and recall; its precision significantly exceeded that of the rule-based medspaCy system. Use of this system will improve data extraction from textual reports to support information retrieval, AI model training, and discovery of causal relationships.
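The paired comparison at the heart of this design is easy to reproduce: tabulate, over the same annotated terms, where the two systems agree and disagree with the gold labels, then apply McNemar's test to the resulting 2x2 table. A minimal sketch with simulated labels (not the study's data):

```python
# Sketch of a paired McNemar comparison of two negation detectors
# evaluated on the same gold-labeled terms; labels here are simulated.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 500)                                # gold labels
rule_based = np.where(rng.random(500) < 0.80, truth, 1 - truth)
transformer = np.where(rng.random(500) < 0.90, truth, 1 - truth)

a_ok, b_ok = rule_based == truth, transformer == truth
# 2x2 paired-correctness table:
# [[both right, only rule-based right], [only transformer right, both wrong]]
table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
print(mcnemar(table, exact=False, correction=True))  # chi2 and p-value
```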
Affiliation(s)
- Yvonne Su
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, 19104, PA, USA
- Yonatan B Babore
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Charles E Kahn
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
5. Park SH, Dean G, Ortiz EM, Choi JI. Overview of South Korean Guidelines for Approval of Large Language or Multimodal Models as Medical Devices: Key Features and Areas for Improvement. Korean J Radiol 2025; 26:519-523. [PMID: 40288893] [DOI: 10.3348/kjr.2025.0257]
Affiliation(s)
- Seong Ho Park
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea.
- Geraldine Dean
- Telemedicine Clinic Ltd. (a Unilabs company), Barcelona, Spain
- Unilabs AI Centre of Excellence, Barcelona, Spain
- NHS Southwest London, London, United Kingdom
- Joon-Il Choi
- Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
6. Boyle A, Huo B, Sylla P, Calabrese E, Kumar S, Slater BJ, Walsh DS, Vosburg RW. Large language model-generated clinical practice guideline for appendicitis. Surg Endosc 2025; 39:3539-3551. [PMID: 40251310] [DOI: 10.1007/s00464-025-11723-3]
Abstract
BACKGROUND Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest SAGES guideline on the surgical management of appendicitis as a comparison. METHODS Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared with the existing SAGES guideline. RESULTS Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or to reliably perform screening, data extraction, or risk-of-bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline; in 19 of the 24 domains, the two guidelines scored within two points of each other. CONCLUSIONS LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce the time and resource burden associated with these tasks. As new models are developed, the role of LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs into each step of guideline development.
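To illustrate the kind of prompt engineering the METHODS describe, the sketch below turns a PICO into a GRADE Evidence-to-Decision request; the PICO text and prompt wording are invented, not the study's actual prompts:

```python
# Hypothetical prompt for one guideline-development task (drafting a
# GRADE-style recommendation from a PICO); wording is illustrative only.
pico = {
    "population": "adults with uncomplicated acute appendicitis",
    "intervention": "antibiotic-first management",
    "comparator": "laparoscopic appendectomy",
    "outcomes": "treatment failure at 1 year, complications, length of stay",
}

prompt = (
    "You are assisting a surgical guideline panel using the GRADE "
    "Evidence-to-Decision framework.\n"
    f"Population: {pico['population']}\n"
    f"Intervention: {pico['intervention']}\n"
    f"Comparator: {pico['comparator']}\n"
    f"Critical outcomes: {pico['outcomes']}\n"
    "Given the evidence summary that follows, draft a recommendation with "
    "direction, strength, and certainty of evidence, justifying each "
    "Evidence-to-Decision criterion."
)
print(prompt)  # sent to the chosen LLM's API
```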
Affiliation(s)
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, ON, Canada
- Patricia Sylla
- Division of Colon and Rectal Surgery, Department of Surgery, Mount Sinai Hospital, New York, NY, USA
- Elisa Calabrese
- Department of Surgery, University of Adelaide, The Queen Elizabeth Hospital, Adelaide, SA, Australia
- Sunjay Kumar
- Department of General Surgery, Thomas Jefferson University Hospital, Philadelphia, PA, USA
- Danielle S Walsh
- Department of Surgery, University of Kentucky, Lexington, KY, USA
- R Wesley Vosburg
- Department of Surgery, Grand Strand Medical Center, Myrtle Beach, SC, USA
7. Turner KM, Ahmad SA. Large language models as clinical decision support tools: A call for careful implementation in the care of patients with pancreatic cancer. Surgery 2025; 182:109378. [PMID: 40287319] [DOI: 10.1016/j.surg.2025.109378]
Affiliation(s)
- Kevin M Turner
- Department of Surgery, University of Cincinnati College of Medicine, Cincinnati, OH. https://twitter.com/KevinTurnerMD
- Syed A Ahmad
- Department of Surgery, Division of Surgical Oncology, University of Cincinnati College of Medicine, Cincinnati, OH
8. Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, Kleesiek J, Sushil M, Adams LC, Bressem KK. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J Am Med Inform Assoc 2025; 32:1015-1024. [PMID: 40190132] [DOI: 10.1093/jamia/ocaf045]
Abstract
OBJECTIVES Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. MATERIALS AND METHODS We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. RESULTS Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate. DISCUSSION Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation. CONCLUSION Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.
Affiliation(s)
- Felix J Dorfner
- Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin 10117, Germany
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, United States
- Amin Dada
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
- Felix Busch
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- Marcus R Makowski
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- Tianyu Han
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
- Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
- Jens Kleesiek
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
- Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen, University Hospital Essen (AöR), Essen 45147, Germany
- German Cancer Consortium (DKTK, Partner Site Essen), Heidelberg, Germany
- Department of Physics, TU Dortmund, Dortmund 44227, Germany
- Madhumita Sushil
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Lisa C Adams
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- Keno K Bressem
- Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich, Munich 81675, Germany
- German Heart Center Munich, Technical University Munich, Munich 80636, Germany
9. Wang Z, Sun J, Liu H, Luo X, Li J, He W, Yang Z, Lv H, Chen Y, Wang Z. Development and Performance of a Large Language Model for the Quality Evaluation of Multi-Language Medical Imaging Guidelines and Consensus. J Evid Based Med 2025; 18:e70020. [PMID: 40181523] [DOI: 10.1111/jebm.70020]
Abstract
AIM This study aimed to develop and evaluate an automated large language model (LLM)-based system for assessing the quality of medical imaging guidelines and consensus (GACS) in different languages, focusing on enhancing evaluation efficiency, consistency, and reducing manual workload. METHOD We developed the QPC-HASE-GuidelineEval algorithm, which integrates a Four-Quadrant Questions Classification Strategy and Hybrid Search Enhancement. The model was validated on 45 medical imaging guidelines (36 in Chinese and 9 in English) published in 2021 and 2022. Key evaluation metrics included consistency with expert assessments, hybrid search paragraph matching accuracy, information completeness, comparisons of different paragraph matching approaches, and cost-time efficiency. RESULTS The algorithm demonstrated an average accuracy of 77%, excelling in simpler tasks but showing lower accuracy (29%-40%) in complex evaluations, such as explanations and visual aids. The average accuracy rates of the English and Chinese versions of the GACS were 74% and 76%, respectively (p = 0.37). Hybrid search demonstrated superior performance with paragraph matching accuracy (4.42) and information completeness (4.42), significantly outperforming keyword-based search (1.05/1.05) and sparse-dense retrieval (4.26/3.63). The algorithm significantly reduced evaluation time to 8 min and 30 s per guideline and reduced costs to approximately 0.5 USD per guideline, offering a considerable advantage over traditional manual methods. CONCLUSION The QPC-HASE-GuidelineEval algorithm, powered by LLMs, showed strong potential for improving the efficiency, scalability, and multi-language capability of guideline evaluations, though further enhancements are needed to handle more complex tasks that require deeper interpretation.
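The core of the hybrid-search idea is score fusion: rank each candidate paragraph by a weighted combination of a sparse keyword score and a dense embedding similarity. A minimal sketch under stated assumptions (sentence-transformers as the embedder, a toy term-overlap score standing in for the paper's Elasticsearch-style keyword stage, invented paragraphs):

```python
# Minimal hybrid retrieval: fuse sparse keyword and dense embedding scores.
import numpy as np
from sentence_transformers import SentenceTransformer

paragraphs = [
    "CT angiography is recommended for suspected pulmonary embolism.",
    "MRI without contrast is preferred in pregnancy where feasible.",
    "Ultrasound is first-line for right upper quadrant pain.",
]
query = "imaging for pulmonary embolism"

def keyword_score(q, p):
    """Toy sparse score: fraction of query terms present in the paragraph."""
    q_terms = set(q.lower().split())
    return len(q_terms & set(p.lower().split())) / len(q_terms)

embedder = SentenceTransformer("all-MiniLM-L6-v2")
dense = embedder.encode(paragraphs, normalize_embeddings=True) @ \
        embedder.encode([query], normalize_embeddings=True)[0]
sparse = np.array([keyword_score(query, p) for p in paragraphs])

alpha = 0.5                              # fusion weight, a tunable assumption
fused = alpha * sparse + (1 - alpha) * dense
print(paragraphs[int(fused.argmax())])   # best-matching paragraph
```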
Affiliation(s)
- Zhixiang Wang
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Precision and Intelligent Imaging Laboratory, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Jing Sun
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
- Hui Liu
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Xufei Luo
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Jia Li
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
- Wenjun He
- Dermatology Hospital, Southern Medical University, Guangzhou, China
- Acacia Laboratory for Implementation Science, School of Health Management, Southern Medical University, Guangzhou, China
- Zhenhua Yang
- Vincent V.C. Woo Chinese Medicine Clinical Research Institute, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
- Han Lv
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
- Yaolong Chen
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Zhenchang Wang
- Department of Radiology, Capital Medical University Affiliated Beijing Friendship Hospital, Beijing, China
10. Satheakeerthy S, Jesudason D, Pietris J, Bacchi S, Chan WO. LLM-assisted medical documentation: efficacy, errors, and ethical considerations in ophthalmology. Eye (Lond) 2025; 39:1440-1442. [PMID: 40148503] [PMCID: PMC12089378] [DOI: 10.1038/s41433-025-03767-5]
Affiliation(s)
- Shrirajh Satheakeerthy
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
- Daniel Jesudason
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- James Pietris
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
- Stephen Bacchi
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
- Massachusetts General Hospital, 55 Fruit St, Boston, MA, 02114, USA
- Weng Onn Chan
- Faculty of Health & Medical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
- SA Health, Central Adelaide Local Health Network (CALHN), Adelaide, SA, 5000, Australia
- The Queen Elizabeth Hospital, 28 Woodville Rd, Woodville South, SA, 5011, Australia
11. Guan H, Novoa-Laurentiev J, Zhou L. CD-Tron: Leveraging large clinical language model for early detection of cognitive decline from electronic health records. J Biomed Inform 2025; 166:104830. [PMID: 40320101] [DOI: 10.1016/j.jbi.2025.104830]
Abstract
BACKGROUND Early detection of cognitive decline during the preclinical stage of Alzheimer's disease and related dementias (AD/ADRD) is crucial for timely intervention and treatment. Clinical notes in the electronic health record (EHR) contain valuable information that can aid in the early identification of cognitive decline. In this study, we utilize advanced large clinical language models, fine-tuned on clinical notes, to improve the early detection of cognitive decline. METHODS We collected clinical notes from 2,166 patients spanning the 4 years preceding their initial mild cognitive impairment (MCI) diagnosis from the Enterprise Data Warehouse of Mass General Brigham. We developed CD-Tron, built upon a large clinical language model that was fine-tuned using 4,949 expert-labeled note sections. For evaluation, the trained model was applied to 1,996 independent note sections to assess its performance on real-world unstructured clinical data. Additionally, we used explainable AI techniques, specifically SHAP (SHapley Additive exPlanations) values, to interpret the model's predictions and provide insight into the most influential features. Error analysis was also conducted to further characterize the model's predictions. RESULTS CD-Tron significantly outperforms baseline models, achieving notable improvements in precision, recall, and AUC for detecting cognitive decline (CD). Tested on the independent set of real-world clinical notes, CD-Tron demonstrated high sensitivity with only one false negative, which is crucial for clinical applications that prioritize early and accurate CD detection. SHAP-based interpretability analysis highlighted key textual features contributing to model predictions, supporting transparency and clinician understanding. CONCLUSION CD-Tron offers a novel approach to early cognitive decline detection by applying large clinical language models to free-text EHR data. Fine-tuned on real-world clinical notes, it accurately identifies early cognitive decline and integrates SHAP for interpretability, enhancing transparency in predictions.
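For the interpretability step, SHAP can be applied directly to a text-classification pipeline to obtain token-level attributions. A minimal sketch under stated assumptions (a generic off-the-shelf classifier stands in for CD-Tron, whose weights are not public here, and the note text is invented):

```python
# Sketch: token-level SHAP attributions for a text classifier.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in model
    top_k=None,  # return scores for every class
)
explainer = shap.Explainer(clf)  # SHAP infers a text masker for pipelines
notes = ["Patient reports increasing forgetfulness and word-finding difficulty."]
shap_values = explainer(notes)
print(shap_values[0])  # per-token contributions to each class score
```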
Affiliation(s)
- Hao Guan
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA.
- John Novoa-Laurentiev
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA
- Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
12. Wu Y, Zhang Y, Wu Y, Zheng Q, Li X, Chen X. ChatIOS: Improving automatic 3-dimensional tooth segmentation via GPT-4V and multimodal pre-training. J Dent 2025; 157:105755. [PMID: 40228651] [DOI: 10.1016/j.jdent.2025.105755]
Abstract
OBJECTIVES This study proposes a framework that integrates GPT-4V, a recent advanced version of ChatGPT, with multimodal pre-training techniques to enhance deep learning algorithms for 3-dimensional (3D) tooth segmentation in scans produced by intraoral scanners (IOSs). METHODS The framework was developed on 1,800 intraoral scans containing approximately 24,000 annotated teeth (training set: 1,200 scans, 16,004 teeth; testing set: 600 scans, 7,995 teeth) from the Teeth3DS dataset, gathered from 900 patients and covering both the maxillary and mandibular arches. The first step of the proposed framework, ChatIOS, is to pre-process the 3D IOS data to extract 3D point clouds. GPT-4V then generates detailed descriptions of 2-dimensional (2D) IOS images taken from different view angles. In the multimodal pre-training, triplets comprising point clouds, 2D images, and text descriptions serve as inputs. A series of ablation studies was systematically conducted to illustrate the superior design of the automatic 3D tooth segmentation system. Our quantitative evaluation criteria included segmentation quality, processing speed, and clinical applicability. RESULTS When tested on 600 scans, ChatIOS substantially outperformed existing benchmarks such as PointNet++ across all metrics, including mean intersection-over-union (mIoU; from 90.3% to 93.0% for maxillary and from 89.2% to 92.3% for mandibular scans), segmentation accuracy (from 97.0% to 98.0% for maxillary and from 96.8% to 97.9% for mandibular scans), and dice similarity coefficient (DSC; from 98.1% to 98.7% for maxillary and from 97.9% to 98.6% for mandibular scans). Our model took only approximately 2 s to generate segmentation outputs per scan and exhibited acceptable consistency with clinical expert evaluations. CONCLUSIONS Our ChatIOS framework can increase the effectiveness and efficiency of the 3D tooth segmentation that clinical procedures, including orthodontic and prosthetic treatments, require. This study presents an early exploration of the applications of GPT-4V in digital dentistry and pioneers the multimodal pre-training paradigm for 3D tooth segmentation. CLINICAL SIGNIFICANCE Accurate segmentation of teeth on 3D intraoral scans is critical for orthodontic and prosthetic treatments. ChatIOS integrates GPT-4V with pre-trained vision-language models (VLMs) to gain an in-depth understanding of IOS data, contributing to more efficient and precise tooth segmentation systems.
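The multimodal pre-training step can be pictured as contrastive alignment of the three modalities of each triplet. A minimal PyTorch sketch under stated assumptions (stand-in linear encoders over random features; the actual point-cloud, vision, and text backbones are not reproduced):

```python
# Tri-modal contrastive alignment of (point cloud, image, text) triplets.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(a.size(0))       # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

B, D = 8, 256
enc_pc, enc_img, enc_txt = (torch.nn.Linear(512, D) for _ in range(3))
pc_feat, img_feat, txt_feat = (torch.randn(B, 512) for _ in range(3))

z_pc, z_img, z_txt = enc_pc(pc_feat), enc_img(img_feat), enc_txt(txt_feat)
# Pull the three views of each triplet together, push other triplets apart
loss = info_nce(z_pc, z_img) + info_nce(z_pc, z_txt) + info_nce(z_img, z_txt)
loss.backward()
```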
Affiliation(s)
- Yongjia Wu
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China.
- Yun Zhang
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Yange Wu
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Qianhan Zheng
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Xiaojun Li
- Department of Periodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
- Xuepeng Chen
- Department of Orthodontics, Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, PR China
13. Armitage RC. How do GPs Want Large Language Models to be Applied in Primary Care, and What Are Their Concerns? A Cross-Sectional Survey. J Eval Clin Pract 2025; 31:e70129. [PMID: 40369934] [PMCID: PMC12079004] [DOI: 10.1111/jep.70129]
Abstract
INTRODUCTION Although the potential utility of large language models (LLMs) in medicine and healthcare is substantial, no assessment has been made to date of how GPs want LLMs to be applied in primary care, or of which issues GPs are most concerned about regarding the implementation of LLMs into their clinical practice. This study's objective was to generate preliminary evidence that answers these questions, which are relevant because GPs themselves will ultimately harness the power of LLMs in primary care. METHODS Non-probability sampling was utilised: GPs practicing in the UK who were members of one of two Facebook groups (one a community of UK primary care staff, the other a community of GMC-registered doctors in the UK) were invited to complete an online survey, which ran from 06 to 13 November 2024. RESULTS The survey received 113 responses, 107 of which were from GPs practicing in the UK. When LLM accuracy and safety were assumed to be guaranteed, respondents reported broad enthusiasm for LLMs carrying out various nonclinical and clinical tasks in primary care. The nonclinical and clinical tasks that respondents were most supportive of were, respectively, the LLM listening to the consultation and writing notes in real time for the GP to review, edit, and save (44.0%), and the LLM identifying outstanding clinical tasks and actioning them (51.0%). Respondents were concerned about a range of issues regarding LLMs being embedded into clinical systems, with patient safety being the most commonly reported single concern (36.2%). DISCUSSION This study has generated preliminary evidence of potential utility to those developing LLMs for use in primary care. Further research is required to expand this evidence base to further inform the development of these technologies and to ensure they are acceptable to the GPs who will use them.
Affiliation(s)
- Richard C. Armitage
- Academic Unit of Population and Lifespan Sciences, School of Medicine, University of Nottingham, Nottingham, UK
14. Choi H, Lee D, Kang YK, Suh M. Empowering PET imaging reporting with retrieval-augmented large language models and reading reports database: a pilot single center study. Eur J Nucl Med Mol Imaging 2025; 52:2452-2462. [PMID: 39843863] [DOI: 10.1007/s00259-025-07101-9]
Abstract
PURPOSE Large language models (LLMs) have the potential to enhance a variety of clinical natural language tasks, including medical imaging reporting. This pilot study examines the efficacy of a retrieval-augmented generation (RAG) LLM system, which leverages the zero-shot learning capability of LLMs and is integrated with a comprehensive database of PET reading reports, in improving reference to prior reports and decision making. METHODS We developed a custom LLM framework with retrieval capabilities, leveraging a database of over 10 years of PET imaging reports from a single center. The system uses vector space embedding to facilitate similarity-based retrieval. Queries prompt the system to generate context-based answers and identify similar cases or differential diagnoses. From routine clinical PET readings, experienced nuclear medicine physicians evaluated the performance of the system in terms of the relevance of retrieved similar cases and the appropriateness score of suggested potential diagnoses. RESULTS The system efficiently organized embedded vectors from PET reports, showing that imaging reports were accurately clustered within the embedded vector space according to diagnosis or PET study type. Based on this system, a proof-of-concept chatbot was developed and demonstrated the framework's potential for referencing reports of previous similar cases and identifying exemplary cases for various purposes. In routine clinical PET readings, 84.1% of cases retrieved relevant similar cases, as agreed upon by all three readers. Using the RAG system, the appropriateness score of the suggested potential diagnoses was significantly better than that of the LLM without RAG. Additionally, the system demonstrated the capability to offer differential diagnoses, leveraging the vast database to enhance the completeness and precision of generated reports. CONCLUSION The integration of a RAG LLM with a large database of PET imaging reports suggests the potential to support the clinical practice of nuclear medicine imaging reading through various AI tasks, including finding similar cases and deriving potential diagnoses from them. This study underscores the potential of advanced AI tools in transforming medical imaging reporting practices.
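The retrieval core described here (embed every report, then fetch the nearest prior reports to ground the LLM's answer) can be sketched in a few lines. Assumptions: sentence-transformers as the embedding model and invented report snippets; the study's actual embedding model and PET report database are not reproduced:

```python
# Similarity-based retrieval over prior reports for a RAG prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

reports = [
    "FDG PET/CT: hypermetabolic right upper lobe nodule, SUVmax 8.2.",
    "FDG PET/CT: diffuse marrow uptake consistent with reactive change.",
    "Amyloid PET: elevated cortical tracer retention; positive study.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
report_vecs = model.encode(reports, normalize_embeddings=True)

query = "hypermetabolic pulmonary nodule on FDG PET"
q_vec = model.encode([query], normalize_embeddings=True)[0]

sims = report_vecs @ q_vec       # cosine similarity on normalized vectors
top = np.argsort(-sims)[:2]      # the most similar prior reports
context = "\n".join(reports[i] for i in top)
print(f"Prior similar reports:\n{context}\n\nQuestion: {query}")
```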
Affiliation(s)
- Hongyoon Choi
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
- Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Portrai, Inc., Seoul, Republic of Korea.
- Yeon-Koo Kang
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
- Minseok Suh
- Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
- Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
15. Lamprecht CB, Lyerly M, Lucke-Wold B. Commentary: CNS-CLIP: Transforming a Neurosurgical Journal Into a Multimodal Medical Model. Neurosurgery 2025; 96:e123-e124. [PMID: 39636115] [DOI: 10.1227/neu.0000000000003298]
Affiliation(s)
- Chris B Lamprecht
- Department of Neurosurgery, College of Medicine, University of Florida, Gainesville, Florida, USA
- Mac Lyerly
- Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA
- Brandon Lucke-Wold
- Lillian S. Wells Department of Neurosurgery, University of Florida, Gainesville, Florida, USA
16. Deng A, Chen W, Dai J, Jiang L, Chen Y, Chen Y, Jiang J, Rao M. Current application of ChatGPT in undergraduate nuclear medicine education: Taking Chongqing Medical University as an example. Medical Teacher 2025; 47:997-1003. [PMID: 39305476] [DOI: 10.1080/0142159x.2024.2399673]
Abstract
BACKGROUND Nuclear Medicine (NM), as an inherently interdisciplinary field, integrates diverse scientific principles and advanced imaging techniques. The advent of ChatGPT, a large language model, opens new avenues for medical educational innovation. With its advanced natural language processing abilities and complex algorithms, ChatGPT holds the potential to substantially enrich medical education, particularly in NM. OBJECTIVE To investigate the current application of ChatGPT in undergraduate Nuclear Medicine Education (NME). METHODS Employing a mixed-methods sequential explanatory design, the research investigates the current status of NME, the use of ChatGPT, and attitudes towards ChatGPT among teachers and students in the Second Clinical College of Chongqing Medical University. RESULTS The investigation yields several salient findings: (1) students and educators in NM face numerous challenges in the learning process; (2) ChatGPT possesses significant applicability and potential benefits in NME; (3) there is a pronounced inclination among respondents to adopt ChatGPT, with a keen interest in its diverse applications within the educational sphere. CONCLUSION ChatGPT has been utilized to address the difficulties faced by undergraduates at Chongqing Medical University in NME and has been applied in various aspects to assist learning. The findings of this survey may offer insights into how ChatGPT can be integrated into practical medical education.
Affiliation(s)
- Ailin Deng
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Wenyi Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Jinjie Dai
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Liu Jiang
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Yicai Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Yuhua Chen
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Clinical Medicine, Chongqing Medical University, Chongqing, China
- Jinyan Jiang
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Maohua Rao
- Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
17. Zhu X. Elevating Clinical Practice in Interventional Radiology With Strategic Prompt Engineering. AJR Am J Roentgenol 2025. [PMID: 40434170] [DOI: 10.2214/ajr.25.33266]
Affiliation(s)
- Xiaoli Zhu
- The First Affiliated Hospital of Soochow University
18. Tang T, Li X, Lin Y, Liu C. Comparing digital real-time versus virtual simulation systems in dental education for preclinical tooth preparation of molars for metal-ceramic crowns. BMC Oral Health 2025; 25:814. [PMID: 40426144] [DOI: 10.1186/s12903-025-06161-5]
Abstract
PURPOSE This study aimed to compare the effectiveness of the Real-time Dental Training and Evaluation System (RDTES) and Virtual Simulation System (VSS) with the Traditional Head-Simulator (THS) method in teaching molar preparation for metal-ceramic crowns in preclinical dental education. METHODS Undergraduate students were divided into four groups: No Additional Training (NAT), THS, RDTES, and VSS. The primary outcomes measured were artificial and machine scoring of tooth preparations, with additional anonymous surveys assessing student feedback. RESULTS Both RDTES and VSS groups demonstrated significantly higher tooth preparation scores compared to the NAT group, with RDTES showing superior performance in machine scan scoring. Linear regression analysis revealed a clear positive correlation between scoring and scoring improvement for both artificial and machine assessments. Student surveys indicated RDTES was rated higher in accuracy, feedback quality, skill improvement, and teaching effectiveness. CONCLUSIONS RDTES and VSS significantly enhance students' mastery of molar tooth preparation, with RDTES providing more precise guidance on tooth preparation volume. These systems show broad application prospects and development potential in dental education.
Affiliation(s)
- Tianyu Tang
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Xingxing Li
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Yunhong Lin
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Caojie Liu
- Department of Prosthodontics, The Affiliated Stomatology Hospital of Kunming Medical University, Kunming, Yunnan, 650106, China
- Yunnan Key Laboratory of Stomatology, The Affiliated Stomatology Hospital of Kunming Medical University, Chenggong District, 1168 West Chunrong Road, Yuhua Avenue, Kunming, Yunnan, 650500, People's Republic of China
19. Kunze KN, Bepple J, Bedi A, Ramkumar PN, Pean CA. Commercial Products Using Generative Artificial Intelligence Include Ambient Scribes, Automated Documentation and Scheduling, Revenue Cycle Management, Patient Engagement and Education, and Prior Authorization Platforms. Arthroscopy 2025:S0749-8063(25)00397-4. [PMID: 40419172] [DOI: 10.1016/j.arthro.2025.05.021]
Abstract
The integration of artificial intelligence (AI) into clinical practice is rapidly transforming healthcare workflows. At the forefront are large language models (LLMs), embedded within commercial and enterprise platforms to optimize documentation, streamline administration, and personalize patient engagement. The evolution of LLMs in healthcare has been driven by rapid advancements in natural language processing (NLP) and deep learning. Emerging commercial products include Ambient Scribes, Automated Documentation and Scheduling, Revenue Cycle Management, Patient Engagement and Education Assistants, and Prior Authorization Platforms. Ambient Scribes remain the leading commercial generative AI product, with approximately 90 platforms in existence to date. Emerging applications may improve provider efficiency and payer-provider alignment by automating the prior authorization process, reducing the manual labor burden placed on clinicians and staff. Current limitations include (1) lack of regulatory oversight, (2) existing biases, (3) inconsistent interoperability with electronic health records (EHRs), and (4) lack of physician and stakeholder buy-in owing to limited confidence in LLM outputs. Moving forward requires discussion of ethical, clinical, and operational considerations.
Collapse
Affiliation(s)
- Kyle N Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY, USA.
| | | | - Asheesh Bedi
- Department of Orthopaedic Surgery, University of Michigan, Ann Arbor, MI, USA
| | | | - Christian A Pean
- Department of Orthopaedic Surgery, Duke University School of Medicine, Durham, NC, USA
| |
Collapse
|
20
|
Yang Q, Zuo H, Su R, Su H, Zeng T, Zhou H, Wang R, Chen J, Lin Y, Chen Z, Tan T. Dual retrieving and ranking medical large language model with retrieval augmented generation. Sci Rep 2025; 15:18062. [PMID: 40413225 PMCID: PMC12103550 DOI: 10.1038/s41598-025-00724-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2024] [Accepted: 04/30/2025] [Indexed: 05/27/2025] Open
Abstract
Recent advancements in large language models (LLMs) have significantly enhanced text generation across various sectors; however, their medical application faces critical challenges regarding both accuracy and real-time responsiveness. To address these dual challenges, we propose a novel two-step retrieve-and-rank retrieval-augmented generation (RAG) framework that combines embedding search with Elasticsearch technology. Built upon a dynamically updated medical knowledge base incorporating expert-reviewed documents from leading healthcare institutions, our hybrid architecture employs ColBERTv2 for context-aware result ranking while maintaining computational efficiency. Experimental results show a 10% improvement in accuracy for complex medical queries compared to standalone LLM and single-search RAG variants. Latency remains a challenge in our experimental setting for emergency situations requiring sub-second responses, although real-time performance could be achieved with more powerful hardware in real-world deployments. This work establishes a new paradigm for reliable medical AI assistants that balances accuracy with practical deployment considerations.
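To make the two-step pipeline concrete, below is a minimal Python sketch of the retrieve-then-rerank pattern: a cheap hybrid pass (keyword plus embedding scores) over the whole knowledge base, then an expensive fine-grained reranker over only the shortlisted candidates. The scoring functions are simplified stand-ins for the paper's components (Elasticsearch/BM25, embedding search, ColBERTv2), and all names and documents are hypothetical.

```python
# Illustrative retrieve-then-rerank RAG skeleton; scorers are toy stand-ins.
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str

KNOWLEDGE_BASE = [
    Doc("kb1", "Metformin is first-line therapy for type 2 diabetes."),
    Doc("kb2", "Insulin dosing must be adjusted for renal impairment."),
    Doc("kb3", "Sepsis requires prompt antibiotics and fluid resuscitation."),
]

def keyword_score(query: str, doc: Doc) -> float:
    """Stand-in for an Elasticsearch/BM25 keyword score: token overlap."""
    q, d = set(query.lower().split()), set(doc.text.lower().split())
    return len(q & d) / (len(q) or 1)

def dense_score(query: str, doc: Doc) -> float:
    """Stand-in for embedding cosine similarity: character-bigram Jaccard."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    q, d = grams(query.lower()), grams(doc.text.lower())
    return len(q & d) / (len(q | d) or 1)

def rerank_score(query: str, doc: Doc) -> float:
    """Stand-in for ColBERTv2-style late interaction: best match per query token."""
    d_tokens = doc.text.lower().split()
    return sum(max(dense_score(t, Doc("", dt)) for dt in d_tokens)
               for t in query.lower().split())

def retrieve_then_rerank(query: str, k_retrieve: int = 2, k_final: int = 1):
    # Step 1: cheap hybrid retrieval over the whole knowledge base.
    pooled = sorted(KNOWLEDGE_BASE,
                    key=lambda d: keyword_score(query, d) + dense_score(query, d),
                    reverse=True)[:k_retrieve]
    # Step 2: expensive reranking restricted to the small candidate pool,
    # which is what keeps the second stage computationally affordable.
    return sorted(pooled, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

if __name__ == "__main__":
    for doc in retrieve_then_rerank("first line drug for type 2 diabetes"):
        print(doc.doc_id, doc.text)
```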
Collapse
Affiliation(s)
- Qimin Yang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Huan Zuo
- School of Public Health, University of South China, Hengyang, China
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Runqi Su
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Hanyinghong Su
- School of Public Health, University of South China, Hengyang, China
| | - Tangyi Zeng
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Huimei Zhou
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China
| | - Rongsheng Wang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Jiexin Chen
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Yijun Lin
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Zhiyi Chen
- School of Public Health, University of South China, Hengyang, China.
- The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China.
- Key Laboratory of Medical Imaging Precision Theranostics and Radiation Protection, College of Hunan Province, The Affiliated Changsha Central Hospital, University of South China, Changsha, China.
- Department of Medical Imaging, The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, China.
| | - Tao Tan
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China.
| |
Collapse
|
21
|
McInnis MG, Coleman B, Hurwitz E, Robinson PN, Williams AE, Haendel MA, McMurry JA. Integrating Knowledge: The Power of Ontologies in Psychiatric Research and Clinical Informatics. Biol Psychiatry 2025:S0006-3223(25)01213-2. [PMID: 40414449 DOI: 10.1016/j.biopsych.2025.05.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/07/2025] [Revised: 05/07/2025] [Accepted: 05/14/2025] [Indexed: 05/27/2025]
Abstract
Ontologies are structured frameworks for representing knowledge by systematically defining concepts, categories, and their relationships. While widely adopted in biomedicine, ontologies remain largely absent in mental health research and clinical care, where the field continues to rely heavily on existing classification systems such as the DSM. Although useful for clinical communication and administrative purposes, these systems lack the semantic structure and computational reasoning properties needed to integrate diverse data sources or support artificial intelligence (AI)-enabled analysis. This reliance on classification systems limits efforts to analyze and interpret complex, heterogeneous psychiatric data. In mood disorders, particularly bipolar disorder, the lack of formalized semantic models contributes to diagnostic inconsistencies, fragmented data structures, and barriers to precision medicine. Ontologies, by contrast, provide a standardized, machine-readable foundation for linking multimodal data sources, such as electronic health records (EHRs), genetic and neuroimaging data, and social determinants of health, while enabling secure, de-identified computation. This review surveys the current landscape of mental health ontologies and highlights the Human Phenotype Ontology (HPO) as a promising framework for bridging psychiatric and medical phenotypes. We describe ongoing efforts to enhance HPO through curated psychiatric terms, refined definitions, and structured mappings of observed phenomena. The Global Bipolar Cohort (GBC), an international collaboration, exemplifies this approach through the development of a consensus-driven ontology tailored to bipolar disorder. By supporting semantic interoperability, reproducible research, and individualized care, ontology-based approaches provide essential infrastructure for overcoming the limitations of classification systems and advancing data-driven precision psychiatry.
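As an illustration of the semantic structure the authors argue flat classification codes lack, the sketch below encodes a tiny is-a hierarchy and answers subsumption queries over it; this transitive-closure step is what lets a record annotated with a specific term be retrieved by queries for any broader class. The term names are hypothetical stand-ins, not real HPO identifiers.

```python
# Toy is-a ontology: child -> list of parents (a DAG, as in HPO).
IS_A = {
    "manic_episode": ["mood_disturbance"],
    "depressive_episode": ["mood_disturbance"],
    "mood_disturbance": ["behavioral_abnormality"],
    "behavioral_abnormality": ["phenotypic_abnormality"],
    "phenotypic_abnormality": [],
}

def ancestors(term: str) -> set[str]:
    """Transitive closure over is-a links (the basis of subsumption queries)."""
    seen: set[str] = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen

def is_subclass_of(term: str, candidate: str) -> bool:
    return candidate in ancestors(term)

if __name__ == "__main__":
    # A record annotated "manic_episode" is retrievable by a query for the
    # broader class "behavioral_abnormality" without any manual mapping.
    print(is_subclass_of("manic_episode", "behavioral_abnormality"))  # True
```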
Collapse
Affiliation(s)
| | - Ben Coleman
- University of Connecticut, Farmington, CT, USA
| | - Eric Hurwitz
- University of North Carolina, Chapel Hill, NC, USA
| | - Peter N Robinson
- University of Connecticut, Farmington, CT, USA; Berlin Institute of Health at Charité, Berlin, Germany
| | | | | | | |
Collapse
|
22
|
Chen YC, Lee SH, Sheu H, Lin SH, Hu CC, Fu SC, Yang CP, Lin YC. Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty. BMC Med Inform Decis Mak 2025; 25:196. [PMID: 40410756 PMCID: PMC12102839 DOI: 10.1186/s12911-025-03024-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2025] [Accepted: 05/09/2025] [Indexed: 05/25/2025] Open
Abstract
BACKGROUND The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts. METHODS Four leading LLMs (GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus) were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question-only), and one under role-playing prompting (instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons evaluated responses for accuracy and comprehensiveness on a 5-point Likert scale, along with a binary measure for acceptability. Statistical analyses (Wilcoxon rank sum and Chi-squared tests; P < 0.05) were conducted to compare model performance. RESULTS ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, and 72.5%, respectively). Google Gemini and Claude 3 Opus demonstrated lower performance across all metrics. In between-model comparisons based on zero-shot prompting, ChatGPT-4 achieved significantly higher scores for both accuracy and comprehensiveness relative to Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002), and demonstrated higher acceptability than Claude 3 Opus (P = 0.006). Within-model comparisons showed that role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude. CONCLUSIONS This study demonstrates that role-playing prompts significantly enhance the performance of LLMs, particularly ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4 with role-playing prompts showed superior performance in terms of accuracy, comprehensiveness, and acceptability. Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice. CLINICAL TRIAL NUMBER Not applicable.
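The study's two prompting conditions are simple to picture in code. The sketch below builds the zero-shot and role-playing message variants side by side; the role wording is an illustrative assumption rather than the authors' exact prompt, and ask_llm is a hypothetical stub for any chat-completion client.

```python
# Contrast of zero-shot vs role-playing prompting for a TKA patient FAQ.
ROLE_PREAMBLE = (
    "You are an experienced orthopaedic surgeon specializing in total knee "
    "arthroplasty. Answer the patient's question accurately and thoroughly."
)

def build_messages(question: str, role_playing: bool) -> list[dict]:
    messages = []
    if role_playing:
        # Role-playing condition: prepend a persona as the system message.
        messages.append({"role": "system", "content": ROLE_PREAMBLE})
    # Zero-shot condition: the user question is sent on its own.
    messages.append({"role": "user", "content": question})
    return messages

def ask_llm(messages: list[dict]) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

if __name__ == "__main__":
    q = "How long does recovery take after total knee arthroplasty?"
    print(build_messages(q, role_playing=False))  # zero-shot
    print(build_messages(q, role_playing=True))   # role-playing
```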
Collapse
Affiliation(s)
- Yi-Chen Chen
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Sheng-Hsun Lee
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Huan Sheu
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Sheng-Hsuan Lin
- Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Institute of Data Science and Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Applied Mathematics, National Dong Hwa University, Hualien, Taiwan
- Department of Biochemical and Molecular Medical Sciences, National Dong Hwa University, Hualien, Taiwan
| | - Chih-Chien Hu
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan
| | - Shih-Chen Fu
- Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biochemical and Molecular Medical Sciences, National Dong Hwa University, Hualien, Taiwan
| | - Cheng-Pang Yang
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.
| | - Yu-Chih Lin
- Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Taoyuan, 333, Taiwan.
| |
Collapse
|
23
|
Khan M, Ahuja K, Tsirikos AI. AI and machine learning in paediatric spine deformity surgery. Bone Jt Open 2025; 6:569-581. [PMID: 40407025 PMCID: PMC12100669 DOI: 10.1302/2633-1462.65.bjo-2024-0089.r1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 05/26/2025] Open
Abstract
Paediatric spine deformity surgery is a high-stakes procedure that demands exceptional anatomical knowledge and precise visuospatial awareness from the surgeon. Demand for precision medicine is increasing, and rapid advancements in computational technologies, including the recent explosion of AI and machine learning (ML), have made it attainable. We present the surgical and ethical applications of AI and ML in diagnosis, prognosis, image processing, and outcomes in the field of paediatric spine deformity.
Collapse
Affiliation(s)
- Mohsin Khan
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| | - Kaustubh Ahuja
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| | - Athanasios I Tsirikos
- Scottish National Spine Deformity Centre, Royal Hospital for Children and Young People, Edinburgh, UK
| |
Collapse
|
24
|
Mao C, Li J, Pang PCI, Zhu Q, Chen R. Identifying Kidney Stone Risk Factors Through Patient Experiences With a Large Language Model: Text Analysis and Empirical Study. J Med Internet Res 2025; 27:e66365. [PMID: 40403294 DOI: 10.2196/66365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2024] [Revised: 12/16/2024] [Accepted: 04/10/2025] [Indexed: 05/24/2025] Open
Abstract
BACKGROUND Kidney stones, a prevalent urinary disease, pose significant health risks. Factors like insufficient water intake or a high-protein diet increase an individual's susceptibility to the disease. Social media platforms can be a valuable avenue for users to share their experiences in managing these risk factors. Analyzing such patient-reported information can provide crucial insights into risk factors, potentially leading to improved quality of life for other patients. OBJECTIVE This study aims to develop KSrisk-GPT, a model based on a large language model (LLM), to identify potential kidney stone risk factors from web-based user experiences. METHODS This study collected posts on the topic of kidney stones published on Zhihu over the past 5 years, yielding 11,819 user comments. Experts organized the most common risk factors for kidney stones into six categories. We then used least-to-most prompting, a chain-of-thought strategy, to enable GPT-4.0 to reason like an expert, asking it to identify risk factors from the comments. Metrics including accuracy, precision, recall, and F1-score were used to evaluate the model's performance. RESULTS Our proposed method outperformed other models in identifying comments containing risk factors, with 95.9% accuracy and F1-score, a precision of 95.6%, and a recall of 96.2%. Of the 863 comments identified as containing risk factors, the most frequently mentioned risk factors in Zhihu user discussions included dietary habits (high protein, high calcium intake), insufficient water intake, genetic factors, and lifestyle. In addition, GPT identified new potential risk factors, such as excessive use of supplements like vitamin C and calcium, laxatives, and hyperparathyroidism. CONCLUSIONS Comments from social media users offer a new data source for disease prevention and understanding patient journeys. Our method not only sheds light on using LLMs to efficiently summarize risk factors from social media data but also on LLMs' potential to identify new potential factors from the patient's perspective.
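Least-to-most prompting orders subquestions from easiest to hardest and feeds each answer into the next step. The sketch below shows one plausible decomposition of the comment-classification task; the step wording, category list, and ask_llm stub are illustrative assumptions, not the authors' actual prompts.

```python
# Least-to-most (easiest-first) chain-of-thought decomposition, sketched.
RISK_CATEGORIES = [
    "dietary habits", "insufficient water intake", "genetic factors",
    "lifestyle", "medication/supplement use", "underlying disease",
]

def least_to_most_prompts(comment: str) -> list[str]:
    """Subquestions ordered from easiest to hardest."""
    return [
        f"Comment: {comment!r}\nStep 1: Does this comment describe a personal "
        "experience with kidney stones? Answer yes or no.",
        "Step 2: If yes, list any behaviors or conditions the commenter links "
        "to their stones.",
        "Step 3: Map each behavior or condition to one of these categories, or "
        f"flag it as a new potential risk factor: {RISK_CATEGORIES}.",
    ]

def ask_llm(prompt: str, history: list[str]) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

def classify(comment: str) -> list[str]:
    history: list[str] = []
    for prompt in least_to_most_prompts(comment):
        history.append(ask_llm(prompt, history))  # each step sees prior answers
    return history

if __name__ == "__main__":
    for p in least_to_most_prompts("I got stones after megadosing vitamin C"):
        print(p, "\n")
```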
Collapse
Affiliation(s)
- Chao Mao
- MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao, Macao
| | - Jiaxuan Li
- MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao, Macao
| | - Patrick Cheong-Iao Pang
- MPU-UC Joint Research Laboratory in Advanced Technologies for Smart Cities, Faculty of Applied Sciences, Macao Polytechnic University, Macao, Macao
| | - Quanjing Zhu
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
| | - Rong Chen
- Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-Sen University, Guangzhou, China
| |
Collapse
|
25
|
Ma J, Yu J, Xie A, Huang T, Liu W, Ma M, Tao Y, Zang F, Zheng Q, Zhu W, Chen Y, Ning M, Zhu Y. Large language model evaluation in autoimmune disease clinical questions comparing ChatGPT 4o, Claude 3.5 Sonnet and Gemini 1.5 pro. Sci Rep 2025; 15:17635. [PMID: 40399509 PMCID: PMC12095533 DOI: 10.1038/s41598-025-02601-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2024] [Accepted: 05/14/2025] [Indexed: 05/23/2025] Open
Abstract
Large language models (LLMs) have established a presence in providing medical services to patients and supporting clinical practice for doctors. To explore the ability of LLMs in answering clinical questions related to autoimmune diseases, this study was designed with 65 questions related to autoimmune diseases, covering five domains: concepts, report interpretation, diagnosis, prevention and treatment, and prognosis. Types of diseases include Sjögren's syndrome, systemic lupus erythematosus, rheumatoid arthritis, systemic sclerosis, and others. These questions were answered by three LLMs: ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The responses were then evaluated by 8 clinicians based on criteria including relevance, completeness, accuracy, safety, readability, and simplicity. We analyzed the scores of the three LLMs across five domains and six dimensions and compared their accuracy in answering the report interpretation section with that of two senior doctors and two junior doctors. The results showed that the performance of the three LLMs in the evaluation of autoimmune diseases significantly surpassed that of both junior and senior doctors. Notably, Claude 3.5 Sonnet excelled in providing comprehensive and accurate responses to clinical questions on autoimmune diseases, demonstrating the great potential of LLMs in assisting doctors with the diagnosis, treatment, and management of autoimmune diseases.
Collapse
Affiliation(s)
- Juntao Ma
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Jie Yu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Anran Xie
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Taihong Huang
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Wenjing Liu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Mengyin Ma
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Yue Tao
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Fuyu Zang
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Qisi Zheng
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Wenbo Zhu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Yuxin Chen
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
| | - Mingzhe Ning
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
- Yizheng Hospital of Nanjing Drum Tower Hospital Group, Yizheng 211900, Yangzhou, Jiangsu, China.
| | - Yijia Zhu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
| |
Collapse
|
26
|
Alter IL, Dias C, Briano J, Rameau A. Digital health technologies in swallowing care from screening to rehabilitation: A narrative review. Auris Nasus Larynx 2025; 52:319-326. [PMID: 40403345 DOI: 10.1016/j.anl.2025.05.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2025] [Revised: 05/14/2025] [Accepted: 05/16/2025] [Indexed: 05/24/2025]
Abstract
OBJECTIVES Digital health technologies (DHTs) have rapidly advanced in the past two decades, through developments in mobile and wearable devices and most recently with the explosion of artificial intelligence (AI) capabilities and subsequent extension into the health space. DHT has myriad potential applications to deglutology, many of which have undergone promising investigations and developments in recent years. We present the first literature review on applications of DHT in swallowing health, from screening to therapeutics. Public health interventions for swallowing care are increasingly needed in the setting of aging populations in the West and East Asia, and DHT may offer a scalable and low-cost solution. METHODS A narrative review was performed using PubMed and Google Scholar to identify recent research on applications of AI and digital health in swallow practice. Database searches, conducted in September 2024, included terms such as "digital," "AI," "machine learning," "tools" in combination with "deglutition," "Otolaryngology," "Head and Neck," "speech language pathology," "swallow," and "dysphagia." Primary literature pertaining to digital health in deglutology was included for review. RESULTS We review the various applications of DHT in swallowing care, including prevention, screening, diagnosis, treatment planning and rehabilitation. CONCLUSION DHT may offer innovative and scalable solutions for swallowing care as public health needs grow and in the setting of limited specialized healthcare workforce. These technological advances are also being explored as time and resource saving solutions at many points of care in swallow practice. DHT could bring affordable and accurate information for self-management of dysphagia to broader patient populations that otherwise lack access to expert providers.
Collapse
Affiliation(s)
- Isaac L Alter
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA
| | - Carla Dias
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA
| | - Jack Briano
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA
| | - Anaïs Rameau
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medical College, 240 E 59 St, NY, NY 10022, USA.
| |
Collapse
|
27
|
Bai X, Wang S, Zhao Y, Feng M, Ma W, Liu X. Application of AI Chatbot in Responding to Asynchronous Text-Based Messages From Patients With Cancer: Comparative Study. J Med Internet Res 2025; 27:e67462. [PMID: 40397947 DOI: 10.2196/67462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2024] [Revised: 12/22/2024] [Accepted: 04/14/2025] [Indexed: 05/23/2025] Open
Abstract
BACKGROUND Telemedicine, which incorporates artificial intelligence such as chatbots, offers significant potential for enhancing health care delivery. However, the efficacy of artificial intelligence chatbots compared to human physicians in clinical settings remains underexplored, particularly in complex scenarios involving patients with cancer and asynchronous text-based interactions. OBJECTIVE This study aimed to evaluate the performance of the GPT-4 (OpenAI) chatbot in responding to asynchronous text-based medical messages from patients with cancer by comparing its responses with those of physicians across two clinical scenarios: patient education and medical decision-making. METHODS We collected 4257 deidentified asynchronous text-based medical consultation records from 17 oncologists across China between January 1, 2020, and March 31, 2024. Each record included patient questions, demographic data, and disease-related details. The records were categorized into two scenarios: patient education (eg, symptom explanations and test interpretations) and medical decision-making (eg, treatment planning). The GPT-4 chatbot was used to simulate physician responses to these records, with each session conducted in a new conversation to avoid cross-session interference. The chatbot responses, along with the original physician responses, were evaluated by a medical review panel (3 oncologists) and a patient panel (20 patients with cancer). The medical panel assessed completeness, accuracy, and safety using a 3-level scale, whereas the patient panel rated completeness, trustworthiness, and empathy on a 5-point ordinal scale. Statistical analyses included chi-square tests for categorical variables and Wilcoxon signed-rank tests for ordinal ratings. RESULTS In the patient education scenario (n=2364), the chatbot scored higher than physicians in completeness (n=2301, 97.34% vs n=2213, 93.61% for fully complete responses; P=.002), with no significant differences in accuracy or safety (P>.05). In the medical decision-making scenario (n=1893), the chatbot exhibited lower accuracy (n=1834, 96.88% vs n=1855, 97.99% for fully accurate responses; P<.001) and trustworthiness (n=860, 50.71% vs n=1766, 93.29% rated as "Moderately trustworthy" or higher; P<.001) compared with physicians. Regarding empathy, the medical review panel rated the chatbot as demonstrating higher empathy scores across both scenarios, whereas the patient review panel reached the opposite conclusion, consistently favoring physicians in empathetic communication. Errors in chatbot responses were primarily due to misinterpretations of medical terminology or the lack of updated guidelines, with 3.12% (59/1893) of its responses potentially leading to adverse outcomes, compared with 2.01% (38/1893) for physicians. CONCLUSIONS The GPT-4 chatbot performs comparably to physicians in patient education by providing comprehensive and empathetic responses. However, its reliability in medical decision-making remains limited, particularly in complex scenarios requiring nuanced clinical judgment. These findings underscore the chatbot's potential as a supplementary tool in telemedicine while highlighting the need for physician oversight to ensure patient safety and accuracy.
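For readers unfamiliar with the paired test used in this study, the snippet below applies the Wilcoxon signed-rank test to paired ordinal ratings of chatbot versus physician responses to the same questions. The scores are invented for illustration, not study data.

```python
# Paired comparison of ordinal (1-5) ratings with the Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

chatbot_scores   = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
physician_scores = [4, 4, 3, 5, 3, 3, 4, 4, 3, 4]

# Ordinal data produce many zero differences; zero_method="pratt" keeps them
# in the ranking instead of silently discarding those pairs.
stat, p = wilcoxon(chatbot_scores, physician_scores, zero_method="pratt")
print(f"W = {stat:.1f}, P = {p:.3f}")
```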
Collapse
Affiliation(s)
- Xuexue Bai
- Department of Neurosurgery, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Shiyong Wang
- Department of Neurosurgery, First Affiliated Hospital of Jinan University, Guangzhou, China
| | - Yuanli Zhao
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Ming Feng
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Wenbin Ma
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Xiaomin Liu
- Head and Neck Neuro-Oncology Center, Tianjin Huanhu Hospital, Tianjin, China
| |
Collapse
|
28
|
Andras D, Ilies RA, Esanu V, Agoston S, Marginean Jumate TF, Dindelegan GC. Artificial Intelligence as a Potential Tool for Predicting Surgical Margin Status in Early Breast Cancer Using Mammographic Specimen Images. Diagnostics (Basel) 2025; 15:1276. [PMID: 40428269 DOI: 10.3390/diagnostics15101276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2025] [Revised: 05/10/2025] [Accepted: 05/13/2025] [Indexed: 05/29/2025] Open
Abstract
Background/Objectives: Breast cancer is the most common malignancy among women globally, with an increasing incidence, particularly in younger populations. Achieving complete surgical excision is essential to reduce recurrence. Artificial intelligence (AI), including large language models like ChatGPT, has potential for supporting diagnostic tasks, though its role in surgical oncology remains limited. Methods: This retrospective study evaluated ChatGPT's performance (ChatGPT-4, OpenAI, March 2025) in predicting surgical margin status (R0 or R1) based on intraoperative mammograms of lumpectomy specimens. AI-generated responses were compared with histopathological findings. Performance was evaluated using sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), F1 score, and Cohen's kappa coefficient. Results: In a total of 100 patients, ChatGPT achieved an accuracy of 84.0% in predicting surgical margin status. Sensitivity for identifying R1 cases (incomplete excision) was 60.0%, while specificity for R0 (complete excision) was 86.7%. The PPV was 33.3% and the NPV was 95.1%. The F1 score for R1 classification was 0.43, and Cohen's kappa coefficient was 0.34, indicating moderate agreement with histopathological findings. Conclusions: ChatGPT demonstrated moderate accuracy in confirming complete excision but showed limited reliability in identifying incomplete margins. While promising, these findings emphasize the need for domain-specific training and further validation before such models can be implemented in clinical breast cancer workflows.
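The reported metrics are mutually consistent with a single 2x2 confusion matrix (TP=6, FP=12, FN=4, TN=78, taking R1 as the positive class). The paper does not print this table; the counts below are reconstructed from the published metrics purely to show how each statistic is derived.

```python
# Reconstructed 2x2 table (R1 = positive class) and the derived metrics.
tp, fp, fn, tn = 6, 12, 4, 78
n = tp + fp + fn + tn  # 100 patients

sensitivity = tp / (tp + fn)               # 0.600 -> 60.0%
specificity = tn / (tn + fp)               # 0.867 -> 86.7%
ppv         = tp / (tp + fp)               # 0.333 -> 33.3%
npv         = tn / (tn + fn)               # 0.951 -> 95.1%
accuracy    = (tp + tn) / n                # 0.840 -> 84.0%
f1          = 2 * tp / (2 * tp + fp + fn)  # 0.429 -> ~0.43

# Cohen's kappa: observed agreement corrected for chance agreement.
p_o = accuracy
p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (p_o - p_e) / (1 - p_e)            # 0.344 -> ~0.34

print(f"sens={sensitivity:.3f} spec={specificity:.3f} ppv={ppv:.3f} "
      f"npv={npv:.3f} acc={accuracy:.3f} f1={f1:.2f} kappa={kappa:.2f}")
```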
Collapse
Affiliation(s)
- David Andras
- Department of General Surgery, Iuliu Hatieganu University of Medicine and Pharmacy, 400006 Cluj-Napoca, Romania
- First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
| | - Radu Alexandru Ilies
- Faculty of Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 400012 Cluj-Napoca, Romania
| | - Victor Esanu
- First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
| | - Stefan Agoston
- Faculty of Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 400012 Cluj-Napoca, Romania
| | | | - George Calin Dindelegan
- Department of General Surgery, Iuliu Hatieganu University of Medicine and Pharmacy, 400006 Cluj-Napoca, Romania
- First Surgical Unit, Emergency County Hospital Cluj, 400006 Cluj-Napoca, Romania
| |
Collapse
|
29
|
Shashikumar SP, Mohammadi S, Krishnamoorthy R, Patel A, Wardi G, Ahn JC, Singh K, Aronoff-Spencer E, Nemati S. Development and prospective implementation of a large language model based system for early sepsis prediction. NPJ Digit Med 2025; 8:290. [PMID: 40379845 PMCID: PMC12084535 DOI: 10.1038/s41746-025-01689-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2024] [Accepted: 04/27/2025] [Indexed: 05/19/2025] Open
Abstract
Sepsis is a dysregulated host response to infection with high mortality and morbidity. Early detection and intervention have been shown to improve patient outcomes, but existing computational models relying on structured electronic health record data often miss contextual information from unstructured clinical notes. This study introduces COMPOSER-LLM, an open-source large language model (LLM) integrated with the COMPOSER model to enhance early sepsis prediction. For high-uncertainty predictions, the LLM extracts additional context to assess sepsis-mimics, improving accuracy. Evaluated on 2500 patient encounters, COMPOSER-LLM achieved a sensitivity of 72.1%, positive predictive value of 52.9%, F-1 score of 61.0%, and 0.0087 false alarms per patient hour, outperforming the standalone COMPOSER model. Prospective validation yielded similar results. Manual chart review found 62% of false positives had bacterial infections, demonstrating potential clinical utility. Our findings suggest that integrating LLMs with traditional models can enhance predictive performance by leveraging unstructured data, representing a significant advance in healthcare analytics.
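The uncertainty-gated design described here fits in a few lines: the structured-data model scores every encounter, and only borderline scores trigger an LLM pass over the notes to screen for sepsis-mimics before an alarm fires. The thresholds and function bodies below are illustrative assumptions, not the published COMPOSER-LLM implementation.

```python
# Sketch of an uncertainty-gated hybrid: structured model + LLM fallback.
def composer_score(structured_features: dict) -> float:
    """Stand-in for the COMPOSER model's sepsis risk score in [0, 1]."""
    raise NotImplementedError

def llm_mimic_check(clinical_notes: str) -> bool:
    """Stand-in for the LLM: True if the notes suggest a sepsis-mimic."""
    raise NotImplementedError

def composer_llm(structured_features: dict, clinical_notes: str,
                 alert_at: float = 0.6, uncertain_band: float = 0.15) -> bool:
    risk = composer_score(structured_features)
    if abs(risk - alert_at) <= uncertain_band:
        # High-uncertainty zone: consult the unstructured notes before alerting.
        if llm_mimic_check(clinical_notes):
            return False  # suppress the alarm; presentation resembles a mimic
    return risk >= alert_at
```

The point of the gate is cost and precision: the LLM runs only on the small fraction of encounters where the structured model is least certain, which is where the abstract reports the added context paying off in fewer false alarms.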
Collapse
Affiliation(s)
| | - Sina Mohammadi
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
| | | | - Avi Patel
- Department of Emergency Medicine, UC San Diego, San Diego, CA, USA
| | - Gabriel Wardi
- Department of Emergency Medicine, UC San Diego, San Diego, CA, USA
- Division of Pulmonary, Critical Care and Sleep Medicine, UC San Diego, San Diego, CA, USA
| | - Joseph C Ahn
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
- Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA
| | - Karandeep Singh
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA
- Jacobs Center for Health Innovation, UC San Diego Health, San Diego, CA, USA
| | - Eliah Aronoff-Spencer
- Division of Infectious Diseases and Global Public Health, UC San Diego, San Diego, CA, USA
| | - Shamim Nemati
- Division of Biomedical Informatics, UC San Diego, San Diego, CA, USA.
| |
Collapse
|
30
|
Kanani MM, Monawer A, Brown L, King WE, Miller ZD, Venugopal N, Heagerty PJ, Jarvik JG, Cohen T, Cross NM. High-Performance Prompting for LLM Extraction of Compression Fracture Findings from Radiology Reports. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2025:10.1007/s10278-025-01530-6. [PMID: 40379860 DOI: 10.1007/s10278-025-01530-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/17/2025] [Revised: 04/20/2025] [Accepted: 04/28/2025] [Indexed: 05/19/2025]
Abstract
Extracting information from radiology reports can provide critical data to empower many radiology workflows. For spinal compression fractures, these data can facilitate evidence-based care for at-risk populations. Manual extraction from free-text reports is laborious and error-prone. Large language models (LLMs) have shown promise; however, fine-tuning strategies to optimize performance on specific tasks can be resource intensive. A variety of prompting strategies have achieved similar results with fewer demands. Our study pioneers the use of Meta's Llama 3.1, together with prompt-based strategies, for automated extraction of compression fractures from free-text radiology reports, outputting structured data without model training. We tested performance on a time-based sample of CT exams covering the spine from 2/20/2024 to 2/22/2024 acquired across our healthcare enterprise (637 anonymized reports; age 18-102; 47% female). Ground truth annotations were manually generated and compared against the performance of three models (Llama 3.1 70B, Llama 3.1 8B, and Vicuna 13B) with nine different prompting configurations, for a total of 27 model/prompt experiments. The highest F1 score (0.91) was achieved by the 70B Llama 3.1 model when provided with a radiologist-written background, with similar results when the background was written by a separate LLM (0.86). The addition of few-shot examples to these prompts had variable impact on F1 measurements (0.89 and 0.84, respectively). Comparable ROC-AUC and PR-AUC performance was observed. Our work demonstrates that an open-weights LLM can excel at extracting compression fracture findings from free-text radiology reports using prompt-based techniques, without requiring extensive manually labeled examples for model training.
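A prompt-based extraction pipeline of this kind has two moving parts: a prompt that prepends background context to the report and requests structured output, and a parser that tolerates stray prose around the model's JSON. The background text and output schema below are illustrative assumptions, not the authors' prompts.

```python
# Sketch of a background-plus-report extraction prompt with JSON output.
import json

BACKGROUND = (
    "Background: Vertebral compression fractures are height losses of a "
    "vertebral body, often described by level and severity; findings may be "
    "stated explicitly or implied."
)

def build_prompt(report_text: str) -> str:
    return (
        f"{BACKGROUND}\n\n"
        f"Radiology report:\n{report_text}\n\n"
        'Respond with JSON only: {"compression_fracture": true|false, '
        '"levels": ["T12", ...]}'
    )

def parse_response(raw: str) -> dict:
    """Tolerate leading/trailing prose around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    return json.loads(raw[start:end])

if __name__ == "__main__":
    print(build_prompt("Mild anterior wedging of L1 vertebral body."))
    print(parse_response('Sure! {"compression_fracture": true, "levels": ["L1"]}'))
```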
Collapse
Affiliation(s)
| | - Arezu Monawer
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Lauryn Brown
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - William E King
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Zachary D Miller
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Nitin Venugopal
- Department of Radiology, University of Washington, Seattle, WA, USA
| | | | - Jeffrey G Jarvik
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Trevor Cohen
- Department of Biomedical Informatics, University of Washington, Seattle, WA, USA
| | - Nathan M Cross
- Department of Radiology, University of Washington, Seattle, WA, USA
| |
Collapse
|
31
|
Omar M, Agbareia R, Glicksberg BS, Nadkarni GN, Klang E. Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study. JMIR Med Inform 2025; 13:e66917. [PMID: 40378406 PMCID: PMC12101789 DOI: 10.2196/66917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2024] [Revised: 01/31/2025] [Accepted: 01/31/2025] [Indexed: 05/18/2025] Open
Abstract
Background The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored. Objective This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses. Methods We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests. Results The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model-GPT-4o-had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model-Qwen2-7B-showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003). Conclusions Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.
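The study's two headline analyses are straightforward to reproduce on any model's outputs: correlate per-model accuracy with mean confidence on correct answers across models, and compare confidence on correct versus incorrect answers within a model. The numbers below are invented for illustration, not the study's measurements.

```python
# Confidence-calibration checks sketched on invented data.
from scipy.stats import pearsonr, ttest_ind

# (1) Across models: one (accuracy, mean confidence on correct answers) pair each.
accuracy        = [0.74, 0.69, 0.61, 0.55, 0.46]
mean_confidence = [0.63, 0.66, 0.71, 0.74, 0.76]
r, p = pearsonr(accuracy, mean_confidence)
print(f"r = {r:.2f}, P = {p:.3f}")  # negative r: weaker models are more confident

# (2) Within one model: confidence gap between correct and incorrect answers,
#     tested with a 2-sample, 2-tailed t test as in the study.
conf_correct   = [0.82, 0.75, 0.70, 0.88, 0.79]
conf_incorrect = [0.77, 0.74, 0.69, 0.80, 0.78]
t, p2 = ttest_ind(conf_correct, conf_incorrect)
gap = sum(conf_correct) / 5 - sum(conf_incorrect) / 5
print(f"mean gap = {gap:.3f}, P = {p2:.3f}")
```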
Collapse
Affiliation(s)
- Mahmud Omar
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| | - Reem Agbareia
- Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
| | - Benjamin S Glicksberg
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| | - Girish N Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| | - Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, United States
| |
Collapse
|
32
|
Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, Bensahla A, Bjelogrlic M, Lovis C. Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. J Med Internet Res 2025; 27:e68998. [PMID: 40371947 DOI: 10.2196/68998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2024] [Revised: 02/21/2025] [Accepted: 03/12/2025] [Indexed: 05/16/2025] Open
Abstract
BACKGROUND Information overload in electronic health records requires effective solutions to alleviate clinicians' administrative tasks. Automatically summarizing clinical text has gained significant attention with the rise of large language models. While individual studies show optimism, a structured overview of the research landscape is lacking. OBJECTIVE This study aims to present the current state of the art on clinical text summarization using large language models, evaluate the level of evidence in existing research and assess the applicability of performance findings in clinical settings. METHODS This scoping review complied with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Literature published between January 1, 2019, and June 18, 2024, was identified from 5 databases: PubMed, Embase, Web of Science, IEEE Xplore, and ACM Digital Library. Studies were excluded if they did not describe transformer-based models, did not focus on clinical text summarization, did not engage with free-text data, were not original research, were nonretrievable, were not peer-reviewed, or were not in English, French, Spanish, or German. Data related to study context and characteristics, scope of research, and evaluation methodologies were systematically collected and analyzed by 3 authors independently. RESULTS A total of 30 original studies were included in the analysis. All used observational retrospective designs, mainly using real patient data (n=28, 93%). The research landscape demonstrated a narrow research focus, often centered on summarizing radiology reports (n=17, 57%), primarily involving data from the intensive care unit (n=15, 50%) of US-based institutions (n=19, 73%), in English (n=26, 87%). This focus aligned with the frequent reliance on the open-source Medical Information Mart for Intensive Care dataset (n=15, 50%). Summarization methodologies predominantly involved abstractive approaches (n=17, 57%) on single-document inputs (n=4, 13%) with unstructured data (n=13, 43%), yet reporting on methodological details remained inconsistent across studies. Model selection involved both open-source models (n=26, 87%) and proprietary models (n=7, 23%). Evaluation frameworks were highly heterogeneous. All studies conducted internal validation, but external validation (n=2, 7%), failure analysis (n=6, 20%), and patient safety risks analysis (n=1, 3%) were infrequent, and none reported bias assessment. Most studies used both automated metrics and human evaluation (n=16, 53%), while 10 (33%) used only automated metrics, and 4 (13%) only human evaluation. CONCLUSIONS Key barriers hinder the translation of current research into trustworthy, clinically valid applications. Current research remains exploratory and limited in scope, with many applications yet to be explored. Performance assessments often lack reliability, and clinical impact evaluations are insufficient, raising concerns about model utility, safety, fairness, and data privacy. Advancing the field requires more robust evaluation frameworks, a broader research scope, and a stronger focus on real-world applicability.
Collapse
Affiliation(s)
- Lydie Bednarczyk
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
| | - Daniel Reichenpfader
- Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | | | - Amon Kenna Ette
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Jamil Zaghir
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Yuanyuan Zheng
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Adel Bensahla
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Mina Bjelogrlic
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | - Christian Lovis
- Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
| |
Collapse
|
33
|
Wang C, Wang F, Li S, Ren QW, Tan X, Fu Y, Liu D, Qian G, Cao Y, Yin R, Li K. Patient Triage and Guidance in Emergency Departments Using Large Language Models: Multimetric Study. J Med Internet Res 2025; 27:e71613. [PMID: 40374171 DOI: 10.2196/71613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2025] [Revised: 03/13/2025] [Accepted: 05/01/2025] [Indexed: 05/17/2025] Open
Abstract
BACKGROUND Emergency departments (EDs) face significant challenges due to overcrowding, prolonged waiting times, and staff shortages, leading to increased strain on health care systems. Efficient triage systems and accurate departmental guidance are critical for alleviating these pressures. Recent advancements in large language models (LLMs), such as ChatGPT, offer potential solutions for improving patient triage and outpatient department selection in emergency settings. OBJECTIVE The study aimed to assess the accuracy, consistency, and feasibility of GPT-4-based ChatGPT models (GPT-4o and GPT-4-Turbo) for patient triage using the Modified Early Warning Score (MEWS) and evaluate GPT-4o's ability to provide accurate outpatient department guidance based on simulated patient scenarios. METHODS A 2-phase experimental study was conducted. In the first phase, 2 ChatGPT models (GPT-4o and GPT-4-Turbo) were evaluated for MEWS-based patient triage accuracy using 1854 simulated patient scenarios. Accuracy and consistency were assessed before and after prompt engineering. In the second phase, GPT-4o was tested for outpatient department selection accuracy using 264 scenarios sourced from the Chinese Medical Case Repository. Each scenario was independently evaluated by GPT-4o thrice. Data analyses included Wilcoxon tests, Kendall correlation coefficients, and logistic regression analyses. RESULTS In the first phase, ChatGPT's triage accuracy, based on MEWS, improved following prompt engineering. Interestingly, GPT-4-Turbo outperformed GPT-4o. GPT-4-Turbo achieved an accuracy of 100% compared to GPT-4o's accuracy of 96.2%, despite GPT-4o initially showing better performance prior to prompt engineering. This finding suggests that GPT-4-Turbo may be more adaptable to prompt optimization. In the second phase, GPT-4o, with superior performance on emotional responsiveness compared to GPT-4-Turbo, demonstrated an overall guidance accuracy of 92.63% (95% CI 90.34%-94.93%), with the highest accuracy in internal medicine (93.51%, 95% CI 90.85%-96.17%) and the lowest in general surgery (91.46%, 95% CI 86.50%-96.43%). CONCLUSIONS ChatGPT demonstrated promising capability for supporting patient triage and outpatient guidance in EDs. GPT-4-Turbo showed greater adaptability to prompt engineering, whereas GPT-4o exhibited superior responsiveness and emotional interaction, which are essential for patient-facing tasks. Future studies should explore real-world implementation and address the identified limitations to enhance ChatGPT's clinical integration.
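For reference, the snippet below implements one commonly published MEWS scheme of the kind the models were asked to apply: points for systolic blood pressure, heart rate, respiratory rate, temperature, and AVPU level, summed into a single score. Threshold bands vary between institutions, so these values are illustrative rather than the study's exact rubric.

```python
# One commonly published MEWS band set; institutional variants differ.
def mews(sys_bp: float, heart_rate: float, resp_rate: float,
         temp_c: float, avpu: str) -> int:
    score = 0
    # Systolic blood pressure (mmHg)
    if sys_bp <= 70: score += 3
    elif sys_bp <= 80: score += 2
    elif sys_bp <= 100: score += 1
    elif sys_bp >= 200: score += 2   # 101-199 scores 0
    # Heart rate (beats/min)
    if heart_rate < 40: score += 2
    elif heart_rate <= 50: score += 1
    elif heart_rate <= 100: pass
    elif heart_rate <= 110: score += 1
    elif heart_rate < 130: score += 2
    else: score += 3
    # Respiratory rate (breaths/min)
    if resp_rate < 9: score += 2
    elif resp_rate <= 14: pass
    elif resp_rate <= 20: score += 1
    elif resp_rate <= 29: score += 2
    else: score += 3
    # Temperature (degrees C)
    if temp_c < 35.0 or temp_c >= 38.5: score += 2
    # Neurological response (AVPU)
    score += {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}[avpu]
    return score

if __name__ == "__main__":
    # 85 mmHg (+1), 115 bpm (+2), 24/min (+2), 38.7 C (+2), responds to voice (+1)
    print(mews(sys_bp=85, heart_rate=115, resp_rate=24, temp_c=38.7,
               avpu="voice"))  # 8
```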
Collapse
Affiliation(s)
- Chenxu Wang
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Industrial Engineering, Sichuan University, Chengdu, China
| | - Fei Wang
- Department of Nursing, West China School of Medicine, Sichuan University, Chengdu, China
| | - Shuhan Li
- Department of Industrial Engineering, Sichuan University, Chengdu, China
| | - Qing-Wen Ren
- Department of Medicine, Queen Mary Hospital, University of Hong Kong, Hong Kong, China
| | - Xiaomei Tan
- Department of Industrial Engineering, Sichuan University, Chengdu, China
| | - Yaoyu Fu
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
| | - Di Liu
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Industrial Engineering, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| | - Guangwu Qian
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Computer Science, Sichuan University, Chengdu, China
| | - Yu Cao
- Department of Emergency Medicine, West China Hospital of Sichuan University, Chengdu, China
| | - Rong Yin
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Industrial Engineering, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| | - Kang Li
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| |
Collapse
|
34
|
Jiao C, Rosas E, Asadigandomani H, Delsoz M, Madadi Y, Raja H, Munir WM, Tamm B, Mehravaran S, Djalilian AR, Yousefi S, Soleimani M. Diagnostic Performance of Publicly Available Large Language Models in Corneal Diseases: A Comparison with Human Specialists. Diagnostics (Basel) 2025; 15:1221. [PMID: 40428214 DOI: 10.3390/diagnostics15101221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2025] [Revised: 05/10/2025] [Accepted: 05/10/2025] [Indexed: 05/29/2025] Open
Abstract
Background/Objectives: This study evaluated the diagnostic accuracy of seven publicly available large language models (LLMs)-GPT-3.5, GPT-4.o Mini, GPT-4.o, Gemini 1.5 Flash, Claude 3.5 Sonnet, Grok3, and DeepSeek R1-in diagnosing corneal diseases, comparing their performance to human specialists. Methods: Twenty corneal disease cases from the University of Iowa's EyeRounds were presented to each LLM. Diagnostic accuracy was determined by comparing LLM-generated diagnoses to the confirmed case diagnoses. Four human cornea specialists evaluated the same cases to establish a benchmark and assess interobserver agreement. Results: Diagnostic accuracy varied significantly among LLMs (p = 0.001). GPT-4.o achieved the highest accuracy (80.0%), followed by Claude 3.5 Sonnet and Grok3 (70.0%), DeepSeek R1 (65.0%), GPT-3.5 (60.0%), GPT-4.o Mini (55.0%), and Gemini 1.5 Flash (30.0%). Human experts averaged 92.5% accuracy, outperforming all LLMs (p < 0.001, Cohen's d = -1.314). GPT-4.o showed no significant difference from human consensus (p = 0.250, κ = 0.348), while Claude and Grok3 showed fair agreement (κ = 0.219). DeepSeek R1 also performed reasonably (κ = 0.178), although not significantly. Conclusions: Among the evaluated LLMs, GPT-4.o, Claude 3.5 Sonnet, Grok3, and DeepSeek R1 demonstrated promising diagnostic accuracy, with GPT-4.o most closely matching human performance. However, performance remained inconsistent, especially in complex cases. LLMs may offer value as diagnostic support tools, but human expertise remains indispensable for clinical decision-making.
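Cohen's kappa, the agreement statistic used throughout this comparison, corrects raw diagnostic agreement for the agreement expected by chance. The sketch below computes it for categorical diagnoses; the label sequences are invented stand-ins, not the 20 EyeRounds cases.

```python
# Cohen's kappa for categorical diagnostic agreement.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of the product of marginal frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

llm       = ["keratoconus", "hsv keratitis", "fuchs", "keratoconus", "ulcer"]
consensus = ["keratoconus", "hsv keratitis", "ulcer", "keratoconus", "ulcer"]
print(f"kappa = {cohens_kappa(llm, consensus):.2f}")  # 0.72 on this toy data
```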
Collapse
Affiliation(s)
- Cheng Jiao
- Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Erik Rosas
- Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Hassan Asadigandomani
- Department of Ophthalmology, University of California San Francisco, San Francisco, CA 94143, USA
| | - Mohammad Delsoz
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
| | - Yeganeh Madadi
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
| | - Hina Raja
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
| | - Wuqaas M Munir
- Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Brendan Tamm
- Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Shiva Mehravaran
- Department of Biology, School of Computer, Mathematical, and Natural Sciences, Morgan State University, Baltimore, MD 21251, USA
| | - Ali R Djalilian
- Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL 60612, USA
| | - Siamak Yousefi
- Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, TN 38103, USA
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN 38136, USA
| | - Mohammad Soleimani
- Department of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| |
Collapse
|
35
|
Chen D, Chauhan K, Parsa R, Liu ZA, Liu FF, Mak E, Eng L, Hannon BL, Croke J, Hope A, Fallah-Rad N, Wong P, Raman S. Patient perceptions of empathy in physician and artificial intelligence chatbot responses to patient questions about cancer. NPJ Digit Med 2025; 8:275. [PMID: 40360673 PMCID: PMC12075825 DOI: 10.1038/s41746-025-01671-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2024] [Accepted: 04/24/2025] [Indexed: 05/15/2025] Open
Abstract
Artificial intelligence chatbots can draft empathetic responses to cancer questions, but how patients perceive chatbot empathy remains unclear. Here, we found that people with cancer rated chatbot responses as more empathetic than physician responses. However, differences between patient and physician perceptions of empathy highlight the need for further research to tailor clinical messaging to better meet patient needs. Chatbots may be effective in generating empathetic template responses to patient questions under clinician oversight.
Collapse
Affiliation(s)
- David Chen
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
| | - Kabir Chauhan
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
| | - Rod Parsa
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
| | - Zhihui Amy Liu
- Department of Biostatistics, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
| | - Fei-Fei Liu
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Ernie Mak
- Department of Supportive Care, University Health Network, Toronto, ON, Canada
- Department of Family & Community Medicine, University of Toronto, Toronto, ON, Canada
| | - Lawson Eng
- Division of Medical Oncology and Hematology, Department of Medicine, Princess Margaret Cancer Centre/University Health Network Toronto, Toronto, ON, Canada
- Division of Medical Oncology, Department of Medicine, University of Toronto, Toronto, ON, Canada
| | - Breffni Louise Hannon
- Department of Supportive Care, University Health Network, Toronto, ON, Canada
- Department of Medicine, University of Toronto, Toronto, ON, Canada
| | - Jennifer Croke
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Andrew Hope
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Nazanin Fallah-Rad
- Division of Medical Oncology and Hematology, Department of Medicine, Princess Margaret Cancer Centre/University Health Network Toronto, Toronto, ON, Canada
| | - Phillip Wong
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Srinivas Raman
- Princess Margaret Cancer Centre, Radiation Medicine Program, Toronto, ON, Canada.
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada.
| |
Collapse
|
36
|
Shi B, Chen L, Pang S, Wang Y, Wang S, Li F, Zhao W, Guo P, Zhang L, Fan C, Zou Y, Wu X. Large Language Models and Artificial Neural Networks for Assessing 1-Year Mortality in Patients With Myocardial Infarction: Analysis From the Medical Information Mart for Intensive Care IV (MIMIC-IV) Database. J Med Internet Res 2025; 27:e67253. [PMID: 40354652 DOI: 10.2196/67253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2024] [Revised: 04/01/2025] [Accepted: 04/17/2025] [Indexed: 05/14/2025] Open
Abstract
BACKGROUND Accurate mortality risk prediction is crucial for effective cardiovascular risk management. Recent advancements in artificial intelligence (AI) have demonstrated potential in this field. Qwen-2 and Llama-3 are high-performance, open-source large language models (LLMs) available online. An artificial neural network (ANN) algorithm derived from the SWEDEHEART (Swedish Web System for Enhancement and Development of Evidence-Based Care in Heart Disease Evaluated According to Recommended Therapies) registry, termed SWEDEHEART-AI, can predict patient prognosis following acute myocardial infarction (AMI). OBJECTIVE This study aims to evaluate the 3 models mentioned above in predicting 1-year all-cause mortality in critically ill patients with AMI. METHODS The Medical Information Mart for Intensive Care IV (MIMIC-IV) database is a publicly available data set in critical care medicine. We included 2758 patients who were first admitted for AMI and discharged alive. SWEDEHEART-AI calculated the mortality rate based on each patient's 21 clinical variables. Qwen-2 and Llama-3 analyzed the content of patients' discharge records and directly provided a value between 0 and 1 (to 1 decimal place) representing the 1-year death risk probability. The patients' actual mortality was verified using follow-up data. The predictive performance of the 3 models was assessed and compared using the Harrell C-statistic (C-index), the area under the receiver operating characteristic curve (AUROC), calibration plots, Kaplan-Meier curves, and decision curve analysis. RESULTS SWEDEHEART-AI demonstrated strong discrimination in predicting 1-year all-cause mortality in patients with AMI, with a higher C-index than Qwen-2 and Llama-3 (C-index 0.72, 95% CI 0.69-0.74 vs C-index 0.65, 95% CI 0.62-0.67 vs C-index 0.56, 95% CI 0.53-0.58, respectively; P<.001 for both comparisons). SWEDEHEART-AI also showed high and consistent AUROC in the time-dependent ROC curve. The death rates calculated by SWEDEHEART-AI were positively correlated with actual mortality, and the 3 risk classes derived from this model showed clear differentiation in the Kaplan-Meier curve (P<.001). Calibration plots indicated that SWEDEHEART-AI tended to overestimate mortality risk, with an observed-to-expected ratio of 0.478. Compared with the LLMs, SWEDEHEART-AI demonstrated positive and greater net benefits at risk thresholds below 19%. CONCLUSIONS SWEDEHEART-AI, a trained ANN model, demonstrated the best performance, with strong discrimination and clinical utility in predicting 1-year all-cause mortality in patients with AMI from an intensive care cohort. Among the LLMs, Qwen-2 outperformed Llama-3 and showed moderate predictive value. Qwen-2 and SWEDEHEART-AI exhibited comparable classification effectiveness. The future integration of LLMs into clinical decision support systems holds promise for accurate risk stratification in patients with AMI; however, further research is needed to optimize LLM performance and address calibration issues across diverse patient populations.
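For the discrimination metric named above, a minimal sketch of computing Harrell's C-index with the lifelines package follows; the follow-up times and predicted risks are hypothetical stand-ins, not the MIMIC-IV cohort or the models' outputs.

```python
# Minimal sketch: Harrell's C-index for a risk model on censored survival data.
import numpy as np
from lifelines.utils import concordance_index

follow_up_days = np.array([365, 120, 365, 40, 365, 300])  # time to death or censoring
died = np.array([0, 1, 0, 1, 0, 1])                # 1 = death observed within 1 year
risk = np.array([0.1, 0.8, 0.2, 0.9, 0.3, 0.6])    # model-predicted death probability

# concordance_index treats higher scores as predicting longer survival,
# so negate the predicted risk before passing it in.
c_index = concordance_index(follow_up_days, -risk, event_observed=died)
print(f"C-index: {c_index:.2f}")
```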
Collapse
Affiliation(s)
- Boqun Shi
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Liangguo Chen
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Shuo Pang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Yue Wang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Shen Wang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Fadong Li
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Wenxin Zhao
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Pengrong Guo
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Leli Zhang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Chu Fan
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Yi Zou
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Xiaofan Wu
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| |
Collapse
|
37
|
Luo Y, Jiao M, Fotedar N, Ding JE, Karakis I, Rao VR, Asmar M, Xian X, Aboud O, Wen Y, Lin JJ, Hung FM, Sun H, Rosenow F, Liu F. Clinical Value of ChatGPT for Epilepsy Presurgical Decision-Making: Systematic Evaluation of Seizure Semiology Interpretation. J Med Internet Res 2025; 27:e69173. [PMID: 40354107 DOI: 10.2196/69173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2024] [Revised: 02/03/2025] [Accepted: 03/10/2025] [Indexed: 05/14/2025] Open
Abstract
BACKGROUND For patients with drug-resistant focal epilepsy, surgical resection of the epileptogenic zone (EZ) is an effective treatment to control seizures. Accurate localization of the EZ is crucial and is typically achieved through comprehensive presurgical approaches such as seizure semiology interpretation, electroencephalography (EEG), magnetic resonance imaging (MRI), and intracranial EEG (iEEG). However, interpreting seizure semiology is challenging because it relies heavily on expert knowledge. The semiologies are often inconsistent and incoherent, leading to variability and potential limitations in presurgical evaluation. To overcome these challenges, advanced technologies such as large language models (LLMs), with ChatGPT as a notable example, offer valuable tools for analyzing complex textual information, making them well suited to interpret detailed seizure semiology descriptions and accurately localize the EZ. OBJECTIVE This study evaluates the clinical value of ChatGPT for interpreting seizure semiology to localize EZs in presurgical assessments for patients with focal epilepsy and compares its performance with that of epileptologists. METHODS We compiled 2 data cohorts: a publicly sourced cohort of 852 semiology-EZ pairs from 193 peer-reviewed journal publications and a private cohort of 184 semiology-EZ pairs collected from Far Eastern Memorial Hospital (FEMH) in Taiwan. ChatGPT was evaluated to predict the most likely EZ locations using 2 prompt methods: zero-shot prompting (ZSP) and few-shot prompting (FSP). To compare the performance of ChatGPT, 8 epileptologists were recruited to participate in an online survey to interpret 100 randomly selected semiology records. The responses from ChatGPT and the epileptologists were compared using 3 metrics: regional sensitivity (RSens), weighted sensitivity (WSens), and net positive inference rate (NPIR). RESULTS In the publicly sourced cohort, ChatGPT achieved RSens of 80% to 90% for the frontal and temporal lobes; 20% to 40% for the parietal lobe, occipital lobe, and insular cortex; and only 3% for the cingulate cortex. The WSens, which accounts for biased data distribution, consistently exceeded 67%, while the mean NPIR remained around 0. Evaluation results on the private FEMH cohort were consistent with those from the publicly sourced cohort. A group t test with 1000 bootstrap samples revealed that ChatGPT-4 significantly outperformed the epileptologists in RSens for the most frequently implicated EZs, such as the frontal and temporal lobes (P<.001). Additionally, ChatGPT-4 demonstrated superior overall performance in WSens (P<.001). However, no significant differences were observed between ChatGPT and the epileptologists in NPIR, highlighting comparable performance in this metric. CONCLUSIONS ChatGPT demonstrated clinical value as a tool to assist decision-making during epilepsy preoperative workups. With ongoing advancements in LLMs, their reliability and accuracy are anticipated to improve.
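Under one plausible reading of the metrics named above (RSens as per-region recall, WSens as a frequency-weighted average of regional recall), a minimal sketch follows; the region labels and predictions are hypothetical, and the exact definitions may differ from the paper's.

```python
# Minimal sketch: per-region sensitivity (RSens) and a frequency-weighted
# aggregate (WSens) over hypothetical ground-truth and predicted EZ regions.
from collections import Counter

truth = ["frontal", "temporal", "temporal", "parietal", "frontal", "insular"]
predicted = ["frontal", "temporal", "occipital", "parietal", "temporal", "insular"]

region_counts = Counter(truth)
hits = Counter(t for t, p in zip(truth, predicted) if t == p)

rsens = {region: hits[region] / n for region, n in region_counts.items()}
wsens = sum(rsens[r] * n for r, n in region_counts.items()) / len(truth)

print(rsens)               # per-region sensitivity (RSens)
print(f"WSens = {wsens:.2f}")
```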
Collapse
Affiliation(s)
- Yaxi Luo
- Department of Computer Science, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Meng Jiao
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Neel Fotedar
- School of Medicine, Case Western Reserve University, Cleveland, OH, United States
- Department of Neurology, University Hospitals Cleveland Medical Center, Cleveland, OH, United States
| | - Jun-En Ding
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Ioannis Karakis
- Department of Neurology, School of Medicine, Emory University, Atlanta, GA, United States
- Department of Neurology, School of Medicine, University of Crete, Heraklion, Greece
| | - Vikram R Rao
- Department of Neurology and Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
| | - Melissa Asmar
- Department of Neurology, University of California, Davis, Davis, CA, United States
| | - Xiaochen Xian
- H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Orwa Aboud
- Department of Neurology and Neurological Surgery, University of California, Davis, Davis, CA, United States
| | - Yuxin Wen
- Fowler School of Engineering, Chapman University, Orange, CA, United States
| | - Jack J Lin
- Department of Neurology, University of California, Davis, Davis, CA, United States
| | - Fang-Ming Hung
- Center of Artificial Intelligence, Far Eastern Memorial Hospital, New Taipei City, Taiwan
- Surgical Trauma Intensive Care Unit, Far Eastern Memorial Hospital, New Taipei City, Taiwan
| | - Hai Sun
- Department of Neurosurgery, Rutgers Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, United States
| | - Felix Rosenow
- Department of Neurology, Epilepsy Center Frankfurt Rhine-Main, Goethe University Frankfurt, Frankfurt am Main, Germany
| | - Feng Liu
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
- Semcer Center for Healthcare Innovation, Stevens Institute of Technology, Hoboken, NJ, United States
| |
Collapse
|
38
|
Wei W, Shao J, Lyu RQ, Hemono R, Ma X, Giorgio J, Zheng Z, Ji F, Zhang X, Katabaro E, Mlowe M, Sabasaba A, Lister C, Shabani S, Njau P, McCoy SI, Wang J. Enhanced Language Models for Predicting and Understanding HIV Care Disengagement: A Case Study in Tanzania. RESEARCH SQUARE 2025:rs.3.rs-6608559. [PMID: 40386417 PMCID: PMC12083686 DOI: 10.21203/rs.3.rs-6608559/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/26/2025]
Abstract
Summary Sustained engagement in HIV care and adherence to antiretroviral therapy (ART) are essential for achieving the UNAIDS "95-95-95" targets. Despite increased ART access, disengagement from care remains a significant issue, particularly in sub-Saharan Africa. Traditional machine learning (ML) models have shown moderate success in predicting care disengagement, which would enable early intervention. We developed an enhanced large language model (LLM) fine-tuned with electronic medical records (EMRs) to predict people at risk of disengaging from HIV care in Tanzania and to provide interpretative insights into modifiable risk factors. Methods We developed a novel AI model by enhancing a pre-trained LLM (LLaMA 3.1, an open-source LLM released by Meta) using routinely collected EMRs from Tanzania's National HIV Care and Treatment Program from January 1, 2018, to June 30, 2023 (4,809,765 records for 261,192 people) to identify people at risk of disengaging from HIV care or developing adverse outcomes. Outcomes included risk of ART non-adherence, non-suppressed viral load, and loss to follow-up. Models were evaluated internally (Kagera region) and externally (Geita region), with performance compared against state-of-the-art ML models and zero-shot LLMs. Additionally, a team of HIV physicians in Tanzania assessed the LLM's predictions, along with the LLM-provided justifications, for a subset of patient records to evaluate their clinical relevance and reasoning. Findings The enhanced LLMs consistently outperformed the supervised ML model and zero-shot LLMs across all outcomes in both internal and external validation datasets. When focusing on the 25% of people living with HIV (PLHIV) predicted as most likely to be lost to follow-up (LTFU), the model correctly identified 78% (2,515 of 3,224) of PLHIV genuinely at risk in internal validation and 73% (7,105 of 9,733) in external validation. Attention score analysis indicated that the enhanced LLM focused on keywords such as gaps in follow-up care and ART adherence. The human expert evaluation showed alignment between clinician assessments and the LLM's predictions in 65% of cases, with experts finding the model's justifications reasonable and clinically relevant in 92.3% of aligned cases. Interpretation If implemented in HIV clinics, this LLM-based AI model could help allocate resources efficiently and deliver targeted interventions, improving retention in care and advancing the UNAIDS "95-95-95" targets. By functioning like a clinician (analyzing patient summaries, predicting risks, and offering reasoning), the enhanced LLM could be integrated into clinical workflows to complement human expertise, facilitating timely interventions and informed decision-making. If implemented widely, this human-AI collaboration has the potential to improve health outcomes for people living with HIV and reduce onward transmission. Funding The study was supported by a grant from the US National Institutes of Health (NIH): NIH NIMH 1R01MH125746.
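The "top 25% predicted risk" evaluation described above can be sketched as follows: take the top quartile of model-predicted risk and measure what share of patients truly lost to follow-up it captures. All numbers below are synthetic toy data, not the study's cohort.

```python
# Minimal sketch: capture rate of true LTFU cases within the top-25% of
# model-predicted risk scores (synthetic data for illustration only).
import numpy as np

rng = np.random.default_rng(0)
risk_scores = rng.random(1000)                           # model-predicted LTFU risk
true_ltfu = rng.random(1000) < 0.2 * (1 + risk_scores)   # toy ground truth

cutoff = np.quantile(risk_scores, 0.75)                  # top-25% threshold
flagged = risk_scores >= cutoff
capture_rate = true_ltfu[flagged].sum() / true_ltfu.sum()
print(f"Share of true LTFU captured in top quartile: {capture_rate:.1%}")
```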
Collapse
Affiliation(s)
| | | | | | | | | | | | | | | | | | | | - Matilda Mlowe
- Health for a Prosperous Nation, Dar es Salaam, Tanzania
| | - Amon Sabasaba
- Health for a Prosperous Nation, Dar es Salaam, Tanzania
| | | | | | | | | | | |
Collapse
|
39
|
Deng L, Wu Y, Ren Y, Lu H. Autonomous Self-Evolving Research on Biomedical Data: The DREAM Paradigm. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025:e2417066. [PMID: 40344513 DOI: 10.1002/advs.202417066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/17/2024] [Revised: 04/12/2025] [Indexed: 05/11/2025]
Abstract
In contemporary biomedical research, the efficiency of data-driven methodologies is constrained by large data volumes, the complexity of tool selection, and limited human resources. To address these challenges, a Data-dRiven self-Evolving Autonomous systeM (DREAM) is developed as the first fully autonomous biomedical research system capable of independently conducting scientific investigations without human intervention. DREAM autonomously formulates and evolves scientific questions, configures computational environments, and performs result evaluation and validation. Unlike existing semi-autonomous systems, DREAM operates without manual intervention and is validated in real-world biomedical scenarios. It exceeds the average performance of top scientists in question generation, achieves a higher success rate in environment configuration than experienced human researchers, and uncovers novel scientific findings. In the context of the Framingham Heart Study, it demonstrated an efficiency that is over 10,000 times greater than that of average scientists. As a fully autonomous, self-evolving system, DREAM offers a robust and efficient solution for accelerating biomedical discovery and advancing other data-driven scientific disciplines.
Collapse
Affiliation(s)
- Luojia Deng
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yijie Wu
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yongyong Ren
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Hui Lu
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- SJTU-Yale Joint Center for Biostatistics and Data Science, Technical Center for Digital Medicine, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 200240, China
| |
Collapse
|
40
|
Tomita K, Nishida T, Kitaguchi Y, Kitazawa K, Miyake M. Image Recognition Performance of GPT-4V(ision) and GPT-4o in Ophthalmology: Use of Images in Clinical Questions. Clin Ophthalmol 2025; 19:1557-1564. [PMID: 40357454 PMCID: PMC12068282 DOI: 10.2147/opth.s494480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2024] [Accepted: 04/09/2025] [Indexed: 05/15/2025] Open
Abstract
Purpose To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4, GPT-4 with Vision (GPT-4V), and GPT-4o for clinical questions in ophthalmology. Patients and Methods The questions were collected from the "Diagnose This" section on the American Academy of Ophthalmology website. We tested 580 questions and presented ChatGPT with the same questions under two conditions: 1) a multimodal model, incorporating both the question text and associated images, and 2) a text-only model. We then used McNemar tests to compare accuracy among the multimodal (GPT-4o and GPT-4V) and text-only (GPT-4V) models. The percentage of correct answers among the website's general respondents was also collected. Results Multimodal GPT-4o achieved the highest accuracy (77.1%), followed by multimodal GPT-4V (71.0%) and text-only GPT-4V (68.7%) (P values <0.001, 0.012, and 0.001 for the pairwise comparisons, respectively). All GPT-4 models showed higher accuracy than the website's general respondents (64.6%). Conclusion The addition of information from images enhances the performance of GPT-4V in diagnosing clinical questions in ophthalmology. This suggests that integrating multimodal data could be crucial in developing more effective and reliable diagnostic tools in medical fields.
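Because the same 580 questions were answered under both conditions, a paired McNemar test is the natural comparison. A minimal sketch follows; the 2x2 counts are hypothetical (they sum to 580 but are not the study's data).

```python
# Minimal sketch: McNemar's test on paired correct/incorrect outcomes from
# two model configurations answering the same question set.
from statsmodels.stats.contingency_tables import mcnemar

# rows: multimodal correct / incorrect; columns: text-only correct / incorrect
table = [[370, 77],   # both correct | only multimodal correct
         [29, 104]]   # only text-only correct | both incorrect
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```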
Collapse
Affiliation(s)
- Kosei Tomita
- Department of Ophthalmology, Kawasaki Medical School, Okayama, Japan
| | - Takashi Nishida
- Hamilton Glaucoma Center, Shiley Eye Institute, Viterbi Family Department of Ophthalmology, University of California, San Diego, La Jolla, CA, USA
| | - Yoshiyuki Kitaguchi
- Department of Ophthalmology, Osaka University Graduate School of Medicine, Osaka, Japan
| | - Koji Kitazawa
- Department of Ophthalmology, Kyoto Prefectural University of Medicine, Kyoto, Japan
| | - Masahiro Miyake
- Department of Ophthalmology and Visual Sciences, Kyoto University Graduate School of Medicine, Kyoto, Japan
| |
Collapse
|
41
|
Liu C, Zhang H, Zheng Z, Liu W, Gu C, Lan Q, Zhang W, Yang J. ChatOCT: Embedded Clinical Decision Support Systems for Optical Coherence Tomography in Offline and Resource-Limited Settings. J Med Syst 2025; 49:59. [PMID: 40332685 DOI: 10.1007/s10916-025-02188-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2025] [Accepted: 04/23/2025] [Indexed: 05/08/2025]
Abstract
Optical Coherence Tomography (OCT) is a critical imaging modality for diagnosing ocular and systemic conditions, yet its accessibility is hindered by the need for specialized expertise and high computational demands. To address these challenges, we introduce ChatOCT, an offline-capable, domain-adaptive clinical decision support system (CDSS) that integrates structured expert Q&A generation, OCT-specific knowledge injection, and activation-aware model compression. Unlike existing systems, ChatOCT functions without internet access, making it suitable for low-resource environments. ChatOCT is built upon LLaMA-2-7B, incorporating domain-specific knowledge from PubMed and OCT News through a two-stage training process: (1) knowledge injection for OCT-specific expertise and (2) Q&A instruction tuning for structured, interactive diagnostic reasoning. To ensure feasibility in offline environments, we apply activation-aware weight quantization, reducing GPU memory usage to ~4.74 GB, enabling deployment on standard OCT hardware. A novel expert answer generation framework mitigates hallucinations by structuring responses in a multi-step process, ensuring accuracy and interpretability. ChatOCT outperforms state-of-the-art baselines such as LLaMA-2, PMC-LLaMA-13B, and ChatDoctor by 10-15 points in coherence, relevance, and clinical utility, while reducing GPU memory requirements by 79% and maintaining real-time responsiveness (~20 ms inference time). Expert ophthalmologists rated ChatOCT's outputs as clinically actionable and aligned with real-world decision-making needs, confirming its potential to assist frontline healthcare providers. ChatOCT is an offline clinical decision support system for OCT that runs entirely on local embedded hardware, enabling real-time analysis in resource-limited settings without internet connectivity. By offering a scalable, generalizable pipeline that integrates knowledge injection, instruction tuning, and model compression, ChatOCT provides a blueprint for next-generation, resource-efficient clinical AI solutions across multiple medical domains.
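The ~4.74 GB figure can be sanity-checked with simple arithmetic: a 7B-class model at 4 bits per weight, plus runtime overhead for activations and the KV cache. The parameter count and the overhead allowance below are rough assumptions, not values from the paper.

```python
# Back-of-the-envelope memory estimate for a 4-bit quantized 7B-class model.
params = 6.74e9            # approximate LLaMA-2-7B parameter count (assumption)
bits_per_weight = 4        # activation-aware 4-bit quantization
weight_gb = params * bits_per_weight / 8 / 1024**3
print(f"Quantized weights: ~{weight_gb:.2f} GB")                     # ~3.1 GB
print(f"With ~1.5 GB runtime overhead: ~{weight_gb + 1.5:.2f} GB")   # near the reported ~4.74 GB
```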
Collapse
Affiliation(s)
- Chang Liu
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Haoran Zhang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Zheng Zheng
- Department of Ophthalmology, Shanghai General Hospital, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
| | - Wenjia Liu
- Department of Ophthalmology, Shanghai General Hospital, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
| | - Chengfu Gu
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Qi Lan
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Weiyi Zhang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
| | - Jianlong Yang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China.
| |
Collapse
|
42
|
Wan F, Wang T, Wang K, Si Y, Fondrevelle J, Du S, Duclos A. Surgery scheduling based on large language models. Artif Intell Med 2025; 166:103151. [PMID: 40349664 DOI: 10.1016/j.artmed.2025.103151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2024] [Revised: 01/13/2025] [Accepted: 05/01/2025] [Indexed: 05/14/2025]
Abstract
Large Language Models (LLMs) have shown remarkable potential in various fields. This study explores their application to a multi-objective combinatorial optimization problem: surgery scheduling. Traditional multi-objective optimization algorithms, such as the Non-dominated Sorting Genetic Algorithm II (NSGA-II), often require domain expertise for designing precise operators. Here, we propose LLM-NSGA, where LLMs act as evolutionary optimizers, performing selection, crossover, and mutation operations. Results show that for 40 cases, LLMs can independently generate high-quality solutions from prompts. As problem size increases, LLM-NSGA outperformed traditional approaches such as NSGA-II and MOEA/D, achieving average improvements of 5.39%, 80%, and 0.42% across the three objectives. While LLM-NSGA provided results similar to EoH, another LLM-based method, it outperformed EoH in overall resource allocation. Additionally, we applied LLMs to hyperparameter optimization, comparing them with Bayesian Optimization and Ant Colony Optimization (ACO). LLMs reduced runtime by an average of 23.68%, and their generated parameters, validated with NSGA-II, produced better surgery scheduling solutions. This demonstrates that LLMs can not only help traditional algorithms find better solutions but also optimize their parameters efficiently.
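The core idea, an LLM standing in for hand-designed evolutionary operators, can be sketched as below. The llm_propose() function is a hypothetical stand-in: in LLM-NSGA the prompt would carry the parent schedules and objectives and the reply would be parsed into a child; here it is faked with a permutation-preserving crossover plus a swap mutation so the loop runs end to end.

```python
# Minimal sketch: an evolutionary loop where an (here, mocked) LLM call
# proposes offspring schedules instead of a hand-designed operator.
import random

def llm_propose(parent_a, parent_b):
    """Hypothetical LLM call. A real implementation would prompt the model
    with both parents and the objectives, then parse the reply. This mock
    does a one-point order crossover plus a swap mutation instead."""
    cut = random.randrange(1, len(parent_a))
    prefix = parent_a[:cut]
    child = prefix + [s for s in parent_b if s not in prefix]
    i, j = random.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def fitness(schedule):
    # Placeholder single objective: prefer low-index surgeries earlier.
    return -sum(pos * s for pos, s in enumerate(schedule))

population = [random.sample(range(10), 10) for _ in range(20)]  # 10 surgeries
for _ in range(50):
    a, b = random.sample(population, 2)          # parent selection
    population.append(llm_propose(a, b))         # LLM-proposed offspring
    population.sort(key=fitness, reverse=True)
    population = population[:20]                 # elitist survivor selection
print(population[0])
```

A real LLM-NSGA would of course keep NSGA-II's non-dominated sorting over all objectives rather than this single scalar fitness.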
Collapse
Affiliation(s)
- Fang Wan
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France.
| | - Tao Wang
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France
| | - Kezhi Wang
- Department of Computer Science, Brunel University London, UK
| | | | - Julien Fondrevelle
- INSA LYON, Université Lyon2, Université Claude Bernard Lyon1, Université Jean Monnet Saint-Etienne, DISP UR4570, France
| | - Shuimiao Du
- Sino-European School of Shanghai University, China
| | - Antoine Duclos
- Research on Healthcare Performance RESHAPE, Université Claude Bernard, Lyon 1, France
| |
Collapse
|
43
|
Rosenthal JT, Beecy A, Sabuncu MR. Rethinking clinical trials for medical AI with dynamic deployments of adaptive systems. NPJ Digit Med 2025; 8:252. [PMID: 40328886 PMCID: PMC12056174 DOI: 10.1038/s41746-025-01674-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2024] [Accepted: 04/24/2025] [Indexed: 05/08/2025] Open
Abstract
There is a growing recognition of the need for clinical trials to safely and effectively deploy artificial intelligence (AI) in clinical settings. We introduce dynamic deployment as a framework for AI clinical trials tailored to the dynamic nature of large language models, making possible complex medical AI systems that continuously learn and adapt in situ from new data and interactions with users while enabling continuous real-time monitoring and clinical validation.
Collapse
Affiliation(s)
- Jacob T Rosenthal
- Tri-Institutional MD-PhD program of Weill Cornell/Rockefeller/Sloan Kettering, New York, NY, USA.
- Department of Radiology, Weill Cornell Medicine, New York, NY, USA.
| | - Ashley Beecy
- Division of Cardiology, Department of Medicine, Weill Cornell Medicine and NewYork-Presbyterian, New York, NY, USA
| | - Mert R Sabuncu
- Department of Radiology, Weill Cornell Medicine, New York, NY, USA
- School of Electrical and Computer Engineering, Cornell Tech and Cornell University, New York, NY, USA
| |
Collapse
|
44
|
Othman AA, Flaharty KA, Ledgister Hanchard SE, Hu P, Duong D, Waikel RL, Solomon BD. Assessing large language model performance related to aging in genetic conditions. NPJ AGING 2025; 11:33. [PMID: 40319013 PMCID: PMC12049513 DOI: 10.1038/s41514-025-00226-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 04/22/2025] [Indexed: 05/07/2025]
Abstract
Most genetic conditions are described in pediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues, and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorized based on age-related characteristics). Results showed that LLMs provided appropriate age-based responses in both child and adult outputs based on Correctness and Completeness scores graded by clinicians. Sub-analysis of metabolic conditions, including those that typically present neonatally with metabolic crisis, also showed age-appropriate LLM responses. However, 70b and GPT obtained lower Correctness and Completeness scores when producing management plans (55-66% for 70b and a wider range, 50-90%, for GPT). This suggests that LLMs still have some limitations in clinical applications.
Collapse
Affiliation(s)
- Amna A Othman
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| | - Kendall A Flaharty
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Suzanna E Ledgister Hanchard
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Ping Hu
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Dat Duong
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Rebekah L Waikel
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Benjamin D Solomon
- Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| |
Collapse
|
45
|
Bitterman J, D'Angelo A, Holachek A, Eubanks JE. Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions. PM R 2025. [PMID: 40318209 DOI: 10.1002/pmrj.13386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2024] [Revised: 01/16/2025] [Accepted: 02/17/2025] [Indexed: 05/07/2025]
Abstract
BACKGROUND There have been significant advances in machine learning and artificial intelligence technology over the past few years, leading to the release of large language models (LLMs) such as ChatGPT. There are many potential applications for LLMs in health care, but it is critical to first determine how accurate LLMs are before putting them into practice. No studies have evaluated the accuracy and precision of LLMs in responding to questions related to the field of physical medicine and rehabilitation (PM&R). OBJECTIVE To determine the accuracy and precision of two OpenAI LLMs (GPT-3.5, released in November 2022, and GPT-4o, released in May 2024) in answering questions related to PM&R knowledge. DESIGN Cross-sectional study. Both LLMs were tested on the same 744 PM&R knowledge questions that covered all aspects of the field (general rehabilitation, stroke, traumatic brain injury, spinal cord injury, musculoskeletal medicine, pain medicine, electrodiagnostic medicine, pediatric rehabilitation, prosthetics and orthotics, rheumatology, and pharmacology). Each LLM was tested three times on the same question set to assess for precision. SETTING N/A. PATIENTS N/A. INTERVENTIONS N/A. MAIN OUTCOME MEASURE Percentage of correctly answered questions. RESULTS For three runs of the 744-question set, GPT-3.5 answered 56.3%, 56.5%, and 56.9% of the questions correctly. For three runs of the same question set, GPT-4o answered 83.6%, 84%, and 84.1% of the questions correctly. GPT-4o outperformed GPT-3.5 in all subcategories of PM&R questions. CONCLUSIONS LLM technology is rapidly advancing, with the more recent GPT-4o model performing much better on PM&R knowledge questions compared to GPT-3.5. There is potential for LLMs in augmenting clinical practice, medical training, and patient education. However, the technology has limitations and physicians should remain cautious in using it in practice at this time.
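One simple way to compare the two models' accuracy on the shared 744-question set is a two-proportion z-test; the counts below are reconstructed from the run-1 percentages above, and a paired test such as McNemar's would be stricter since both models answered the same items.

```python
# Minimal sketch: two-proportion z-test on GPT-3.5 vs. GPT-4o accuracy
# (counts reconstructed from reported percentages; illustrative only).
from statsmodels.stats.proportion import proportions_ztest

correct = [round(0.563 * 744), round(0.836 * 744)]  # GPT-3.5, GPT-4o
n = [744, 744]
z, p = proportions_ztest(correct, n)
print(f"z = {z:.2f}, p = {p:.2e}")
```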
Collapse
Affiliation(s)
- Jason Bitterman
- Division of Physical Medicine and Rehabilitation, Hartford Healthcare Medical Group, Hartford, Connecticut, USA
| | - Alexander D'Angelo
- Nebraska Medicine Department of Physical Medicine and Rehabilitation, University of Nebraska Medical Center, Omaha, Nebraska, USA
| | | | - James E Eubanks
- Department of Orthopedics and Physical Medicine, Division of Physical Medicine and Rehabilitation, Medical University of South Carolina (MUSC), Charleston, South Carolina, USA
- Department of Physical Medicine and Rehabilitation, University of Pittsburgh Medical Center (UPMC), Pittsburgh, Pennsylvania, USA
| |
Collapse
|
46
|
Noda R, Tanabe K, Ichikawa D, Shibagaki Y. GPT-4's performance in supporting physician decision-making in nephrology multiple-choice questions. Sci Rep 2025; 15:15439. [PMID: 40316716 PMCID: PMC12048615 DOI: 10.1038/s41598-025-99774-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2024] [Accepted: 04/22/2025] [Indexed: 05/04/2025] Open
Abstract
Generative Pre-trained Transformer (GPT)-4, a versatile conversational artificial intelligence, has potential applications in medicine, but its ability to support physicians' decision-making remains unclear. We evaluated GPT-4's performance in assisting physicians with nephrology questions. Forty-five single-answer multiple-choice questions were extracted from the Core Curriculum in Nephrology articles published in the American Journal of Kidney Diseases from October 2021 to June 2023. Eight junior physicians without board certification and ten senior physicians with board certification answered these questions twice: first unaided, then with the opportunity to revise their answers based on GPT-4's outputs. GPT-4 correctly answered 77.8% of the questions. Before using GPT-4, junior physicians had a median (interquartile range) proportion of correct answers of 53.3% (48.3-53.3) and senior physicians 65.6% (60.6-66.7). After GPT-4 support, the median proportion of correct answers significantly increased to 72.2% (68.3-76.1) for juniors and 75.6% (73.3-80.0) for seniors (p = 0.008 and p = 0.004, respectively). The improvement was significantly greater for junior physicians (p = 0.017). However, senior physicians showed a decreased proportion of correct answers in one of the clinical categories. GPT-4 significantly improved physicians' accuracy in nephrology, especially among less experienced physicians, but may have negative impacts in specific subfields. Careful consideration is required when using GPT-4 to support physicians' decision-making.
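The before/after comparison within each physician group is paired, so a Wilcoxon signed-rank test fits the small samples reported above. A minimal sketch with hypothetical per-physician proportions (n = 10, mirroring the senior group) follows.

```python
# Minimal sketch: Wilcoxon signed-rank test on each physician's proportion
# correct, unaided vs. with GPT-4 support (hypothetical values).
from scipy.stats import wilcoxon

unaided = [0.62, 0.60, 0.67, 0.64, 0.66, 0.69, 0.58, 0.66, 0.64, 0.67]
with_gpt4 = [0.76, 0.73, 0.78, 0.73, 0.80, 0.76, 0.71, 0.78, 0.73, 0.76]

stat, p = wilcoxon(unaided, with_gpt4)
print(f"W = {stat}, p = {p:.3f}")
```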
Collapse
Affiliation(s)
- Ryunosuke Noda
- Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-ku, Kawasaki, Kanagawa, 216-8511, Japan.
| | - Kenichiro Tanabe
- Pathophysiology and Bioregulation, St. Marianna University School of Medicine, Kawasaki, Japan
| | - Daisuke Ichikawa
- Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-ku, Kawasaki, Kanagawa, 216-8511, Japan
| | - Yugo Shibagaki
- Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-ku, Kawasaki, Kanagawa, 216-8511, Japan
| |
Collapse
|
47
|
Kim H, Hwang H, Lee J, Park S, Kim D, Lee T, Yoon C, Sohn J, Park J, Reykhart O, Fetherston T, Choi D, Kwak SH, Chen Q, Kang J. Small language models learn enhanced reasoning skills from medical textbooks. NPJ Digit Med 2025; 8:240. [PMID: 40316765 PMCID: PMC12048634 DOI: 10.1038/s41746-025-01653-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2025] [Accepted: 04/19/2025] [Indexed: 05/04/2025] Open
Abstract
Small language models (SLMs) offer promise for medical applications by addressing the privacy and hardware constraints of large language models; however, their limited parameters (often fewer than ten billion) hinder multi-step reasoning for complex medical tasks. This study presents Meerkat, a new family of medical SLMs designed to be lightweight while enhancing reasoning capabilities. We begin by designing an effective and efficient training method. This involves extracting high-quality chain-of-thought reasoning paths from 18 medical textbooks, which are then combined with diverse instruction-following datasets within the medical domain, totaling 441K training examples. Fine-tuning was conducted on open-source SLMs using this curated dataset. Our Meerkat-7B and Meerkat-8B models outperformed their counterparts by 22.3% and 10.6% across six exam datasets, respectively. They also improved scores on the NEJM Case Challenge from 7 to 16 and from 13 to 20, surpassing the human score of 13.7. Additionally, they demonstrated superiority in expert evaluations, excelling in all metrics of reasoning ability: completeness, factuality, clarity, and logical consistency.
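Chain-of-thought training examples like those described above are commonly serialized as instruction/rationale/answer records for fine-tuning. The field names and the clinical example below are assumptions for illustration, not the Meerkat schema.

```python
# Minimal sketch: serializing one chain-of-thought training example as JSONL
# for instruction fine-tuning (hypothetical schema and content).
import json

example = {
    "instruction": "A 54-year-old presents with colicky flank pain radiating "
                   "to the groin and hematuria. What is the most likely diagnosis?",
    "chain_of_thought": "Flank pain with hematuria suggests a urinary tract "
                        "stone; colicky pain radiating to the groin further "
                        "supports nephrolithiasis.",
    "answer": "Nephrolithiasis",
}

with open("cot_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```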
Collapse
Grants
- HR20C0021(3) Ministry of Health & Welfare, Republic of Korea
- IITP-2024-2020-0-0181 Ministry of Science and ICT, South Korea
- NRF-2023R1A2C3004176 National Research Foundation of Korea
Collapse
Affiliation(s)
| | | | - Jiwoo Lee
- Korea University, Seoul, Republic of Korea
| | | | - Dain Kim
- Korea University, Seoul, Republic of Korea
| | | | | | | | | | | | | | | | - Soo Heon Kwak
- Seoul National University Hospital, Seoul, Republic of Korea
| | | | - Jaewoo Kang
- Korea University, Seoul, Republic of Korea.
- AIGEN Sciences, Seoul, Republic of Korea.
| |
Collapse
|
48
|
Ngoc Nguyen O, Amin D, Bennett J, Hetlevik Ø, Malik S, Tout A, Vornhagen H, Vellinga A. GP or ChatGPT? Ability of large language models (LLMs) to support general practitioners when prescribing antibiotics. J Antimicrob Chemother 2025; 80:1324-1330. [PMID: 40079276 PMCID: PMC12046391 DOI: 10.1093/jac/dkaf077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2024] [Accepted: 02/28/2025] [Indexed: 03/15/2025] Open
Abstract
INTRODUCTION Large language models (LLMs) are becoming ubiquitous and widely implemented. LLMs could also be used for diagnosis and treatment. National antibiotic prescribing guidelines are customized and informed by local laboratory data on antimicrobial resistance. METHODS Based on 24 vignettes with information on type of infection, gender, age group and comorbidities, GPs and LLMs were prompted to provide a treatment. Four countries (Ireland, the UK, the USA, and Norway) were included; a GP from each country and six LLMs (ChatGPT, Gemini, Copilot, Mistral AI, Claude and Llama 3.1) were provided with the vignettes, including their location (country). Responses were compared with each country's national prescribing guidelines. In addition, limitations of LLMs such as hallucination, toxicity and data leakage were assessed. RESULTS GPs' answers to the vignettes showed high accuracy in relation to diagnosis (96%-100%) and yes/no antibiotic prescribing (83%-92%). GPs referenced (100%) and prescribed (58%-92%) according to national guidelines, but dose/duration of treatment was less accurate (50%-75%). Overall, the GPs' accuracy had a mean of 74%. LLMs scored high in relation to diagnosis (92%-100%), antibiotic prescribing (88%-100%) and the choice of antibiotic (59%-100%), but correct referencing often failed (38%-96%), in particular for the Norwegian guidelines (0%-13%). Data leakage was shown to be an issue, as personal information was repeated in the models' responses to the vignettes. CONCLUSIONS LLMs may be safe to guide antibiotic prescribing in general practice. However, to interpret vignettes, apply national guidelines, and prescribe the right dose and duration, GPs remain best placed.
Collapse
Affiliation(s)
- Oanh Ngoc Nguyen
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| | - Doaa Amin
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| | - James Bennett
- NIHR In Practice Fellow, Hull York Medical School, University of Hull, Hull HU6 7RX, UK
| | - Øystein Hetlevik
- Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
| | - Sara Malik
- Midleton Medi Center, Midleton, Co Cork, Ireland
| | - Andrew Tout
- Division of General Internal Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
| | - Heike Vornhagen
- CARA Network, Insight Centre for Data Analytics, University of Galway, Galway, Ireland
| | - Akke Vellinga
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| |
Collapse
|
49
|
Hou Z, Liu H, Bian J, He X, Zhuang Y. Enhancing medical coding efficiency through domain-specific fine-tuned large language models. NPJ HEALTH SYSTEMS 2025; 2:14. [PMID: 40321467 PMCID: PMC12045799 DOI: 10.1038/s44401-025-00018-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/01/2025] [Accepted: 04/11/2025] [Indexed: 05/08/2025]
Abstract
Medical coding is essential for healthcare operations yet remains predominantly manual, error-prone (up to 20%), and costly (up to $18.2 billion annually). Although large language models (LLMs) have shown promise in natural language processing, their application to medical coding has produced limited accuracy. In this study, we evaluated whether fine-tuning LLMs with specialized ICD-10 knowledge can automate code generation across clinical documentation. We adopted a two-phase approach: initial fine-tuning using 74,260 ICD-10 code-description pairs, followed by enhanced training to address linguistic and lexical variations. Evaluations using a proprietary model (GPT-4o mini) on a cloud platform and an open-source model (Llama) on local GPUs demonstrated that initial fine-tuning increased exact matching from <1% to 97%, while enhanced fine-tuning further improved performance in complex scenarios, with real-world clinical notes achieving 69.20% exact match and 87.16% category match. These findings indicate that domain-specific fine-tuned LLMs can reduce manual burdens and improve reliability.
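The two metrics quoted above can be computed directly: exact match compares the full ICD-10 code, while category match compares the three-character category (E11.9 and E11.65 share category E11). The codes below are hypothetical examples.

```python
# Minimal sketch: exact-match vs. category-match accuracy for ICD-10 codes.
def category(code: str) -> str:
    # The ICD-10 category is the first three characters of the code.
    return code.replace(".", "")[:3]

gold = ["E11.9", "I10", "J45.909", "N18.3"]
pred = ["E11.65", "I10", "J45.909", "N18.4"]

exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
cat = sum(category(g) == category(p) for g, p in zip(gold, pred)) / len(gold)
print(f"Exact match: {exact:.0%}, category match: {cat:.0%}")
```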
Collapse
Affiliation(s)
- Zhen Hou
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
| | - Hao Liu
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
- School of Computing, College of Science and Mathematics, Montclair State University, Montclair, NJ USA
| | - Jiang Bian
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
- Department of Biostatistics and Health Data Science, School of Medicine, Indiana University, Indianapolis, IN USA
- Regenstrief Institute, Indiana University, Indianapolis, IN USA
- Indiana University Health, Indianapolis, IN USA
| | - Xing He
- Department of Biostatistics and Health Data Science, School of Medicine, Indiana University, Indianapolis, IN USA
- Regenstrief Institute, Indiana University, Indianapolis, IN USA
| | - Yan Zhuang
- Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA
| |
Collapse
|
50
|
Song ES, Lee S. Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions. Int J Dent Hyg 2025; 23:267-276. [PMID: 39415339 PMCID: PMC11982589 DOI: 10.1111/idh.12848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2024] [Revised: 08/29/2024] [Accepted: 09/26/2024] [Indexed: 10/18/2024]
Abstract
INTRODUCTION Large language models such as Gemini, GPT-3.5, and GPT-4 have demonstrated significant potential in the medical field. Their performance in medical licensing examinations globally has highlighted their capabilities in understanding and processing specialized medical knowledge. This study aimed to evaluate and compare the performance of Gemini, GPT-3.5, and GPT-4 in the Korean National Dental Hygienist Examination. Accuracy in answering the examination questions in both Korean and English was assessed. METHODS This study used a dataset comprising questions from the Korean National Dental Hygienist Examination over 5 years (2019-2023). A two-way analysis of variance (ANOVA) was employed to investigate the impact of model type and language on response accuracy. Questions were input into each model under standardized conditions, and responses were classified as correct or incorrect based on predefined criteria. RESULTS GPT-4 consistently outperformed the other models, achieving the highest accuracy rates across both language versions annually. In particular, it showed superior performance in English, suggesting advancements in its training algorithms for language processing. However, all models demonstrated variable accuracies in subjects with localized characteristics, such as health and medical law. CONCLUSIONS These findings indicate that GPT-4 holds significant promise for application in medical education and standardized testing, especially in English. However, the variability in performance across different subjects and languages underscores the need for ongoing improvements and the inclusion of more diverse and localized training datasets to enhance the models' effectiveness in multilingual and multicultural contexts.
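A two-way ANOVA with model type and question language as factors can be sketched as below; the accuracy values are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: two-way ANOVA (model x language) on accuracy values.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "model": ["Gemini", "GPT-3.5", "GPT-4"] * 4,
    "language": ["Korean"] * 6 + ["English"] * 6,
    "accuracy": [0.55, 0.58, 0.75, 0.53, 0.60, 0.78,
                 0.60, 0.63, 0.82, 0.58, 0.65, 0.85],
})

fit = ols("accuracy ~ C(model) + C(language) + C(model):C(language)", df).fit()
print(sm.stats.anova_lm(fit, typ=2))
```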
Collapse
Grants
- 1711196792, RS-2023-00253380 The Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety, KOREA
Collapse
Affiliation(s)
- Eun Sun Song
- Department of Oral Anatomy, Dental Research Institute, School of Dentistry, Seoul National University, Seoul, South Korea
| | - Seung‐Pyo Lee
- Department of Oral Anatomy, Dental Research Institute, School of Dentistry, Seoul National University, Seoul, South Korea
| |
Collapse
|