1. Holt NM, Byrne MF. The Role of Artificial Intelligence and Big Data for Gastrointestinal Disease. Gastrointest Endosc Clin N Am 2025;35:291-308. PMID: 40021230. DOI: 10.1016/j.giec.2024.09.004.
Abstract
Artificial intelligence (AI) is a rapidly evolving presence in all fields and industries, with the ability to both improve quality and reduce the burden of human effort. Gastroenterology is a field with a focus on diagnostic techniques and procedures, and AI and big data have established and growing roles to play. Alongside these opportunities are challenges, which will evolve in parallel.
Affiliation(s)
- Nicholas Mathew Holt
- Gastroenterology and Hepatology Unit, The Canberra Hospital, Yamba Drive, Garran, ACT 2605, Australia.
- Michael Francis Byrne
- Division of Gastroenterology, Vancouver General Hospital, University of British Columbia, UBC Division of Gastroenterology, 5153 - 2775 Laurel Street, Vancouver, British Columbia V5Z 1M9, Canada
2. Berry P, Dhanakshirur RR, Khanna S. Utilizing large language models for gastroenterology research: a conceptual framework. Therap Adv Gastroenterol 2025;18:17562848251328577. PMID: 40171241. PMCID: PMC11960180. DOI: 10.1177/17562848251328577.
Abstract
Large language models (LLMs) transform healthcare by assisting clinicians with decision-making, research, and patient management. In gastroenterology, LLMs have shown potential in clinical decision support, data extraction, and patient education. However, challenges such as bias, hallucinations, integration with clinical workflows, and regulatory compliance must be addressed for safe and effective implementation. This manuscript presents a structured framework for integrating LLMs into gastroenterology, using Hepatitis C treatment as a real-world application. The framework outlines key steps to ensure accuracy, safety, and clinical relevance while mitigating risks associated with artificial intelligence (AI)-driven healthcare tools. The framework includes defining clinical goals, assembling a multidisciplinary team, data collection and preparation, model selection, fine-tuning, calibration, hallucination mitigation, user interface development, integration with electronic health records, real-world validation, and continuous improvement. Retrieval-augmented generation and fine-tuning approaches are evaluated for optimizing model adaptability. Bias detection, reinforcement learning from human feedback, and structured prompt engineering are incorporated to enhance reliability. Ethical and regulatory considerations, including the Health Insurance Portability and Accountability Act, General Data Protection Regulation, and AI-specific guidelines (DECIDE-AI, SPIRIT-AI, CONSORT-AI), are addressed to ensure responsible AI deployment. LLMs have the potential to enhance decision-making, research efficiency, and patient care in gastroenterology, but responsible deployment requires bias mitigation, transparency, and ongoing validation. Future research should focus on multi-institutional validation and AI-assisted clinical trials to establish LLMs as reliable tools in gastroenterology.
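As an illustration of the retrieval-augmented generation step evaluated in this framework, the minimal Python sketch below ranks guideline snippets by TF-IDF similarity and builds an augmented prompt. The snippets, query, and prompt wording are invented for illustration and are not taken from the paper.

```python
# Minimal RAG retrieval step: rank guideline snippets against a clinical query
# and prepend the best matches to the prompt sent to the LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guideline_snippets = [
    "Pan-genotypic direct-acting antivirals are first-line therapy for chronic hepatitis C.",
    "Assess liver fibrosis before treatment, for example with transient elastography.",
    "Check for drug-drug interactions before starting direct-acting antivirals.",
]
query = "How should treatment-naive chronic hepatitis C be managed?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(guideline_snippets)   # one row per snippet
query_vector = vectorizer.transform([query])
scores = cosine_similarity(query_vector, doc_vectors).ravel()

top_k = 2
best = scores.argsort()[::-1][:top_k]
context = "\n".join(guideline_snippets[i] for i in best)
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
print(prompt)  # this augmented prompt is what would be passed to the LLM
```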
Affiliation(s)
- Parul Berry
- Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA
- Sahil Khanna
- Division of Gastroenterology and Hepatology, Mayo Clinic, 200 1st Street SW, Rochester, MN 55905, USA
3. Fazilat AZ, Brenac C, Kawamoto-Duran D, Berry CE, Alyono J, Chang MT, Liu DT, Patel ZM, Tringali S, Wan DC, Fieux M. Evaluating the quality and readability of ChatGPT-generated patient-facing medical information in rhinology. Eur Arch Otorhinolaryngol 2025;282:1911-1920. PMID: 39724239. DOI: 10.1007/s00405-024-09180-0.
Abstract
PURPOSE The artificial intelligence (AI) chatbot ChatGPT has become a major tool for generating responses in healthcare. This study assessed ChatGPT's ability to generate French preoperative patient-facing medical information (PFI) in rhinology at a comparable level to material provided by an academic source, the French Society of Otorhinolaryngology (Société Française d'Otorhinolaryngologie et Chirurgie Cervico-Faciale, SFORL). METHODS ChatGPT and SFORL French preoperative PFI in rhinology were compared by analyzing responses to 16 questions regarding common rhinology procedures: ethmoidectomy, sphenoidotomy, septoplasty, and endonasal dacryocystorhinostomy. Twenty rhinologists assessed the clarity, comprehensiveness, accuracy, and overall quality of the information, while 24 nonmedical individuals analyzed the clarity and overall quality. Six readability formulas were used to compare readability scores. RESULTS Among rhinologists, no significant difference was found between ChatGPT and SFORL regarding clarity (7.61 ± 0.36 vs. 7.53 ± 0.28; p = 0.485), comprehensiveness (7.32 ± 0.77 vs. 7.58 ± 0.50; p = 0.872), and accuracy (inaccuracies: 60% vs. 40%; p = 0.228), respectively. Non-medical individuals scored the clarity of ChatGPT significantly higher than that of the SFORL (8.16 ± 1.16 vs. 6.32 ± 1.33; p < 0.0001). The non-medical individuals chose ChatGPT as the most informative source significantly more often than rhinologists (62.8% vs. 39.7%, p < 0.001). CONCLUSION ChatGPT-generated French preoperative PFI in rhinology was comparable to SFORL-provided PFI regarding clarity, comprehensiveness, accuracy, readability, and overall quality. This study highlights ChatGPT's potential to increase accessibility to high quality PFI and suggests its use by physicians as a complement to academic resources written by learned societies such as the SFORL.
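Readability comparisons of this kind rely on formulas such as the Flesch Reading Ease and Flesch-Kincaid Grade Level. Below is a minimal Python sketch of both, in their standard English-language form with a crude syllable heuristic; the sample sentence is invented, and the study's French materials would require French-adapted variants.

```python
# Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL),
# approximating syllables by counting groups of consecutive vowels.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl

sample = "The septum divides the nose. Surgery straightens it and eases breathing."
print(readability(sample))  # higher FRE = easier; FKGL approximates a US grade level
```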
Affiliation(s)
- Alexander Z Fazilat
- Hagey Laboratory for Pediatric Regenerative Medicine, Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Camille Brenac
- Hagey Laboratory for Pediatric Regenerative Medicine, Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Service de chirurgie plastique reconstructrice et esthétique, Hospices Civils de Lyon, Hôpital de la Croix Rousse, Lyon, F-69004, France
- Université de Lyon, Université Lyon 1, Lyon, F-69003, France
- Danae Kawamoto-Duran
- Hagey Laboratory for Pediatric Regenerative Medicine, Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Charlotte E Berry
- Hagey Laboratory for Pediatric Regenerative Medicine, Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Jennifer Alyono
- Department of Otolaryngology, Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Michael T Chang
- Department of Otolaryngology, Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- David T Liu
- Department of Otolaryngology, Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Zara M Patel
- Department of Otolaryngology, Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Stéphane Tringali
- Université de Lyon, Université Lyon 1, Lyon, F-69003, France
- Service d'ORL, d'otoneurochirurgie et de chirurgie cervico-faciale, Hospices civils de Lyon, Hôpital Lyon Sud, Pierre Bénite, F-69310, France
- Laboratoire de Biologie Tissulaire et d'Ingénierie Thérapeutique, UMR 5305, Institut de Biologie et Chimie des Protéines, CNRS/Université Claude Bernard Lyon 1, 7 Passage du Vercors, CEDEX 07, Lyon, 69367, France
- Derrick C Wan
- Hagey Laboratory for Pediatric Regenerative Medicine, Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Maxime Fieux
- Université de Lyon, Université Lyon 1, Lyon, F-69003, France.
- Department of Otolaryngology, Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA.
- Service d'ORL, d'otoneurochirurgie et de chirurgie cervico-faciale, Hospices civils de Lyon, Hôpital Lyon Sud, Pierre Bénite, F-69310, France.
- Laboratoire de Biologie Tissulaire et d'Ingénierie Thérapeutique, UMR 5305, Institut de Biologie et Chimie des Protéines, CNRS/Université Claude Bernard Lyon 1, 7 Passage du Vercors, CEDEX 07, Lyon, 69367, France.
4. Niriella MA, Premaratna P, Senanayake M, Kodisinghe S, Dassanayake U, Dassanayake A, Ediriweera DS, de Silva HJ. The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study. Expert Rev Gastroenterol Hepatol 2025;19:437-442. PMID: 39985424. DOI: 10.1080/17474124.2025.2471874.
Abstract
BACKGROUND We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information. RESEARCH DESIGN AND METHODS We compared the accuracy, completeness and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease, with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response. RESULTS The expert and AI-generated responses displayed high mean scores across all domains, with no statistical difference between the groups for accuracy [H(2) = 0.421, p = 0.811], completeness [H(2) = 3.146, p = 0.207], or quality [H(2) = 3.350, p = 0.187]. We found no statistical difference between rank totals in accuracy [H(2) = 5.559, p = 0.062], completeness [H(2) = 0.104, p = 0.949], or quality [H(2) = 0.420, p = 0.810] between the three raters (R1, R2, R3). CONCLUSION Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.
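A minimal Python sketch of the Kruskal-Wallis comparison described above, using scipy; the rating vectors are invented for illustration and are not study data.

```python
# Kruskal-Wallis H test comparing ratings of three independent responder groups
# (two human experts and one LLM) on the same set of questions.
from scipy.stats import kruskal

expert_1 = [5, 4, 5, 5, 4, 5, 4, 5]
expert_2 = [4, 5, 5, 4, 5, 4, 5, 5]
llm      = [5, 5, 4, 4, 5, 5, 4, 4]

h_stat, p_value = kruskal(expert_1, expert_2, llm)
print(f"H(2) = {h_stat:.3f}, p = {p_value:.3f}")  # p > 0.05 -> no detectable group difference
```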
5. Wu W, Guo Y, Li Q, Jia C. Exploring the potential of large language models in identifying metabolic dysfunction-associated steatotic liver disease: A comparative study of non-invasive tests and artificial intelligence-generated responses. Liver Int 2025;45:e16112. PMID: 39526465. DOI: 10.1111/liv.16112.
Abstract
BACKGROUND AND AIMS This study sought to assess the capabilities of large language models (LLMs) in identifying clinically significant metabolic dysfunction-associated steatotic liver disease (MASLD). METHODS We included individuals from NHANES 2017-2018. The validity and reliability of MASLD diagnosis by GPT-3.5 and GPT-4 were quantitatively examined and compared with those of the Fatty Liver Index (FLI) and United States FLI (USFLI). A receiver operating characteristic curve was conducted to assess the accuracy of MASLD diagnosis via different scoring systems. Additionally, GPT-4V's potential in clinical diagnosis using ultrasound images from MASLD patients was evaluated to provide assessments of LLM capabilities in both textual and visual data interpretation. RESULTS GPT-4 demonstrated comparable performance in MASLD diagnosis to FLI and USFLI with the AUROC values of .831 (95% CI .796-.867), .817 (95% CI .797-.837) and .827 (95% CI .807-.848), respectively. GPT-4 exhibited a trend of enhanced accuracy, clinical relevance and efficiency compared to GPT-3.5 based on clinician evaluation. Additionally, Pearson's r values between GPT-4 and FLI, as well as USFLI, were .718 and .695, respectively, indicating robust and moderate correlations. Moreover, GPT-4V showed potential in understanding characteristics from hepatic ultrasound imaging but exhibited limited interpretive accuracy in diagnosing MASLD compared to skilled radiologists. CONCLUSIONS GPT-4 achieved performance comparable to traditional risk scores in diagnosing MASLD and exhibited improved convenience, versatility and the capacity to offer user-friendly outputs. The integration of GPT-4V highlights the capacities of LLMs in handling both textual and visual medical data, reinforcing their expansive utility in healthcare practice.
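A minimal Python sketch of the AUROC and Pearson correlation analyses described above, using synthetic arrays in place of the NHANES-derived scores; all numbers are illustrative only.

```python
# AUROC of a continuous score against a binary MASLD label, plus Pearson's r
# between two scoring systems, on synthetic data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
masld = rng.integers(0, 2, size=200)                        # 0/1 reference label
fli = masld * 30 + rng.normal(40, 15, size=200)             # Fatty Liver Index-like score
gpt_prob = masld * 0.3 + rng.normal(0.4, 0.15, size=200)    # model-assigned probability

print("AUROC (FLI):", round(roc_auc_score(masld, fli), 3))
print("AUROC (GPT):", round(roc_auc_score(masld, gpt_prob), 3))
r, p = pearsonr(fli, gpt_prob)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```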
Affiliation(s)
- Wanying Wu
- Department of Cardiology, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Department of Guangdong Provincial Key Laboratory of Coronary Heart Disease Prevention, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Yuhu Guo
- Faculty of Science and Engineering, The University of Manchester, Manchester, UK
- Qi Li
- Department of Neurology, The First Affiliated Hospital of Hebei North University, Zhangjiakou, China
- Congzhuo Jia
- Department of Cardiology, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Department of Guangdong Provincial Key Laboratory of Coronary Heart Disease Prevention, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
6. Chou HH, Chen YH, Lin CT, Chang HT, Wu AC, Tsai JL, Chen HW, Hsu CC, Liu SY, Lee JT. AI-driven patient support: Evaluating the effectiveness of ChatGPT-4 in addressing queries about ovarian cancer compared with healthcare professionals in gynecologic oncology. Support Care Cancer 2025;33:337. PMID: 40167802. DOI: 10.1007/s00520-025-09389-7.
Abstract
PURPOSE Artificial intelligence (AI) chatbots, such as ChatGPT-4, allow a user to ask questions on an interactive level. This study evaluated the correctness and completeness of responses to questions about ovarian cancer from a GPT-4 chatbot, LilyBot, compared with responses from healthcare professionals in gynecologic cancer care. METHODS Fifteen categories of questions about ovarian cancer were collected from an online patient Chatgroup forum. Ten healthcare professionals in gynecologic oncology generated 150 questions and responses relative to these topics. Responses from LilyBot and the healthcare professionals were scored for correctness and completeness by eight independent healthcare professionals with similar backgrounds blinded to the identity of the responders. Differences between groups were analyzed with Mann-Whitney U and Kruskal-Wallis tests, followed by Tukey's post hoc comparisons. RESULTS Mean scores for overall performance for all 150 questions were significantly higher for LilyBot compared with the healthcare professionals for correctness (5.31 ± 0.98 vs. 5.07 ± 1.00, p = 0.017; range = 1-6) and completeness (2.66 ± 0.55 vs. 2.36 ± 0.55, p < 0.001; range = 1-3). LilyBot had significantly higher scores for immunotherapy compared with the healthcare professionals for correctness (6.00 ± 0.00 vs. 4.70 ± 0.48, p = 0.020) and completeness (3.00 ± 0.00 vs. 2.00 ± 0.00, p < 0.010); and gene therapy for completeness (3.00 ± 0.00 vs. 2.20 ± 0.42, p = 0.023). CONCLUSIONS The significantly better performance by LilyBot compared with healthcare professionals highlights the potential of ChatGPT-4-based dialogue systems to provide patients with clinical information about ovarian cancer.
Affiliation(s)
- Hung-Hsueh Chou
- Department of Obstetrics and Gynecology, Linkou Branch, Chang Gung Memorial Hospital, Tao-Yuan, Taiwan
- School of Medicine, National Tsing Hua University, Hsinchu, Taiwan
- Yi Hua Chen
- School of Nursing, College of Medicine, Chang Gung University, Tao-Yuan, Taiwan
- Chiu-Tzu Lin
- Nursing Department, Linkou Branch, Chang Gung Memorial Hospital, Tao-Yuan, Taiwan
- Hsien-Tsung Chang
- Bachelor Program in Artificial Intelligence, Chang Gung University, Taoyuan 333, Tao-Yuan, Taiwan
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan 333, Tao-Yuan, Taiwan
- Department of Physical Medicine and Rehabilitation, Chang Gung Memorial Hospital, Taoyuan 333, Tao-Yuan, Taiwan
- An-Chieh Wu
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan 333, Tao-Yuan, Taiwan
- Jia-Ling Tsai
- School of Nursing, College of Medicine, Chang Gung University, Tao-Yuan, Taiwan
- Hsiao-Wei Chen
- School of Nursing, College of Medicine, Chang Gung University, Tao-Yuan, Taiwan
- Ching-Chun Hsu
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan 333, Tao-Yuan, Taiwan
- Shu-Ya Liu
- School of Nursing, College of Medicine, Chang Gung University, Tao-Yuan, Taiwan
- Jian Tao Lee
- School of Nursing, College of Medicine, Chang Gung University, Tao-Yuan, Taiwan.
- Nursing Department, Linkou Branch, Chang Gung Memorial Hospital, Tao-Yuan, Taiwan.
7. Yan Z, Liu J, Fan Y, Lu S, Xu D, Yang Y, Wang H, Mao J, Tseng HC, Chang TH, Chen Y. Ability of ChatGPT to Replace Doctors in Patient Education: Cross-Sectional Comparative Analysis of Inflammatory Bowel Disease. J Med Internet Res 2025;27:e62857. PMID: 40163853. DOI: 10.2196/62857.
Abstract
BACKGROUND Although large language models (LLMs) such as ChatGPT show promise for providing specialized information, their quality requires further evaluation. This is especially true considering that these models are trained on internet text and the quality of health-related information available online varies widely. OBJECTIVE The aim of this study was to evaluate the performance of ChatGPT in the context of patient education for individuals with chronic diseases, comparing it with that of industry experts to elucidate its strengths and limitations. METHODS This evaluation was conducted in September 2023 by analyzing the responses of ChatGPT and specialist doctors to questions posed by patients with inflammatory bowel disease (IBD). We compared their performance in terms of subjective accuracy, empathy, completeness, and overall quality, as well as readability to support objective analysis. RESULTS In a series of 1578 binary choice assessments, ChatGPT was preferred in 48.4% (95% CI 45.9%-50.9%) of instances. There were 12 instances where ChatGPT's responses were unanimously preferred by all evaluators, compared with 17 instances for specialist doctors. In terms of overall quality, there was no significant difference between the responses of ChatGPT (3.98, 95% CI 3.93-4.02) and those of specialist doctors (3.95, 95% CI 3.90-4.00; t524=0.95, P=.34), both being considered "good." Although differences in accuracy (t521=0.48, P=.63) and empathy (t511=2.19, P=.03) lacked statistical significance, the completeness of textual output (t509=9.27, P<.001) was a distinct advantage of the LLM (ChatGPT). In the sections of the questionnaire where patients and doctors responded together (Q223-Q242), ChatGPT demonstrated inferior performance (t36=2.91, P=.006). Regarding readability, no statistical difference was found between the responses of specialist doctors (median: 7th grade; Q1: 4th grade; Q3: 8th grade) and those of ChatGPT (median: 7th grade; Q1: 7th grade; Q3: 8th grade) according to the Mann-Whitney U test (P=.09). The overall quality of ChatGPT's output exhibited strong correlations with other subdimensions (with empathy: r=0.842; with accuracy: r=0.839; with completeness: r=0.795), and there was also a high correlation between the subdimensions of accuracy and completeness (r=0.762). CONCLUSIONS ChatGPT demonstrated more stable performance across various dimensions. Its output of health information content is more structurally sound, addressing the issue of variability in the information from individual specialist doctors. ChatGPT's performance highlights its potential as an auxiliary tool for health information, despite limitations such as artificial intelligence hallucinations. It is recommended that patients be involved in the creation and evaluation of health information to enhance the quality and relevance of the information.
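The preference proportion and its confidence interval reported above can be reproduced with a normal-approximation binomial CI; a minimal Python sketch follows, where the win count of 764 is inferred from the reported 48.4% of 1578 assessments rather than taken directly from the paper.

```python
# Normal-approximation 95% confidence interval for a preference proportion.
from math import sqrt

wins, n = 764, 1578            # 764/1578 ≈ 48.4% (count inferred from the reported percentage)
p_hat = wins / n
se = sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"{p_hat:.1%} (95% CI {low:.1%}-{high:.1%})")   # close to the reported 48.4% (45.9%-50.9%)
```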
Affiliation(s)
- Zelin Yan
- Zhejiang Provincial Key Laboratory of Gastrointestinal Diseases Pathophysiology, Department of Gastroenterology, The First Affiliated Hospital of Zhejiang Chinese Medical University, Hangzhou, China
- Center of Inflammatory Bowel Diseases, Department of Gastroenterology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- The China Crohn's & Colitis Foundation, Hangzhou, China
- Jingwen Liu
- Center of Inflammatory Bowel Diseases, Department of Gastroenterology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Yihong Fan
- Zhejiang Provincial Key Laboratory of Gastrointestinal Diseases Pathophysiology, Department of Gastroenterology, The First Affiliated Hospital of Zhejiang Chinese Medical University, Hangzhou, China
- Shiyuan Lu
- Center of Inflammatory Bowel Diseases, Department of Gastroenterology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Dingting Xu
- Center of Inflammatory Bowel Diseases, Department of Gastroenterology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Yun Yang
- The Clinical Medical College, Zhejiang University School of Medicine, Hangzhou, China
- Honggang Wang
- Department of Gastroenterology, The Affiliated Huaian No.1 People's Hospital of Nanjing Medical University, Huai'an, China
- Jie Mao
- The Second Clinical Medical College, Zhejiang Chinese Medical University, Hangzhou, China
- Hou-Chiang Tseng
- Graduate Institute of Digital Learning and Education, National Taiwan University of Science and Technology, Taipei, Taiwan
- Tao-Hsing Chang
- Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
- Yan Chen
- Center of Inflammatory Bowel Diseases, Department of Gastroenterology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- The China Crohn's & Colitis Foundation, Hangzhou, China
8. Hashim A, Stefanini B, Piscaglia F. Thinking outside the box: unconventional artificial intelligence algorithms in the detection and management of liver cirrhosis. Expert Rev Gastroenterol Hepatol 2025:1-3. PMID: 40125919. DOI: 10.1080/17474124.2025.2483995.
Affiliation(s)
- Ahmed Hashim
- Cambridge Liver Unit, Addenbrooke's Hospital, Cambridge, UK
- Bernardo Stefanini
- Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
- Fabio Piscaglia
- Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
- Division of Internal Medicine, Hepatobiliary and Immunoallergic Diseases, IRCCS Azienda Ospedaliero Universitaria di Bologna, Bologna, Italy
9. Zhou X, Chen Y, Abdulghani EA, Zhang X, Zheng W, Li Y. Performance in answering orthodontic patients' frequently asked questions: Conversational artificial intelligence versus orthodontists. J World Fed Orthod 2025:S2212-4438(25)00012-8. PMID: 40140287. DOI: 10.1016/j.ejwf.2025.02.001.
Abstract
OBJECTIVES Can conversational artificial intelligence (AI) help alleviate orthodontic patients' general doubts? This study aimed to investigate the performance of conversational AI in answering frequently asked questions (FAQs) from orthodontic patients, with comparison to orthodontists. MATERIALS AND METHODS Thirty FAQs were selected covering the pre-, during-, and postorthodontic treatment stages. Each question was respectively answered by AI (Chat Generative Pretrained Transformer [ChatGPT]-4) and two orthodontists (Ortho. A and Ortho. B), randomly drawn out of a panel. Their responses to the 30 FAQs were ranked by four raters, randomly selected from another panel of orthodontists, resulting in 120 rankings. All the participants were Chinese, and all the questions and answers were conducted in Chinese. RESULTS Among the 120 rankings, ChatGPT was ranked first in 61 instances (50.8%), second in 35 instances (29.2%), and third in 24 instances (20.0%). Furthermore, the mean rank of ChatGPT was 1.69 ± 0.79, significantly better than that of Ortho. A (2.23 ± 0.79, P < 0.001) and Ortho. B (2.08 ± 0.79, P < 0.05). No significant difference was found between the two orthodontist groups. Additionally, the Spearman correlation coefficient between the average ranking of ChatGPT and the inter-rater agreement was 0.69 (P < 0.001), indicating a strong positive correlation between the two variables. CONCLUSIONS Overall, the conversational AI ChatGPT-4 may outperform orthodontists in addressing orthodontic patients' FAQs, even in a non-English language. In addition, ChatGPT tends to perform better when responding to questions with answers widely accepted among orthodontic professionals, and vice versa.
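A minimal Python sketch of the rank summary and Spearman correlation described above; all rank and agreement values below are invented for illustration, not the study's data.

```python
# Mean rank of the AI responder and Spearman correlation between its average
# rank per question and the inter-rater agreement on that question.
from scipy.stats import spearmanr
from statistics import mean

chatgpt_avg_rank = [1.00, 1.25, 2.50, 1.75, 1.00, 2.25]   # lower = ranked better
rater_agreement  = [0.95, 0.85, 0.40, 0.60, 0.90, 0.45]   # share of raters agreeing per question

print("overall mean rank:", round(mean(chatgpt_avg_rank), 2))
rho, p = spearmanr(chatgpt_avg_rank, rater_agreement)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```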
Affiliation(s)
- Xinlianyi Zhou
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Yao Chen
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Ehab A Abdulghani
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China; Department of Orthodontics and Dentofacial Orthopedics, College of Dentistry, Thamar University, Dhamar, Yemen
- Xu Zhang
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Wei Zheng
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China.
- Yu Li
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China.
10. Li KP, Wang L, Wan S, Wang CY, Chen SY, Liu SH, Yang L. Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models. J Endourol 2025. PMID: 40099418. DOI: 10.1089/end.2024.0860.
Abstract
Background: With the rapid advancement of artificial intelligence in health care, large language models (LLMs) demonstrate increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of multiple leading LLMs in addressing clinical inquiries related to bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations. Methods: We developed a comprehensive set of 100 clinical questions based on established guidelines. These questions encompassed epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were tested through three independent trials. The responses were validated against current clinical guidelines and expert consensus. We implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance. Results: In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy. Conclusion: This study not only establishes the comparative performance of various LLMs in BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo's performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.
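A small arithmetic sketch of how accuracy over three repeated trials of the same 100-question set is summarized as mean ± SD; the per-trial values below are one possible set consistent with the reported 89.33% ± 1.53% for the top model, not figures taken from the paper.

```python
# Mean ± sample standard deviation of accuracy across three independent trials.
from statistics import mean, stdev

accuracies = [88.0, 91.0, 89.0]   # illustrative percent-correct values, one per trial
print(f"accuracy = {mean(accuracies):.2f}% ± {stdev(accuracies):.2f}%")   # 89.33% ± 1.53%
```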
Affiliation(s)
- Kun-Peng Li
- Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China
- Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China
- Li Wang
- Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China
- Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China
- Shun Wan
- Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China
- Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China
- Chen-Yang Wang
- Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China
- Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China
- Si-Yu Chen
- Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China
- Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China
- Shan-Hui Liu
- Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China
- Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China
- Li Yang
- Department of Urology, The Second Hospital of Lanzhou University, Lanzhou, China
- Institute of Urology, Clinical Research Center for Urology in Gansu Province, Lanzhou, China
11. Zhu H, Wang R, Qian J, Wu Y, Jin Z, Shan X, Ji F, Yuan Z, Pan T. Leveraging Large Language Models for Predicting Postoperative Acute Kidney Injury in Elderly Patients. BME Frontiers 2025;6:0111. PMID: 40071150. PMCID: PMC11896637. DOI: 10.34133/bmef.0111.
Abstract
Objective: The objective of this work is to develop a framework based on large language models (LLMs) to predict postoperative acute kidney injury (AKI) outcomes in elderly patients. Impact Statement: Our study demonstrates that LLMs have the potential to address the issues of poor generalization and weak interpretability commonly encountered in disease prediction using traditional machine learning (ML) models. Introduction: AKI is a severe postoperative complication, especially in elderly patients with declining renal function. Current AKI prediction models rely on ML, but their lack of interpretability and generalizability limits clinical use. LLMs, with extensive pretraining and text generation capabilities, offer a new solution. Methods: We applied prompt engineering and knowledge distillation based on instruction fine-tuning to optimize LLMs for AKI prediction. The framework was tested on 2,649 samples from 2 private Chinese hospitals and one public South Korean dataset, which were divided into internal and external datasets. Results: The LLM framework showed robust external performance, with accuracy rates: commercial LLMs (internal: 63.73%, external: 68.73%), open-source LLMs (internal: 63.70%, external: 64.24%), and ML models (internal: 63.93%, external: 58.27%). LLMs also provided human-readable explanations for better clinical understanding. Conclusion: The proposed framework showcases the potential of LLMs to enhance generalization and interpretability in postoperative AKI prediction, paving the way for more robust and transparent predictive solutions in clinical settings.
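A minimal sketch of the prompt-engineering step described above: structured perioperative variables are rendered into a natural-language prompt asking the LLM for an AKI risk judgment with a rationale. The field names and template wording are assumptions for illustration, not the authors' actual prompts.

```python
# Render structured perioperative data into a prompt for an AKI-prediction LLM.
# All field names and the template text are hypothetical.
patient = {
    "age": 78,
    "baseline_creatinine_mg_dl": 1.4,
    "surgery": "hip replacement",
    "intraoperative_hypotension_min": 12,
    "nephrotoxic_drugs": ["ibuprofen"],
}

PROMPT_TEMPLATE = (
    "You are a perioperative medicine assistant.\n"
    "Patient summary:\n{summary}\n"
    "Question: Will this patient develop postoperative acute kidney injury (KDIGO criteria)?\n"
    "Answer 'yes' or 'no' and give a one-sentence rationale."
)

summary = "\n".join(f"- {key.replace('_', ' ')}: {value}" for key, value in patient.items())
prompt = PROMPT_TEMPLATE.format(summary=summary)
print(prompt)   # this text would be sent to the commercial or open-source LLM
```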
Affiliation(s)
- Hanfei Zhu
- Center for Intelligent Medical Equipment and Devices, Institute for innovative Medical Devices, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu, P. R. China
- Ruojiang Wang
- Center for Intelligent Medical Equipment and Devices, Institute for innovative Medical Devices, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu, P. R. China
- Jiajie Qian
- Department of Anesthesiology, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu, P. R. China
- Yuhao Wu
- Center for Intelligent Medical Equipment and Devices, Institute for innovative Medical Devices, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu, P. R. China
- Zhuqing Jin
- Department of Nephrology, The Second Affiliated Hospital of Anhui Medical University, Hefei, P. R. China
- Xishen Shan
- Department of Anesthesiology, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu, P. R. China
- Fuhai Ji
- Department of Anesthesiology, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu, P. R. China
- Zixuan Yuan
- Thrust of Financial Technology, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, P. R. China
- Tingrui Pan
- Center for Intelligent Medical Equipment and Devices, Institute for innovative Medical Devices, School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, P. R. China
- Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu, P. R. China
- Department of Precision Machinery and Precision Instrumentation, School of Engineering Science, University of Science and Technology of China, Hefei, Anhui, P. R. China
12. Altorfer FCS, Kelly MJ, Avrumova F, Rohatgi V, Zhu J, Bono CM, Lebl DR. The double-edged sword of generative AI: surpassing an expert or a deceptive "false friend"? Spine J 2025:S1529-9430(25)00122-6. PMID: 40049450. DOI: 10.1016/j.spinee.2025.02.010.
Abstract
BACKGROUND CONTEXT Generative artificial intelligence (AI), ChatGPT being the most popular example, has been extensively assessed for its capability to respond to medical questions, such as queries in spine treatment approaches or technological advances. However, it often lacks scientific foundation or fabricates inauthentic references, also known as AI hallucinations. PURPOSE To develop an understanding of the scientific basis of generative AI tools by studying the authenticity of references and reliability in comparison to the alignment of responses of evidence-based guidelines. STUDY DESIGN Comparative study. METHODS Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to 2 freely available generative AI tools (Tools I and II). The responses were scored for correctness compared with the published NASS guideline responses using a 5-point "alignment score." Furthermore, all cited references were evaluated for authenticity, source type, year of publication, and inclusion in the scientific guidelines. RESULTS Both tools' responses to guideline questions achieved an overall score of 3.5±1.1, which is considered acceptable to be equivalent to the guideline. Both tools generated 254 references to support their responses, of which 76.0% (n=193) were authentic and 24.0% (n=61) were fabricated. From these, authentic references were: peer-reviewed scientific research papers (147, 76.2%), guidelines (16, 8.3%), educational websites (9, 4.7%), books (9, 4.7%), a government website (1, 0.5%), insurance websites (6, 3.1%) and newspaper websites (5, 2.6%). Claude referenced significantly more authentic peer-reviewed scientific papers (Claude: n=111, 91.0%; Gemini: n=36, 50.7%; p<.001). The year of publication amongst all references ranged from 1988-2023, with significantly older references provided by Claude (Claude: 2008±6; Gemini: 2014±6; p<.001). Lastly, significantly more references provided by Claude were also referenced in the published NASS guidelines (Claude: n=27, 24.3%; Gemini: n=1, 2.8%; p=.04). CONCLUSIONS Both generative AI tools provided responses that had acceptable alignment with NASS evidence-based guideline recommendations and offered references, though nearly a quarter of the references were inauthentic or nonscientific sources. This deficiency of legitimate scientific references does not meet standards for clinical implementation. Considering this limitation, caution should be exercised when applying the output of generative AI tools to clinical applications.
Affiliation(s)
- Franziska C S Altorfer
- Department of Spine Surgery, Hospital for Special Surgery, New York, NY, USA; University Spine Center Zurich, Balgrist University Hospital, University of Zurich, 8006 Zurich, Switzerland
- Michael J Kelly
- Department of Spine Surgery, Hospital for Special Surgery, New York, NY, USA
- Fedan Avrumova
- Department of Spine Surgery, Hospital for Special Surgery, New York, NY, USA
- Varun Rohatgi
- Department of Surgery, Weill Cornell Medicine, NY, USA
- Jiaqi Zhu
- Department of Biostatistics, Hospital for Special Surgery, New York, NY, USA
- Christopher M Bono
- Department of Orthopedics, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Darren R Lebl
- Department of Spine Surgery, Hospital for Special Surgery, New York, NY, USA.
13. Gao M, Varshney A, Chen S, Goddla V, Gallifant J, Doyle P, Novack C, Dillon-Martin M, Perkins T, Correia X, Duhaime E, Isenstein H, Sharon E, Lehmann LS, Kozono D, Anthony B, Dligach D, Bitterman DS. The use of large language models to enhance cancer clinical trial educational materials. JNCI Cancer Spectr 2025;9:pkaf021. PMID: 39921887. DOI: 10.1093/jncics/pkaf021.
Abstract
BACKGROUND Adequate patient awareness and understanding of cancer clinical trials is essential for trial recruitment, informed decision making, and protocol adherence. Although large language models (LLMs) have shown promise for patient education, their role in enhancing patient awareness of clinical trials remains unexplored. This study explored the performance and risks of LLMs in generating trial-specific educational content for potential participants. METHODS Generative Pretrained Transformer 4 (GPT4) was prompted to generate short clinical trial summaries and multiple-choice question-answer pairs from informed consent forms from ClinicalTrials.gov. Zero-shot learning was used for summaries, using a direct summarization, sequential extraction, and summarization approach. One-shot learning was used for question-answer pairs development. We evaluated performance through patient surveys of summary effectiveness and crowdsourced annotation of question-answer pair accuracy, using held-out cancer trial informed consent forms not used in prompt development. RESULTS For summaries, both prompting approaches achieved comparable results for readability and core content. Patients found summaries to be understandable and to improve clinical trial comprehension and interest in learning more about trials. The generated multiple-choice questions achieved high accuracy and agreement with crowdsourced annotators. For both summaries and multiple-choice questions, GPT4 was most likely to include inaccurate information when prompted to provide information that was not adequately described in the informed consent forms. CONCLUSIONS LLMs such as GPT4 show promise in generating patient-friendly educational content for clinical trials with minimal trial-specific engineering. The findings serve as a proof of concept for the role of LLMs in improving patient education and engagement in clinical trials, as well as the need for ongoing human oversight.
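A minimal sketch of the one-shot prompting approach described above, assuming the openai Python client (v1) and an API key in the environment; the worked example and consent-form excerpt are invented and do not come from ClinicalTrials.gov.

```python
# One-shot prompting: include a single worked example in the prompt, then ask the
# model to produce a new multiple-choice question from a consent-form excerpt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

one_shot_example = (
    "Consent excerpt: 'Participants will receive the study drug every 3 weeks.'\n"
    "Q: How often is the study drug given?\n"
    "A) Weekly  B) Every 3 weeks  C) Monthly  D) Only once\n"
    "Answer: B"
)
new_excerpt = "Consent excerpt: 'A blood sample will be collected before each treatment cycle.'"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Write one multiple-choice question with 4 options and mark the answer."},
        {"role": "user", "content": f"{one_shot_example}\n\nNow do the same for:\n{new_excerpt}"},
    ],
)
print(response.choices[0].message.content)
```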
Affiliation(s)
- Mingye Gao
- Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Aman Varshney
- Technical University of Munich, Munich 80333, Germany
- Shan Chen
- Artificial Intelligence in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, MA 02115, United States
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Vikram Goddla
- Artificial Intelligence in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, MA 02115, United States
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Jack Gallifant
- Artificial Intelligence in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, MA 02115, United States
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Patrick Doyle
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Claire Novack
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Maeve Dillon-Martin
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Teresia Perkins
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Elad Sharon
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Lisa Soleymani Lehmann
- Department of Medicine, Mass General Brigham, Harvard Medical School, Boston, MA, United States
- David Kozono
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
- Brian Anthony
- Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Danielle S Bitterman
- Artificial Intelligence in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, MA 02115, United States
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA 02115, United States
14. Nasef H, Patel H, Amin Q, Baum S, Ratnasekera A, Ang D, Havron WS, Nakayama D, Elkbuli A. Evaluating the Accuracy, Comprehensiveness, and Validity of ChatGPT Compared to Evidence-Based Sources Regarding Common Surgical Conditions: Surgeons' Perspectives. Am Surg 2025;91:325-335. PMID: 38794965. DOI: 10.1177/00031348241256075.
Abstract
BACKGROUND This study aims to assess the accuracy, comprehensiveness, and validity of ChatGPT compared to evidence-based sources regarding the diagnosis and management of common surgical conditions by surveying the perceptions of U.S. board-certified practicing surgeons. METHODS An anonymous cross-sectional survey was distributed to U.S. practicing surgeons from June 2023 to March 2024. The survey comprised 94 multiple-choice questions evaluating diagnostic and management information for five common surgical conditions from evidence-based sources or generated by ChatGPT. Statistical analysis included descriptive statistics and paired-sample t-tests. RESULTS Participating surgeons were primarily aged 40-50 years (43%), male (86%), White (57%), and had 5-10 years or >15 years of experience (86%). The majority of surgeons had no prior experience with ChatGPT in surgical practice (86%). For material discussing both acute cholecystitis and upper gastrointestinal hemorrhage, evidence-based sources were rated as significantly more comprehensive (3.57 (±.535) vs 2.00 (±1.16), P = .025) (4.14 (±.69) vs 2.43 (±.98), P < .001) and valid (3.71 (±.488) vs 2.86 (±1.07), P = .045) (3.71 (±.76) vs 2.71 (±.95), P = .038) than ChatGPT. However, there was no significant difference in accuracy between the two sources (3.71 vs 3.29, P = .289) (3.57 vs 2.71, P = .111). CONCLUSION Surveyed U.S. board-certified practicing surgeons rated evidence-based sources as significantly more comprehensive and valid compared to ChatGPT across the majority of surveyed surgical conditions. However, there was no significant difference in accuracy between the sources across the majority of surveyed conditions. While ChatGPT may offer potential benefits in surgical practice, further refinement and validation are necessary to enhance its utility and acceptance among surgeons.
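A minimal Python sketch of the paired-sample t-test used to compare surgeons' ratings of the same items from the two sources; the rating vectors are invented for illustration.

```python
# Paired-sample t-test: each surgeon rates the same condition's material from
# an evidence-based source and from ChatGPT, so the samples are paired.
from scipy.stats import ttest_rel

evidence_based = [4, 3, 4, 4, 3, 4, 3]   # one rating per surgeon, 1-4 scale
chatgpt        = [2, 3, 1, 2, 3, 2, 1]

t_stat, p_value = ttest_rel(evidence_based, chatgpt)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```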
Affiliation(s)
- Hazem Nasef
- NOVA Southeastern University, Kiran Patel College of Allopathic Medicine, Fort Lauderdale, FL, USA
- Heli Patel
- NOVA Southeastern University, Kiran Patel College of Allopathic Medicine, Fort Lauderdale, FL, USA
- Quratulain Amin
- NOVA Southeastern University, Kiran Patel College of Allopathic Medicine, Fort Lauderdale, FL, USA
- Samuel Baum
- Louisiana State University Health Science Center, College of Medicine, New Orleans, LA, USA
- Darwin Ang
- Department of Surgery, Ocala Regional Medical Center, Ocala, FL, USA
- William S Havron
- Department of Surgical Education, Orlando Regional Medical Center, Orlando, FL, USA
- Department of Surgery, Division of Trauma and Surgical Critical Care, Orlando Regional Medical Center, Orlando, FL, USA
- Don Nakayama
- Mercer University School of Medicine, Columbus, GA, USA
- Adel Elkbuli
- Department of Surgical Education, Orlando Regional Medical Center, Orlando, FL, USA
- Department of Surgery, Division of Trauma and Surgical Critical Care, Orlando Regional Medical Center, Orlando, FL, USA
15. Trapp C, Schmidt-Hegemann N, Keilholz M, Brose SF, Marschner SN, Schönecker S, Maier SH, Dehelean DC, Rottler M, Konnerth D, Belka C, Corradini S, Rogowski P. Patient- and clinician-based evaluation of large language models for patient education in prostate cancer radiotherapy. Strahlenther Onkol 2025;201:333-342. PMID: 39792259. PMCID: PMC11839798. DOI: 10.1007/s00066-024-02342-3.
Abstract
BACKGROUND This study aims to evaluate the capabilities and limitations of large language models (LLMs) for providing patient education for men undergoing radiotherapy for localized prostate cancer, incorporating assessments from both clinicians and patients. METHODS Six questions about definitive radiotherapy for prostate cancer were designed based on common patient inquiries. These questions were presented to different LLMs [ChatGPT‑4, ChatGPT-4o (both OpenAI Inc., San Francisco, CA, USA), Gemini (Google LLC, Mountain View, CA, USA), Copilot (Microsoft Corp., Redmond, WA, USA), and Claude (Anthropic PBC, San Francisco, CA, USA)] via the respective web interfaces. Responses were evaluated for readability using the Flesch Reading Ease Index. Five radiation oncologists assessed the responses for relevance, correctness, and completeness using a five-point Likert scale. Additionally, 35 prostate cancer patients evaluated the responses from ChatGPT‑4 for comprehensibility, accuracy, relevance, trustworthiness, and overall informativeness. RESULTS The Flesch Reading Ease Index indicated that the responses from all LLMs were relatively difficult to understand. All LLMs provided answers that clinicians found to be generally relevant and correct. The answers from ChatGPT‑4, ChatGPT-4o, and Claude AI were also found to be complete. However, we found significant differences between the performance of different LLMs regarding relevance and completeness. Some answers lacked detail or contained inaccuracies. Patients perceived the information as easy to understand and relevant, with most expressing confidence in the information and a willingness to use ChatGPT‑4 for future medical questions. ChatGPT-4's responses helped patients feel better informed, despite the initially standardized information provided. CONCLUSION Overall, LLMs show promise as a tool for patient education in prostate cancer radiotherapy. While improvements are needed in terms of accuracy and readability, positive feedback from clinicians and patients suggests that LLMs can enhance patient understanding and engagement. Further research is essential to fully realize the potential of artificial intelligence in patient education.
Affiliation(s)
- Christian Trapp
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany.
- Nina Schmidt-Hegemann
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Michael Keilholz
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Sarah Frederike Brose
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Sebastian N Marschner
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Stephan Schönecker
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Sebastian H Maier
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Diana-Coralia Dehelean
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Maya Rottler
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Dinah Konnerth
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Claus Belka
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Bavarian Cancer Research Center (BZKF), Munich, Germany
- German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany
- Stefanie Corradini
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Paul Rogowski
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
16. Hunter N, Allen D, Xiao D, Cox M, Jain K. Patient education resources for oral mucositis: a google search and ChatGPT analysis. Eur Arch Otorhinolaryngol 2025;282:1609-1618. PMID: 39198303. DOI: 10.1007/s00405-024-08913-5.
Abstract
PURPOSE Oral mucositis affects 90% of patients receiving chemotherapy or radiation for head and neck malignancies. Many patients use the internet to learn about their condition and treatments; however, the quality of online resources is not guaranteed. Our objective was to determine the most common Google searches related to "oral mucositis" and assess the quality and readability of available resources compared to ChatGPT-generated responses. METHODS Data related to Google searches for "oral mucositis" were analyzed. People Also Ask (PAA) questions (generated by Google) related to searches for "oral mucositis" were documented. Google resources were rated on quality, understandability, ease of reading, and reading grade level using the Journal of the American Medical Association benchmark criteria, Patient Education Materials Assessment Tool, Flesch Reading Ease Score, and Flesh-Kincaid Grade Level, respectively. ChatGPT-generated responses to the most popular PAA questions were rated using identical metrics. RESULTS Google search popularity for "oral mucositis" has significantly increased since 2004. 78% of the Google resources answered the associated PAA question, and 6% met the criteria for universal readability. 100% of the ChatGPT-generated responses answered the prompt, and 20% met the criteria for universal readability when asked to write for the appropriate audience. CONCLUSION Most resources provided by Google do not meet the criteria for universal readability. When prompted specifically, ChatGPT-generated responses were consistently more readable than Google resources. After verification of accuracy by healthcare professionals, ChatGPT could be a reasonable alternative to generate universally readable patient education resources.
Collapse
Affiliation(s)
- Nathaniel Hunter
- McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - David Allen
- Department of Otorhinolaryngology-Head and Neck Surgery, The University of Texas Health Science Center at Houston, Houston, TX, USA.
| | - Daniel Xiao
- McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Madisyn Cox
- McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Kunal Jain
- Department of Otorhinolaryngology-Head and Neck Surgery, The University of Texas Health Science Center at Houston, Houston, TX, USA
| |
Collapse
|
17
|
Nieves-Lopez B, Bechtle AR, Traverse J, Klifto C, Schoch BS, Aziz KT. Evaluating the Evolution of ChatGPT as an Information Resource in Shoulder and Elbow Surgery. Orthopedics 2025; 48:e69-e74. [PMID: 39879624 DOI: 10.3928/01477447-20250123-03] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/31/2025]
Abstract
BACKGROUND The purpose of this study was to evaluate the performance and evolution of Chat Generative Pre-Trained Transformer (ChatGPT; OpenAI) as a resource for shoulder and elbow surgery information by assessing its accuracy on the American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. We hypothesized that both ChatGPT models would demonstrate proficiency and that there would be significant improvement with progressive iterations. MATERIALS AND METHODS A total of 200 questions were selected from the 2019 and 2021 American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. ChatGPT 3.5 and 4 were used to evaluate all questions. Questions with non-text data were excluded (114 questions). The remaining questions were input into ChatGPT and categorized as follows: anatomy, arthroplasty, basic science, instability, miscellaneous, nonoperative, and trauma. The performance of each model was quantified and compared across categories with chi-square tests. The continuing medical education credit threshold of 50% was used to determine proficiency. Statistical significance was set at P<.05. RESULTS ChatGPT 3.5 and 4 answered 52.3% and 73.3% of the questions correctly, respectively (P=.003). ChatGPT 3.5 performed significantly better in the instability category (P=.037). ChatGPT 4's performance did not significantly differ across categories (P=.841). ChatGPT 4 performed significantly better than ChatGPT 3.5 in all categories except instability and miscellaneous. CONCLUSION ChatGPT 3.5 and 4 exceeded the proficiency threshold. ChatGPT 4 performed better than ChatGPT 3.5, showing an increased capability to correctly answer shoulder- and elbow-focused questions. Further refinement of ChatGPT's training may improve its performance and utility as a resource. Currently, ChatGPT remains unable to answer questions at a high enough accuracy to replace clinical decision-making. [Orthopedics. 2025;48(2):e69-e74.].
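The between-model comparison reported here (52.3% vs. 73.3% correct on the 86 text-based questions) is a comparison of proportions of the kind a chi-square test handles. The sketch below is illustrative only: the correct/incorrect counts are reconstructed approximately from the reported percentages rather than taken from the study's raw data.

```python
from scipy.stats import chi2_contingency

# Approximate counts: 52.3% and 73.3% of the 86 included questions.
gpt35 = [45, 86 - 45]  # [correct, incorrect] for ChatGPT 3.5
gpt4 = [63, 86 - 63]   # [correct, incorrect] for ChatGPT 4

chi2, p, dof, expected = chi2_contingency([gpt35, gpt4])
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```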
Collapse
|
18
|
Xu J, Mankowski M, Vanterpool KB, Strauss AT, Lonze BE, Orandi BJ, Stewart D, Bae S, Ali N, Stern J, Mattoo A, Robalino R, Soomro I, Weldon E, Oermann EK, Aphinyanaphongs Y, Sidoti C, McAdams-DeMarco M, Massie AB, Gentry SE, Segev DL, Levan ML. Trials and Tribulations: Responses of ChatGPT to Patient Questions About Kidney Transplantation. Transplantation 2025; 109:399-402. [PMID: 39477825 DOI: 10.1097/tp.0000000000005261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/20/2025]
Affiliation(s)
- Jingzhi Xu
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
| | - Michal Mankowski
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
| | | | - Alexandra T Strauss
- Department of Medicine, School of Medicine, Johns Hopkins University, Baltimore, MD
| | - Bonnie E Lonze
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
| | - Babak J Orandi
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
| | - Darren Stewart
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
| | - Sunjae Bae
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Population Health, NYU Grossman School of Medicine, New York, NY
| | - Nicole Ali
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
| | - Jeffrey Stern
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
| | - Aprajita Mattoo
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
| | - Ryan Robalino
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
| | - Irfana Soomro
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
| | - Elaina Weldon
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
| | - Eric K Oermann
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
- Department of Neurosurgery, NYU Grossman School of Medicine, New York, NY
| | - Yin Aphinyanaphongs
- Department of Medicine, NYU Grossman School of Medicine, New York, NY
- Department of Population Health, NYU Grossman School of Medicine, New York, NY
| | - Carolyn Sidoti
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
| | - Mara McAdams-DeMarco
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Population Health, NYU Grossman School of Medicine, New York, NY
| | - Allan B Massie
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Population Health, NYU Grossman School of Medicine, New York, NY
| | - Sommer E Gentry
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Population Health, NYU Grossman School of Medicine, New York, NY
| | - Dorry L Segev
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Population Health, NYU Grossman School of Medicine, New York, NY
| | - Macey L Levan
- Department of Surgery, NYU Grossman School of Medicine, New York, NY
- Department of Population Health, NYU Grossman School of Medicine, New York, NY
| |
Collapse
|
19
|
Zhou Z, Qin P, Cheng X, Shao M, Ren Z, Zhao Y, Li Q, Liu L. ChatGPT in Oncology Diagnosis and Treatment: Applications, Legal and Ethical Challenges. Curr Oncol Rep 2025:10.1007/s11912-025-01649-3. [PMID: 39998782 DOI: 10.1007/s11912-025-01649-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/01/2025] [Indexed: 02/27/2025]
Abstract
PURPOSE OF REVIEW This study aims to systematically review the trajectory of artificial intelligence (AI) development in the medical field, with a particular emphasis on ChatGPT, a cutting-edge tool that is transforming oncology's diagnosis and treatment practices. RECENT FINDINGS Recent advancements have demonstrated that ChatGPT can be effectively utilized in various areas, including collecting medical histories, conducting radiological and pathological diagnoses, generating electronic medical records (EMRs), providing nutritional support, participating in multidisciplinary teams (MDTs), and formulating personalized, multidisciplinary treatment plans. However, significant challenges related to data privacy and legal issues need to be addressed for the safe and effective integration of ChatGPT into clinical practice. ChatGPT, an emerging AI technology, opens up new avenues and viewpoints for oncology diagnosis and treatment. If current technological and legal challenges can be overcome, ChatGPT is expected to play a more significant role in oncology diagnosis and treatment in the future, providing better treatment options and improving the quality of medical services.
Collapse
Affiliation(s)
- Zihan Zhou
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
| | - Peng Qin
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
| | - Xi Cheng
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
| | - Maoxuan Shao
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
| | - Zhaozheng Ren
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
| | - Yiting Zhao
- Stomatological College of Nanjing Medical University, Nanjing, 211166, China
| | - Qiunuo Li
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
| | - Lingxiang Liu
- Department of Oncology, The First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China.
| |
Collapse
|
20
|
Pugliese N, Bertazzoni A, Hassan C, Schattenberg JM, Aghemo A. Revolutionizing MASLD: How Artificial Intelligence Is Shaping the Future of Liver Care. Cancers (Basel) 2025; 17:722. [PMID: 40075570 PMCID: PMC11899536 DOI: 10.3390/cancers17050722] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2025] [Revised: 02/08/2025] [Accepted: 02/17/2025] [Indexed: 03/14/2025] Open
Abstract
Metabolic dysfunction-associated steatotic liver disease (MASLD) is emerging as a leading cause of chronic liver disease. In recent years, artificial intelligence (AI) has attracted significant attention in healthcare, particularly in diagnostics, patient management, and drug development, demonstrating immense potential for application and implementation. In the field of MASLD, substantial research has explored the application of AI in various areas, including patient counseling, improved patient stratification, enhanced diagnostic accuracy, drug development, and prognosis prediction. However, the integration of AI in hepatology is not without challenges. Key issues include data management and privacy, algorithmic bias, and the risk of AI-generated inaccuracies, commonly referred to as "hallucinations". This review aims to provide a comprehensive overview of the applications of AI in hepatology, with a focus on MASLD, highlighting both its transformative potential and its inherent limitations.
Collapse
Affiliation(s)
- Nicola Pugliese
- Department of Biomedical Sciences, Humanitas University, 20072 Pieve Emanuele, MI, Italy; (N.P.); (A.B.); (C.H.)
- Division of Internal Medicine and Hepatology, Department of Gastroenterology, IRCCS Humanitas Research Hospital, 20089 Rozzano, MI, Italy
| | - Arianna Bertazzoni
- Department of Biomedical Sciences, Humanitas University, 20072 Pieve Emanuele, MI, Italy; (N.P.); (A.B.); (C.H.)
- Division of Internal Medicine and Hepatology, Department of Gastroenterology, IRCCS Humanitas Research Hospital, 20089 Rozzano, MI, Italy
| | - Cesare Hassan
- Department of Biomedical Sciences, Humanitas University, 20072 Pieve Emanuele, MI, Italy; (N.P.); (A.B.); (C.H.)
- Endoscopy Unit, Department of Gastroenterology, IRCCS Humanitas Research Hospital, 20089 Rozzano, MI, Italy
| | - Jörn M. Schattenberg
- Department of Internal Medicine II, Saarland University Medical Center, 66421 Homburg, Germany;
| | - Alessio Aghemo
- Department of Biomedical Sciences, Humanitas University, 20072 Pieve Emanuele, MI, Italy; (N.P.); (A.B.); (C.H.)
- Division of Internal Medicine and Hepatology, Department of Gastroenterology, IRCCS Humanitas Research Hospital, 20089 Rozzano, MI, Italy
| |
Collapse
|
21
|
Guo S, Li R, Li G, Chen W, Huang J, He L, Ma Y, Wang L, Zheng H, Tian C, Zhao Y, Pan X, Wan H, Liu D, Li Z, Lei J. Comparing ChatGPT's and Surgeon's Responses to Thyroid-related Questions From Patients. J Clin Endocrinol Metab 2025; 110:e841-e850. [PMID: 38597169 DOI: 10.1210/clinem/dgae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/12/2024] [Revised: 04/03/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]
Abstract
CONTEXT For some common thyroid-related conditions with high prevalence and long follow-up times, ChatGPT can be used to respond to common thyroid-related questions. OBJECTIVE In this cross-sectional study, we assessed the ability of ChatGPT (version GPT-4.0) to provide accurate, comprehensive, compassionate, and satisfactory responses to common thyroid-related questions. METHODS First, we obtained 28 thyroid-related questions from the Huayitong app, which, together with 2 interfering questions, formed a final set of 30 questions. Then, these questions were answered separately by ChatGPT (on July 19, 2023) and by a junior specialist and a senior specialist (on July 20, 2023). Finally, 26 patients and 11 thyroid surgeons evaluated those responses on 4 dimensions: accuracy, comprehensiveness, compassion, and satisfaction. RESULTS Among the 30 questions and responses, ChatGPT's speed of response was faster than that of the junior specialist (8.69 [7.53-9.48] vs 4.33 [4.05-4.60]; P < .001) and the senior specialist (8.69 [7.53-9.48] vs 4.22 [3.36-4.76]; P < .001). The word count of ChatGPT's responses was greater than that of both the junior specialist (341.50 [301.00-384.25] vs 74.50 [51.75-84.75]; P < .001) and senior specialist (341.50 [301.00-384.25] vs 104.00 [63.75-177.75]; P < .001). ChatGPT received higher scores than the junior specialist and senior specialist in terms of accuracy, comprehensiveness, compassion, and satisfaction in responding to common thyroid-related questions. CONCLUSION ChatGPT performed better than a junior specialist and senior specialist in answering common thyroid-related questions, but further research is needed to validate the logical ability of ChatGPT for complex thyroid questions.
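The medians, bracketed ranges, and P values reported here are characteristic of a nonparametric two-group comparison. As a hedged illustration (the abstract does not state which test was used), the sketch below applies a Mann-Whitney U test to hypothetical word counts that only loosely echo the reported medians.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical word counts per response; the values are made up and only
# loosely echo the reported medians (ChatGPT ~340 words, junior specialist ~75).
chatgpt_words = rng.normal(340, 40, size=30).round()
junior_words = rng.normal(75, 20, size=30).round()

u_stat, p_value = mannwhitneyu(chatgpt_words, junior_words, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.2e}")
```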
Collapse
Affiliation(s)
- Siyin Guo
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Ruicen Li
- Health Management Center, General Practice Medical Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Genpeng Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Wenjie Chen
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Jing Huang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Linye He
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Yu Ma
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Liying Wang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Hongping Zheng
- Department of Thyroid Surgery, General Surgery Ward 7, The First Hospital of Lanzhou University, Lanzhou, Gansu 730000, China
| | - Chunxiang Tian
- Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan 610031, China
| | - Yatong Zhao
- Thyroid Surgery, Zhengzhou Central Hospital Affiliated of Zhengzhou University, Zhengzhou, Henan 450007, China
| | - Xinmin Pan
- Department of Thyroid Surgery, General Surgery III, Gansu Provincial Hospital, Lanzhou, Gansu 730000, China
| | - Hongxing Wan
- Department of Oncology, Sanya People's Hospital, Sanya, Hainan 572000, China
| | - Dasheng Liu
- Department of Vascular Thyroid Surgery, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510120, China
| | - Zhihui Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Jianyong Lei
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| |
Collapse
|
22
|
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463 PMCID: PMC11795331 DOI: 10.1001/jamanetworkopen.2024.57879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/26/2024] [Indexed: 02/06/2025] Open
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian, yielding 7752 articles. Two reviewers screened articles by title and abstract, followed by full-text review, to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Collapse
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
| | - Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
| | - Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
| | | | | | - Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
23
|
Bader S, Schneider MO, Psilopatis I, Anetsberger D, Emons J, Kehl S. [AI-supported decision-making in obstetrics - a feasibility study on the medical accuracy and reliability of ChatGPT]. Z Geburtshilfe Neonatol 2025; 229:15-21. [PMID: 39401518 DOI: 10.1055/a-2411-9516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2025]
Abstract
The aim of this study is to investigate the feasibility of artificial intelligence in the interpretation and application of medical guidelines to support clinical decision-making in obstetrics. ChatGPT was provided with guidelines on specific obstetric issues. Using several clinical scenarios as examples, the AI was then evaluated for its ability to make accurate diagnoses and appropriate clinical decisions. The results varied, with ChatGPT providing predominantly correct answers in some fictional scenarios but performing inadequately in others. Despite ChatGPT's ability to grasp complex medical information, the study revealed limitations in the precision and reliability of its interpretations and recommendations. These discrepancies highlight the need for careful review by healthcare professionals and underscore the importance of clear, unambiguous guideline recommendations. Furthermore, continuous technical development is required to harness artificial intelligence as a supportive tool in clinical practice. Overall, while the use of AI in medicine shows promise, its susceptibility to error and weaknesses in interpretation mean that, to safeguard the safety and accuracy of patient care, its use currently remains best suited to controlled scientific settings.
Collapse
Affiliation(s)
- Simon Bader
- Frauenklinik, Universitätsklinikum Erlangen, Erlangen, Germany
| | | | | | | | - Julius Emons
- Frauenklinik, Universitätsklinikum Erlangen, Erlangen, Germany
| | - Sven Kehl
- Frauenklinik, Klinik Hallerwiese, Nürnberg, Germany
| |
Collapse
|
24
|
Cohen ND, Ho M, McIntire D, Smith K, Kho KA. A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis. AJOG GLOBAL REPORTS 2025; 5:100405. [PMID: 39810943 PMCID: PMC11730533 DOI: 10.1016/j.xagr.2024.100405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2025] Open
Abstract
Introduction The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them. Objective This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them. Study Design Three LLMs, including Chat GPT-4 (Open AI), Claude (Anthropic), and Bard (Google) were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared to current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale included the following: (1) Completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged between the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the reviewers' strength of agreement in ranking the LLMs' responses for each item. Results Average scores for the 10 answers amongst Bard, Chat GPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement between the nine reviewers. There were no questions the models could answer comprehensively or correctly across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. Chatbots showed an improved ability to accurately answer questions about symptoms and pathophysiology over treatment and risk of recurrence. Conclusion The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain a thorough ongoing evaluation process for outputs to provide the most comprehensive and accurate information to patients. Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.
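Kendall's W, used here to gauge agreement among the nine reviewers, can be computed directly from a raters-by-items rank matrix, with the related chi-square statistic m(n - 1)W tested on n - 1 degrees of freedom. The sketch below is a minimal implementation without tie correction, applied to hypothetical 1-5 ratings of three chatbots; it is not the study's data or code.

```python
import numpy as np
from scipy.stats import rankdata, chi2

def kendalls_w(ratings: np.ndarray):
    """ratings: raters x items score matrix; returns (W, chi-square p-value)."""
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # rank each rater's scores
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    w = 12 * s / (m ** 2 * (n ** 3 - n))  # no tie correction applied
    p = chi2.sf(m * (n - 1) * w, df=n - 1)
    return w, p

# Hypothetical ratings from nine reviewers for three chatbots on one question.
scores = np.array([[4, 5, 4], [3, 4, 4], [4, 4, 3], [3, 5, 4], [4, 4, 4],
                   [3, 5, 3], [4, 4, 4], [3, 5, 4], [4, 4, 3]])
w, p = kendalls_w(scores)
print(f"Kendall's W = {w:.2f}, p = {p:.3f}")
```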
Collapse
Affiliation(s)
- Natalie D. Cohen
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Milan Ho
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Donald McIntire
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Katherine Smith
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Kimberly A. Kho
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| |
Collapse
|
25
|
Tangsrivimol JA, Darzidehkalani E, Virk HUH, Wang Z, Egger J, Wang M, Hacking S, Glicksberg BS, Strauss M, Krittanawong C. Benefits, limits, and risks of ChatGPT in medicine. Front Artif Intell 2025; 8:1518049. [PMID: 39949509 PMCID: PMC11821943 DOI: 10.3389/frai.2025.1518049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2024] [Accepted: 01/15/2025] [Indexed: 02/16/2025] Open
Abstract
ChatGPT represents a transformative technology in healthcare, with demonstrated impacts across clinical practice, medical education, and research. Studies show significant efficiency gains, including 70% reduction in administrative time for discharge summaries and achievement of medical professional-level performance on standardized tests (60% accuracy on USMLE, 78.2% on PubMedQA). ChatGPT offers personalized learning platforms, automated scoring, and instant access to vast medical knowledge in medical education, addressing resource limitations and enhancing training efficiency. It streamlines clinical workflows by supporting triage processes, generating discharge summaries, and alleviating administrative burdens, allowing healthcare professionals to focus more on patient care. Additionally, ChatGPT facilitates remote monitoring and chronic disease management, providing personalized advice, medication reminders, and emotional support, thus bridging gaps between clinical visits. Its ability to process and synthesize vast amounts of data accelerates research workflows, aiding in literature reviews, hypothesis generation, and clinical trial designs. This paper aims to gather and analyze published studies involving ChatGPT, focusing on exploring its advantages and disadvantages within the healthcare context. To aid in understanding and progress, our analysis is organized into six key areas: (1) Information and Education, (2) Triage and Symptom Assessment, (3) Remote Monitoring and Support, (4) Mental Healthcare Assistance, (5) Research and Decision Support, and (6) Language Translation. Realizing ChatGPT's full potential in healthcare requires addressing key limitations, such as its lack of clinical experience, inability to process visual data, and absence of emotional intelligence. Ethical, privacy, and regulatory challenges further complicate its integration. Future improvements should focus on enhancing accuracy, developing multimodal AI models, improving empathy through sentiment analysis, and safeguarding against artificial hallucination. While not a replacement for healthcare professionals, ChatGPT can serve as a powerful assistant, augmenting their expertise to improve efficiency, accessibility, and quality of care. This collaboration ensures responsible adoption of AI in transforming healthcare delivery. While ChatGPT demonstrates significant potential in healthcare transformation, systematic evaluation of its implementation across different healthcare settings reveals varying levels of evidence quality-from robust randomized trials in medical education to preliminary observational studies in clinical practice. This heterogeneity in evidence quality necessitates a structured approach to future research and implementation.
Collapse
Affiliation(s)
- Jonathan A. Tangsrivimol
- Department of Neurosurgery, and Neuroscience, Weill Cornell Medicine, NewYork-Presbyterian Hospital, New York, NY, United States
- Department of Neurosurgery, Chulabhorn Hospital, Chulabhorn Royal Academy, Bangkok, Thailand
| | - Erfan Darzidehkalani
- MIT Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, United States
| | - Hafeez Ul Hassan Virk
- Harrington Heart & Vascular Institute, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH, United States
| | - Zhen Wang
- Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, United States
- Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Jan Egger
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Essen, Germany
| | - Michelle Wang
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, United States
| | - Sean Hacking
- Department of Pathology, NYU Grossman School of Medicine, New York, NY, United States
| | - Benjamin S. Glicksberg
- Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Markus Strauss
- Department of Cardiology I, Coronary and Peripheral Vascular Disease, Heart Failure Medicine, University Hospital Muenster, Muenster, Germany
- Department of Cardiology, Sector Preventive Medicine, Health Promotion, Faculty of Health, School of Medicine, University Witten/Herdecke, Hagen, Germany
| | - Chayakrit Krittanawong
- Cardiology Division, New York University Langone Health, New York University School of Medicine, New York, NY, United States
- HumanX, Delaware, DE, United States
| |
Collapse
|
26
|
Maron CM, Emile SH, Horesh N, Freund MR, Pellino G, Wexner SD. Comparing answers of ChatGPT and Google Gemini to common questions on benign anal conditions. Tech Coloproctol 2025; 29:57. [PMID: 39864043 DOI: 10.1007/s10151-024-03096-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/25/2024] [Accepted: 12/22/2024] [Indexed: 01/27/2025]
Abstract
INTRODUCTION Chatbots have been increasingly used as a source of patient education. This study aimed to compare the answers of ChatGPT-4 and Google Gemini to common questions on benign anal conditions in terms of appropriateness, comprehensiveness, and language level. METHODS Each chatbot was asked a set of 30 questions on hemorrhoidal disease, anal fissures, and anal fistulas. The responses were assessed for appropriateness, comprehensiveness, and reference provision. The assessments were made by three subject experts who were unaware of the name of the chatbots. The language level of the chatbot answers was assessed using the Flesch-Kincaid Reading Ease score and grade level. RESULTS Overall, the answers provided by both models were appropriate and comprehensive. The answers of Google Gemini were more appropriate, comprehensive, and supported by references compared with the answers of ChatGPT. In addition, the agreement among the assessors on the appropriateness of Google Gemini answers was higher, attesting to higher consistency. ChatGPT had a significantly higher Flesch-Kincaid grade level than Google Gemini (12.3 versus 10.6, p = 0.015), but a similar median Flesch-Kincaid Reading Ease score. CONCLUSIONS The answers of Google Gemini to questions on common benign anal conditions were more appropriate and comprehensive, and more often supported with references, than the answers of ChatGPT. The answers of both chatbots were at grade levels higher than the 6th grade level, which may be difficult for nonmedical individuals to comprehend.
Collapse
Affiliation(s)
| | - S H Emile
- Ellen Leifer Shulman and Steven Shulman Digestive Disease Center, Cleveland Clinic Florida, 2950 Cleveland Clinic Blvd, Weston, FL, USA
- Colorectal Surgery Unit, Mansoura University Hospitals, Mansoura, Egypt
| | - N Horesh
- Ellen Leifer Shulman and Steven Shulman Digestive Disease Center, Cleveland Clinic Florida, 2950 Cleveland Clinic Blvd, Weston, FL, USA
- Department of Surgery and Transplantations, Sheba Medical Center, Ramat Gan, Tel Aviv, Israel
| | - M R Freund
- Department of General Surgery, Shaare Zedek Medical Center, Faculty of Medicine, Jerusalem, Israel
| | - G Pellino
- Colorectal Surgery, Vall d'Hebron University Hospital, Universitat Autonoma de Barcelona UAB, Barcelona, Spain
- Department of Advanced Medical and Surgical Sciences, Università Degli Studi Della Campania "Luigi Vanvitelli", Naples, Italy
| | - S D Wexner
- Ellen Leifer Shulman and Steven Shulman Digestive Disease Center, Cleveland Clinic Florida, 2950 Cleveland Clinic Blvd, Weston, FL, USA.
| |
Collapse
|
27
|
Omar M, Nassar S, Sharif K, Glicksberg BS, Nadkarni GN, Klang E. Emerging applications of NLP and large language models in gastroenterology and hepatology: a systematic review. Front Med (Lausanne) 2025; 11:1512824. [PMID: 39917263 PMCID: PMC11799763 DOI: 10.3389/fmed.2024.1512824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2024] [Accepted: 12/09/2024] [Indexed: 02/09/2025] Open
Abstract
Background and aim In recent years, natural language processing (NLP) has transformed significantly with the introduction of large language models (LLMs). This review provides an update on NLP and LLM applications and challenges in gastroenterology and hepatology. Methods Registered with PROSPERO (CRD42024542275) and adhering to PRISMA guidelines, we searched six databases for relevant studies published from 2003 to 2024, ultimately including 57 studies. Results Our review of 57 studies notes an increase in relevant publications in 2023-2024 compared to previous years, reflecting growing interest in newer models such as GPT-3 and GPT-4. The results demonstrate that NLP models have enhanced data extraction from electronic health records and other unstructured medical data sources. Key findings include high precision in identifying disease characteristics from unstructured reports and ongoing improvement in clinical decision-making. Risk of bias assessments using ROBINS-I, QUADAS-2, and PROBAST tools confirmed the methodological robustness of the included studies. Conclusion NLP and LLMs can enhance diagnosis and treatment in gastroenterology and hepatology. They enable data extraction from unstructured medical records, such as endoscopy reports and patient notes, and can enhance clinical decision-making. Despite these advancements, integrating these tools into routine practice is still challenging. Future work should prospectively demonstrate real-world value.
Collapse
Affiliation(s)
- Mahmud Omar
- Maccabi Health Services, Tel Aviv, Israel
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | | | - Kassem Sharif
- Department of Gastroenterology, Sheba Medical Center, Tel HaShomer, Israel
| | - Benjamin S. Glicksberg
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Girish N. Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States
| |
Collapse
|
28
|
Li Y, Huang CK, Hu Y, Zhou XD, He C, Zhong JW. Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study. World J Gastroenterol 2025; 31:101092. [PMID: 39839898 PMCID: PMC11684168 DOI: 10.3748/wjg.v31.i3.101092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/04/2024] [Revised: 10/29/2024] [Accepted: 12/03/2024] [Indexed: 12/20/2024] Open
Abstract
BACKGROUND Patients with hepatitis B virus (HBV) infection require chronic and personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients. AIM To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions. METHODS LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was run three times using the three LLMs. Readability was assessed via the Gunning Fog index and Flesch-Kincaid grade level. RESULTS Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). With respect to objective questions, ChatGPT-4.0 achieved an 80.8% accuracy rate, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed better in terms of diagnosis, whereas Google Gemini performed best in the clinical manifestations domain. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of the three LLM chatbots were significantly higher than the standard eighth-grade level, far exceeding the reading level of the general population. CONCLUSION Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may be an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
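The Gunning Fog index mentioned here is a simple function of average sentence length and the share of "complex" words (three or more syllables): 0.4 * (average words per sentence + 100 * complex words / total words). The sketch below is illustrative only and reuses a crude vowel-group syllable heuristic rather than the readability tooling used in the study.

```python
import re

def syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; real tools use pronunciation dictionaries.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def gunning_fog(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = sum(1 for w in words if syllables(w) >= 3)
    return 0.4 * (len(words) / sentences + 100 * complex_words / max(len(words), 1))

print(gunning_fog("Chronic hepatitis B requires long-term monitoring. "
                  "Antiviral therapy can suppress viral replication."))
```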
Collapse
Affiliation(s)
- Yu Li
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
- HuanKui Academy, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Chen-Kai Huang
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Yi Hu
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Xiao-Dong Zhou
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Cong He
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Jia-Wei Zhong
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| |
Collapse
|
29
|
Çıtır M. ChatGPT and oral cancer: a study on informational reliability. BMC Oral Health 2025; 25:86. [PMID: 39833819 PMCID: PMC11745001 DOI: 10.1186/s12903-025-05479-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2024] [Accepted: 01/13/2025] [Indexed: 01/22/2025] Open
Abstract
BACKGROUND Artificial intelligence (AI) and large language models (LLMs) like ChatGPT have transformed information retrieval, including in healthcare. ChatGPT, trained on diverse datasets, can provide medical advice but faces ethical and accuracy concerns. This study evaluates the accuracy of ChatGPT-3.5's answers to frequently asked questions about oral cancer, a condition where early diagnosis is crucial for improving patient outcomes. METHODS A total of 20 questions, selected from Google Trends and from questions asked by patients in the clinic, were posed to ChatGPT-3.5. The responses provided by ChatGPT were evaluated for accuracy by medical oncologists and oral and maxillofacial radiologists. Inter-rater agreement was assessed using Fleiss' and Cohen's kappa tests. The scores given by the two specialties were compared using the Mann-Whitney U test. The references provided by ChatGPT-3.5 were evaluated for authenticity. RESULTS Of the 80 responses from 20 questions, 41 (51.25%) were rated as very good, 37 (46.25%) as good, and 2 (2.50%) as acceptable. There was no significant difference between oral and maxillofacial radiologists and medical oncologists for any of the 20 questions. Of the 81 references in the ChatGPT-3.5 answers, only 13 were scientific articles, 10 were fabricated, and the remainder were drawn from websites. CONCLUSION ChatGPT provided reliable information about oral cancer and did not offer incorrect information or suggestions. However, not all of the information provided by ChatGPT is based on real references.
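Inter-rater agreement statistics of the kind reported here are available in standard libraries. The sketch below computes Cohen's kappa for two raters on hypothetical 4-point accuracy ratings using scikit-learn; Fleiss' kappa for more than two raters is available, for example, in statsmodels. The ratings are invented for illustration and do not reflect the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 4-point accuracy ratings (1 = poor ... 4 = very good) for 20 answers.
oncologist  = [4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4]
radiologist = [4, 4, 3, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4]

# Weighted kappa is often preferred for ordinal scales.
print(cohen_kappa_score(oncologist, radiologist, weights="quadratic"))
```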
Collapse
Affiliation(s)
- Mesude Çıtır
- Faculty of Dentistry, Department of Dentomaxillofacial Radiology, Tokat Gaziosmanpasa University, Tokat, Turkey.
| |
Collapse
|
30
|
Xu Q, Wang J, Chen X, Wang J, Li H, Wang Z, Li W, Gao J, Chen C, Gao Y. Assessing the Efficacy of ChatGPT Prompting Strategies in Enhancing Thyroid Cancer Patient Education: A Prospective Study. J Med Syst 2025; 49:11. [PMID: 39820814 DOI: 10.1007/s10916-024-02129-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2024] [Accepted: 11/18/2024] [Indexed: 01/19/2025]
Abstract
With the rise of AI platforms, patients increasingly use them for information, relying on advanced language models like ChatGPT for answers and advice. However, the effectiveness of ChatGPT in educating thyroid cancer patients remains unclear. We designed 50 questions covering key areas of thyroid cancer management and generated corresponding responses under four different prompt strategies. These answers were evaluated based on four dimensions: accuracy, comprehensiveness, human care, and satisfaction. Additionally, the readability of the responses was assessed using the Flesch-Kincaid grade level, Gunning Fog Index, Simple Measure of Gobbledygook, and Fry readability score. We also statistically analyzed the references in the responses generated by ChatGPT. The type of prompt significantly influences the quality of ChatGPT's responses. Notably, the "statistics and references" prompt yields the highest quality outcomes. Prompts tailored to a "6th-grade level" generated the most easily understandable text, whereas responses without specific prompts were the most complex. Additionally, the "statistics and references" prompt produced the longest responses while the "6th-grade level" prompt resulted in the shortest. Notably, 87.84% of citations referenced published medical literature, but 12.82% contained misinformation or errors. ChatGPT demonstrates considerable potential for enhancing the readability and quality of thyroid cancer patient education materials. By adjusting prompt strategies, ChatGPT can generate responses that cater to diverse patient needs, improving their understanding and management of the disease. However, AI-generated content must be carefully supervised to ensure that the information it provides is accurate.
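Prompt-strategy comparisons like the one described here can be scripted by sending the same clinical question under several prompt framings and then scoring each response. The sketch below uses the OpenAI Python client as a hedged illustration: the model name, prompt wordings, and question are assumptions, not the study's protocol, and any output would still need expert review before reaching patients.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = "What follow-up do I need after surgery for papillary thyroid cancer?"
PROMPTS = {
    "no_prompt": QUESTION,
    "sixth_grade": f"Answer at a 6th-grade reading level: {QUESTION}",
    "stats_refs": f"Answer with relevant statistics and cite references: {QUESTION}",
}

for name, prompt in PROMPTS.items():
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the study accessed ChatGPT directly
        messages=[{"role": "user", "content": prompt}],
    )
    text = reply.choices[0].message.content
    print(f"--- {name}: {len(text.split())} words ---")
```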
Collapse
Affiliation(s)
- Qi Xu
- Department of Breast and Thyroid Surgery, The First Affiliated Hospital of Nanyang Medical College, Nanyang, China.
- Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, China.
- Surgical Department, Nanyang Central Hospital, Nanyang, China.
| | - Jing Wang
- Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, China
| | - Xiaohui Chen
- Department of Breast and Thyroid Surgery, The Second Affiliated Hospital of Guilin Medical University, Guilin, China
| | - Jiale Wang
- Department of Internal Medicine, The Affiliated Jiangning Hospital of Nanjing Medical University, Nanjing, China
| | - Hanzhi Li
- Department of Breast and Thyroid Surgery, The First Affiliated Hospital of Nanyang Medical College, Nanyang, China
| | - Zheng Wang
- Department of Breast Surgery, Nanyang Central Hospital, Nanyang, China
| | - Weihan Li
- Department of Thyroid and Breast Surgery, Nanyang Central Hospital, Nanyang, China
| | - Jinliang Gao
- Department of Thyroid and Breast Surgery, Nanyang Central Hospital, Nanyang, China
| | - Chen Chen
- Department of Breast Surgery, Nanyang Central Hospital, Nanyang, China
| | - Yuwan Gao
- Department of Ophthalmology, The First Affiliated Hospital of Nanyang Medical College, Nanyang, China
| |
Collapse
|
31
|
Kim J, Vajravelu BN. Assessing the Current Limitations of Large Language Models in Advancing Health Care Education. JMIR Form Res 2025; 9:e51319. [PMID: 39819585 PMCID: PMC11756841 DOI: 10.2196/51319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Revised: 08/31/2024] [Accepted: 09/03/2024] [Indexed: 01/19/2025] Open
Abstract
The integration of large language models (LLMs), as seen with the generative pre-trained transformer series, into health care education and clinical management represents transformative potential. The practical use of current LLMs in health care sparks great anticipation for new avenues, yet their adoption also elicits considerable concerns that necessitate careful deliberation. This study aims to evaluate the application of state-of-the-art LLMs in health care education, highlighting the following shortcomings as areas requiring significant and urgent improvements: (1) threats to academic integrity, (2) dissemination of misinformation and risks of automation bias, (3) challenges with information completeness and consistency, (4) inequity of access, (5) risks of algorithmic bias, (6) exhibition of moral instability, (7) technological limitations in plugin tools, and (8) lack of regulatory oversight in addressing legal and ethical challenges. Future research should focus on strategically addressing the persistent challenges of LLMs highlighted in this paper, opening the door for effective measures that can improve their application in health care education.
Collapse
Affiliation(s)
- JaeYong Kim
- School of Pharmacy, Massachusetts College of Pharmacy and Health Sciences, Boston, MA, United States
| | - Bathri Narayan Vajravelu
- Department of Physician Assistant Studies, Massachusetts College of Pharmacy and Health Sciences, 179 Longwood Avenue, Boston, MA, 02115, United States, 1 6177322961
| |
Collapse
|
32
|
Zhang Y, Lu X, Luo Y, Zhu Y, Ling W. Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis. JMIR Med Inform 2025; 13:e63924. [PMID: 39814698 PMCID: PMC11737282 DOI: 10.2196/63924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2024] [Revised: 10/23/2024] [Accepted: 11/19/2024] [Indexed: 01/18/2025] Open
Abstract
Background Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic. Objective This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers. Methods We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel. Results Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). Both models showed a performance decline in English, but ERNIE Bot's decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis. Conclusions Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use.
Collapse
Affiliation(s)
- Yong Zhang
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| | - Xiao Lu
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| | - Yan Luo
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| | - Ying Zhu
- Department of Thoracic Surgery, West China Hospital of Sichuan University, Chengdu, China
| | - Wenwu Ling
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| |
Collapse
|
33
|
Derbal Y. Adaptive Treatment of Metastatic Prostate Cancer Using Generative Artificial Intelligence. Clin Med Insights Oncol 2025; 19:11795549241311408. [PMID: 39776668 PMCID: PMC11701910 DOI: 10.1177/11795549241311408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2024] [Accepted: 12/12/2024] [Indexed: 01/11/2025] Open
Abstract
Despite the expanding therapeutic options available to cancer patients, therapeutic resistance, disease recurrence, and metastasis persist as hallmark challenges in the treatment of cancer. The rise to prominence of generative artificial intelligence (GenAI) in many realms of human activity is compelling consideration of its capabilities as a potential lever to advance the development of effective cancer treatments. This article presents a hypothetical case study on the application of generative pre-trained transformers (GPTs) to the treatment of metastatic prostate cancer (mPC). The case explores the design of GPT-supported adaptive intermittent therapy for mPC. Testosterone and prostate-specific antigen (PSA) are assumed to be monitored repeatedly, while treatment may involve a combination of androgen deprivation therapy (ADT), androgen receptor-signalling inhibitors (ARSI), chemotherapy, and radiotherapy. The analysis covers various questions relevant to the configuration, training, and inferencing of GPTs for the case of mPC treatment, with particular attention to risk mitigation regarding the hallucination problem and its implications for the clinical integration of GenAI technologies. The case study provides elements of an actionable pathway to the realization of GenAI-assisted adaptive treatment of metastatic prostate cancer. As such, the study is expected to help facilitate the design of clinical trials of GenAI-supported cancer treatments.
Collapse
Affiliation(s)
- Youcef Derbal
- Ted Rogers School of Information Technology Management, Toronto Metropolitan University, Toronto, ON, Canada
| |
Collapse
|
34
|
Barlas İŞ, Tunç L. Quality of Chatbot Responses to the Most Popular Questions Regarding Erectile Dysfunction. UROLOGY RESEARCH & PRACTICE 2025; 50:253-260. [PMID: 39873458 PMCID: PMC11883663 DOI: 10.5152/tud.2025.24098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Received: 07/31/2024] [Accepted: 10/14/2024] [Indexed: 01/30/2025]
Abstract
Objective Erectile dysfunction (ED) is a common cause of male sexual dysfunction. We aimed to evaluate the quality of the responses of ChatGPT and Gemini to the most frequently asked questions about ED. Methods This cross-sectional, observational study used Google Trends to identify the most frequently asked questions about ED on the internet, and the answers of ChatGPT-3.5 and Gemini to these questions were compared. Two board-certified urologists assessed the quality of the responses using the Global Quality Score (GQS). Results Fifteen questions about ED were included based on Google Trends. ChatGPT was able to answer all the questions systematically, whereas Gemini could not answer two questions. When both researchers rated response quality with the GQS, low-quality responses were more frequent from Gemini than from ChatGPT. The agreement between researchers was 92% for ChatGPT and 95% for Gemini. Conclusion Despite the rapid and comprehensive answers provided by chatbots, we identified inadequacies in their responses related to ED. In their current state, they cannot replace the patient-centered approach of healthcare professionals and require further development.
Collapse
Affiliation(s)
| | - Lütfi Tunç
- Clinic of Urology, Ankara Acibadem Hospital, Ankara, Türkiye
| |
Collapse
|
35
|
Guo Y, Li T, Xie J, Luo M, Zheng C. Evaluating the accuracy, time and cost of GPT-4 and GPT-4o in liver disease diagnoses using cases from "What is Your Diagnosis". J Hepatol 2025; 82:e15-e17. [PMID: 39307371 DOI: 10.1016/j.jhep.2024.09.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/10/2024] [Revised: 09/13/2024] [Accepted: 09/16/2024] [Indexed: 11/09/2024]
Affiliation(s)
- Yusheng Guo
- Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China; Hubei Key Laboratory of Molecular Imaging, Wuhan 430022, China
| | - Tianxiang Li
- Department of Ultrasound, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, 100730, China
| | - Jiao Xie
- Health Management Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
| | - Miao Luo
- Department of Infectious Diseases, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
| | - Chuansheng Zheng
- Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China; Hubei Key Laboratory of Molecular Imaging, Wuhan 430022, China.
| |
Collapse
|
36
|
Nasra M, Jaffri R, Pavlin-Premrl D, Kok HK, Khabaza A, Barras C, Slater LA, Yazdabadi A, Moore J, Russell J, Smith P, Chandra RV, Brooks M, Jhamb A, Chong W, Maingard J, Asadi H. Can artificial intelligence improve patient educational material readability? A systematic review and narrative synthesis. Intern Med J 2025; 55:20-34. [PMID: 39720869 DOI: 10.1111/imj.16607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2024] [Accepted: 11/19/2024] [Indexed: 12/26/2024]
Abstract
Enhancing patients' comprehension of their health is crucial to improving health outcomes. The integration of artificial intelligence (AI) to distil medical information into a conversational, legible format can potentially enhance health literacy. This review examines the accuracy, reliability, comprehensiveness and readability of medical patient education materials (PEMs) simplified by AI models. A systematic review was conducted searching for articles assessing outcomes of the use of AI in simplifying PEMs. Inclusion criteria were publication between January 2019 and June 2023, any modality of AI, English language, AI use in PEMs, and involvement of physicians and/or patients. An inductive thematic approach was used to code for unifying topics, which were qualitatively analysed. Twenty studies were included, and seven themes were identified (reproducibility, accessibility and ease of use, emotional support and user satisfaction, readability, data security, accuracy and reliability, and comprehensiveness). AI effectively simplified PEMs, with reproducibility rates up to 90.7% in specific domains. User satisfaction exceeded 85% for AI-generated materials. AI models showed promising readability improvements, with ChatGPT achieving 100% post-simplification readability scores. AI's performance in accuracy and reliability was mixed, with occasional lack of comprehensiveness and inaccuracies, particularly when addressing complex medical topics. AI models accurately simplified basic tasks but lacked soft skills and personalisation. These limitations can be addressed with higher-calibre models combined with prompt engineering. In conclusion, the literature reveals scope for AI to enhance patient health literacy through medical PEMs. Further refinement is needed to improve AI's accuracy and reliability, especially when simplifying complex medical information.
Collapse
Affiliation(s)
- Mohamed Nasra
- Department of Medicine, Northern Health, Melbourne, Victoria, Australia
| | - Rimsha Jaffri
- Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia
| | - Davor Pavlin-Premrl
- Department of Neurology, Austin Health, Melbourne, Victoria, Australia
- Department of Interventional Neuroradiology, Austin Health, Melbourne, Victoria, Australia
- Department of Interventional Neuroradiology, St Vincent's Health, Melbourne, Victoria, Australia
| | - Hong Kuan Kok
- Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia
- Interventional Radiology Service, Department of Radiology, Northern Health, Melbourne, Victoria, Australia
| | - Ali Khabaza
- Department of Interventional Neuroradiology, Austin Health, Melbourne, Victoria, Australia
- Department of Interventional Neuroradiology, St Vincent's Health, Melbourne, Victoria, Australia
- Department of Interventional Neuroradiology, Monash Health, Melbourne, Victoria, Australia
| | - Christen Barras
- South Australian Institute of Health and Medical Research, Adelaide, South Australia, Australia
- School of Medicine, The University of Adelaide, Adelaide, South Australia, Australia
| | - Lee-Anne Slater
- Department of Interventional Neuroradiology, Monash Health, Melbourne, Victoria, Australia
| | - Anousha Yazdabadi
- Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia
- Department of Dermatology, Eastern Health, Melbourne, Victoria, Australia
| | - Justin Moore
- Department of Neurosurgery, Monash Health, Melbourne, Victoria, Australia
| | - Jeremy Russell
- Department of Neurosurgery, Austin Health, Melbourne, Victoria, Australia
| | - Paul Smith
- Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia
- Department of Neurosurgery, St Vincent's Hospital, Melbourne, Victoria, Australia
| | - Ronil V Chandra
- Department of Interventional Neuroradiology, Monash Health, Melbourne, Victoria, Australia
| | - Mark Brooks
- Department of Interventional Neuroradiology, Austin Health, Melbourne, Victoria, Australia
| | - Ashu Jhamb
- Department of Radiology, St Vincent's Health, Melbourne, Victoria, Australia
| | - Winston Chong
- Department of Interventional Neuroradiology, Monash Health, Melbourne, Victoria, Australia
| | - Julian Maingard
- Department of Interventional Neuroradiology, Austin Health, Melbourne, Victoria, Australia
- Department of Interventional Neuroradiology, St Vincent's Health, Melbourne, Victoria, Australia
- School of Medicine, Deakin University, Waurn Ponds, Victoria, Australia
| | - Hamed Asadi
- Department of Interventional Neuroradiology, Austin Health, Melbourne, Victoria, Australia
- Department of Interventional Neuroradiology, Monash Health, Melbourne, Victoria, Australia
- School of Medicine, Deakin University, Waurn Ponds, Victoria, Australia
| |
Collapse
|
37
|
Yuan XT, Shao CY, Zhang ZZ, Qian D. Comparing the performance of ChatGPT and ERNIE Bot in answering questions regarding liver cancer interventional radiology in Chinese and English contexts: A comparative study. Digit Health 2025; 11:20552076251315511. [PMID: 39850627 PMCID: PMC11755525 DOI: 10.1177/20552076251315511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2024] [Accepted: 01/08/2025] [Indexed: 01/25/2025] Open
Abstract
Introduction This study aims to critically assess the appropriateness and limitations of two prominent large language models (LLMs), enhanced representation through knowledge integration (ERNIE Bot) and chat generative pre-trained transformer (ChatGPT), in answering questions about liver cancer interventional radiology. Through a comparative analysis, the performance of these models was evaluated based on their responses to questions about transarterial chemoembolization and hepatic arterial infusion chemotherapy in both English and Chinese contexts. Methods A total of 38 questions were developed to cover a range of topics related to transarterial chemoembolization (TACE) and hepatic arterial infusion chemotherapy (HAIC), including foundational knowledge, patient education, and treatment and care. The responses generated by ERNIE Bot and ChatGPT were rigorously evaluated by 10 professionals in liver cancer interventional radiology, with the final score determined by one seasoned clinical expert. Each response was rated on a five-point Likert scale, facilitating a quantitative analysis of the accuracy and comprehensiveness of the information provided by each language model. Results ERNIE Bot was superior to ChatGPT in the Chinese context (ERNIE Bot: 5, 89.47%; 4, 10.53%; 3, 0%; 2, 0%; 1, 0% vs ChatGPT: 5, 57.89%; 4, 5.27%; 3, 34.21%; 2, 2.63%; 1, 0%; P = 0.001). However, ChatGPT outperformed ERNIE Bot in the English context (ERNIE Bot: 5, 73.68%; 4, 2.63%; 3, 13.16%; 2, 10.53%; 1, 0% vs ChatGPT: 5, 92.11%; 4, 2.63%; 3, 5.26%; 2, 0%; 1, 0%; P = 0.026). Conclusions This study preliminarily demonstrated that ERNIE Bot and ChatGPT can effectively address questions related to liver cancer interventional radiology. However, their performance varied by language: ChatGPT excelled in English contexts, while ERNIE Bot performed better in Chinese. Choosing the LLM appropriate to the language of use helps patients obtain more accurate treatment information. Both models require manual review to ensure accuracy and reliability in practical use.
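The rating distributions above can be reconstructed from the reported percentages (38 questions per model). The abstract does not name the test behind the p-values; a Mann-Whitney U comparison of the ordinal ratings, sketched below, is one plausible approach and is only an assumption.

```python
# Sketch: comparing 5-point Likert ratings between the two chatbots.
# Counts are reconstructed from the abstract's percentages (38 questions per model);
# the choice of Mann-Whitney U is an assumption, since the abstract does not state the test.
from scipy.stats import mannwhitneyu

ernie_zh   = [5] * 34 + [4] * 4
chatgpt_zh = [5] * 22 + [4] * 2 + [3] * 13 + [2] * 1
ernie_en   = [5] * 28 + [4] * 1 + [3] * 5 + [2] * 4
chatgpt_en = [5] * 35 + [4] * 1 + [3] * 2

for label, a, b in [("Chinese", ernie_zh, chatgpt_zh), ("English", ernie_en, chatgpt_en)]:
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{label}: U = {stat:.1f}, p = {p:.4f}")
```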
Collapse
Affiliation(s)
- Xue-ting Yuan
- Department of Interventional Radiology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Chen-ye Shao
- School of Nursing, Department of Thoracic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Zhen-zhen Zhang
- Department of Interventional Radiology, The First Affiliated Hospital of Soochow University, Suzhou, China
| | - Duo Qian
- Department of Interventional Radiology, The First Affiliated Hospital of Soochow University, Suzhou, China
| |
Collapse
|
38
|
Barbosa-Silva J, Driusso P, Ferreira EA, de Abreu RM. Exploring the Efficacy of Artificial Intelligence: A Comprehensive Analysis of CHAT-GPT's Accuracy and Completeness in Addressing Urinary Incontinence Queries. Neurourol Urodyn 2025; 44:153-164. [PMID: 39390731 DOI: 10.1002/nau.25603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2024] [Revised: 09/05/2024] [Accepted: 09/25/2024] [Indexed: 10/12/2024]
Abstract
BACKGROUND Artificial intelligence models are increasingly gaining popularity among patients and healthcare professionals. While it is impossible to restrict patients' access to different sources of information on the Internet, healthcare professionals need to be aware of the quality of the content available across different platforms. OBJECTIVE To investigate the accuracy and completeness of Chat Generative Pretrained Transformer (ChatGPT) in addressing frequently asked questions related to the management and treatment of female urinary incontinence (UI), compared with recommendations from guidelines. METHODS This is a cross-sectional study. Two researchers developed 14 frequently asked questions related to UI, which were entered into the ChatGPT platform on September 16, 2023. The accuracy (scores from 1 to 5) and completeness (scores from 1 to 3) of ChatGPT's answers were assessed individually by two experienced researchers in the Women's Health field, following the recommendations proposed by the guidelines for UI. RESULTS Most of the answers were classified as "more correct than incorrect" (n = 6), followed by "more incorrect than correct" (n = 3), "approximately equal correct and incorrect" (n = 2), "nearly all correct" (n = 2), and "correct" (n = 1). Regarding completeness, most of the answers were classified as adequate, as they provided the minimum information expected to be classified as correct. CONCLUSION These results show inconsistency in the accuracy of answers generated by ChatGPT compared with scientific guidelines. Almost none of the answers contained the complete content expected or reported in the guidelines, which highlights to healthcare professionals and the scientific community a concern about using artificial intelligence in patient counseling.
Collapse
Affiliation(s)
- Jordana Barbosa-Silva
- Women's Health Research Laboratory, Physical Therapy Department, Federal University of São Carlos, São Carlos, Brazil
| | - Patricia Driusso
- Women's Health Research Laboratory, Physical Therapy Department, Federal University of São Carlos, São Carlos, Brazil
| | - Elizabeth A Ferreira
- Department of Obstetrics and Gynecology, FMUSP School of Medicine, University of São Paulo, São Paulo, Brazil
- Department of Physiotherapy, Speech Therapy and Occupational Therapy, School of Medicine, University of São Paulo, São Paulo, Brazil
| | - Raphael M de Abreu
- Department of Physiotherapy, LUNEX University, International University of Health, Exercise & Sports S.A., Differdange, Luxembourg
- LUNEX ASBL Luxembourg Health & Sport Sciences Research Institute, Differdange, Luxembourg
| |
Collapse
|
39
|
Tarris G, Martin L. Performance assessment of ChatGPT 4, ChatGPT 3.5, Gemini Advanced Pro 1.5 and Bard 2.0 to problem solving in pathology in French language. Digit Health 2025; 11:20552076241310630. [PMID: 39896270 PMCID: PMC11786284 DOI: 10.1177/20552076241310630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2024] [Accepted: 12/09/2024] [Indexed: 02/04/2025] Open
Abstract
Digital teaching diversifies the ways knowledge can be assessed, as natural language processing offers the possibility of answering questions posed by students and teachers. Objective This study evaluated the performance of ChatGPT, Bard and Gemini on second-year medical studies (DFGSM2) Pathology exams from the Health Sciences Center of Dijon (France) in 2018-2022. Methods From 2018 to 2022, exam scores, discriminating powers and discordance rates were retrieved. Seventy questions (25 first-order single-response questions and 45 second-order multiple-response questions) were submitted in May 2023 to ChatGPT 3.5 and Bard 2.0, and in September 2024 to Gemini 1.5 and ChatGPT-4. Chatbots' and students' average scores were compared, as well as the discriminating powers of the questions answered by chatbots. The percentage of student-chatbot identical answers was retrieved, and linear regression analysis correlated chatbot scores with students' discordance rates. Chatbot reliability was assessed by submitting the questions in four successive rounds and comparing score variability using Fleiss' kappa and Cohen's kappa. Results Newer chatbots outperformed both students and older chatbots on overall scores and multiple-response questions. All chatbots outperformed students on less discriminating questions. Conversely, students outperformed all chatbots on questions with high discriminating power. Chatbot scores were correlated with student discordance rates. ChatGPT 4 and Gemini 1.5 provided variable answers, owing to effects linked to prompt engineering. Conclusion Our study, in line with the literature, confirms chatbots' moderate performance on questions requiring complex reasoning, with ChatGPT outperforming Google's chatbots. The use of NLP software based on distributional semantics remains a challenge for the generation of questions in French. Drawbacks to using NLP software to generate questions include hallucinations and erroneous medical knowledge, which must be taken into account when using NLP software in medical education.
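For the reliability analysis described above (the same questions submitted in four successive rounds), a minimal sketch of a Fleiss' kappa computation with statsmodels might look like the following; the answer matrix is an illustrative placeholder, not the study's data.

```python
# Sketch: assessing answer stability across four submission rounds with Fleiss' kappa.
# Rows = questions, columns = rounds, values = chosen answer coded as an integer.
# The matrix is a made-up placeholder for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

answers = np.array([
    [0, 0, 0, 0],   # answered identically in all four rounds
    [1, 1, 2, 1],   # one divergent round
    [2, 2, 2, 2],
    [0, 1, 0, 1],   # unstable question
])
table, _ = aggregate_raters(answers)          # questions x answer-category counts
print(f"Fleiss' kappa across rounds: {fleiss_kappa(table):.2f}")
```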
Collapse
Affiliation(s)
- Georges Tarris
- Department of Pathology, University Hospital François Mitterrand of Dijon–Bourgogne, Dijon, France
- University of Burgundy Health Sciences Center, Dijon, France
| | - Laurent Martin
- Department of Pathology, University Hospital François Mitterrand of Dijon–Bourgogne, Dijon, France
- University of Burgundy Health Sciences Center, Dijon, France
| |
Collapse
|
40
|
Yeo YH, Peng Y, Mehra M, Samaan J, Hakimian J, Clark A, Suchak K, Krut Z, Andersson T, Persky S, Liran O, Spiegel B. Evaluating for Evidence of Sociodemographic Bias in Conversational AI for Mental Health Support. CYBERPSYCHOLOGY, BEHAVIOR AND SOCIAL NETWORKING 2025; 28:44-51. [PMID: 39446671 PMCID: PMC11807910 DOI: 10.1089/cyber.2024.0199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/26/2024]
Abstract
The integration of large language models (LLMs) into healthcare highlights the need to ensure their efficacy while mitigating potential harms, such as the perpetuation of biases. Current evidence on the existence of bias within LLMs remains inconclusive. In this study, we present an approach to investigate the presence of bias within an LLM designed for mental health support. We simulated physician-patient conversations by using a communication loop between an LLM-based conversational agent and digital standardized patients (DSPs) that engaged the agent in dialogue while remaining agnostic to sociodemographic characteristics. In contrast, the conversational agent was made aware of each DSP's characteristics, including age, sex, race/ethnicity, and annual income. The agent's responses were analyzed to discern potential systematic biases using the Linguistic Inquiry and Word Count tool. Multivariate regression analysis, trend analysis, and group-based trajectory models were used to quantify potential biases. Among 449 conversations, there was no evidence of bias in both descriptive assessments and multivariable linear regression analyses. Moreover, when evaluating changes in mean tone scores throughout a dialogue, the conversational agent exhibited a capacity to show understanding of the DSPs' chief complaints and to elevate the tone scores of the DSPs throughout conversations. This finding did not vary by any sociodemographic characteristics of the DSP. Using an objective methodology, our study did not uncover significant evidence of bias within an LLM-enabled mental health conversational agent. These findings offer a complementary approach to examining bias in LLM-based conversational agents for mental health support.
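A simplified sketch of the multivariable regression step, testing whether the agent's tone scores vary with DSP sociodemographic characteristics, is shown below. The variable names and the synthetic data frame are assumptions standing in for the study's LIWC-derived scores from 449 conversations.

```python
# Sketch: testing for sociodemographic effects on conversation tone scores with OLS.
# Synthetic placeholder data; column names are assumptions, not the study's schema.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 449  # number of conversations reported in the abstract
df = pd.DataFrame({
    "tone": rng.normal(60, 10, n),                       # LIWC-style tone score
    "age": rng.integers(18, 80, n),
    "sex": rng.choice(["female", "male"], n),
    "race": rng.choice(["white", "black", "asian", "hispanic"], n),
    "income": rng.choice(["<50k", "50-100k", ">100k"], n),
})
model = smf.ols("tone ~ age + C(sex) + C(race) + C(income)", data=df).fit()
print(model.summary().tables[1])  # coefficients near zero would suggest no detectable bias
```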
Collapse
Affiliation(s)
- Yee Hui Yeo
- Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, California, USA
| | - Yuxin Peng
- School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China
| | - Muskaan Mehra
- Division of Health Services Research, Cedars-Sinai Medical Center, Center for Outcomes Research and Education (CS-CORE), Los Angeles, California, USA
| | - Jamil Samaan
- Karsh Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, California, USA
| | - Joshua Hakimian
- Division of Health Services Research, Cedars-Sinai Medical Center, Center for Outcomes Research and Education (CS-CORE), Los Angeles, California, USA
| | - Allistair Clark
- Division of Health Services Research, Cedars-Sinai Medical Center, Center for Outcomes Research and Education (CS-CORE), Los Angeles, California, USA
| | - Karisma Suchak
- Division of Health Services Research, Cedars-Sinai Medical Center, Center for Outcomes Research and Education (CS-CORE), Los Angeles, California, USA
| | - Zoe Krut
- Division of Health Services Research, Cedars-Sinai Medical Center, Center for Outcomes Research and Education (CS-CORE), Los Angeles, California, USA
| | - Taiga Andersson
- Department of Cardiology, Smidt Heart Institute, Cedars-Sinai Medical Center, Los Angeles, California, USA
| | - Susan Persky
- Social and Behavioral Research Branch, National Human Genome Research Institute, Bethesda, Maryland, USA
| | - Omer Liran
- Division of Health Services Research, Cedars-Sinai Medical Center, Center for Outcomes Research and Education (CS-CORE), Los Angeles, California, USA
| | - Brennan Spiegel
- Division of Health Services Research, Cedars-Sinai Medical Center, Center for Outcomes Research and Education (CS-CORE), Los Angeles, California, USA
| |
Collapse
|
41
|
Samaan JS, Issokson K, Feldman E, Fasulo C, Rajeev N, Ng WH, Hollander B, Yeo YH, Vasiliauskas E. Examining the Accuracy and Reproducibility of Responses to Nutrition Questions Related to Inflammatory Bowel Disease by Generative Pre-trained Transformer-4. CROHN'S & COLITIS 360 2025; 7:otae077. [PMID: 40078587 PMCID: PMC11897593 DOI: 10.1093/crocol/otae077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/05/2024] [Indexed: 03/14/2025] Open
Abstract
Background Generative pre-trained transformer-4 (GPT-4) is a large language model (LLM) trained on a vast corpus of data, including the medical literature. Nutrition plays an important role in managing inflammatory bowel disease (IBD), with an unmet need for nutrition-related patient education resources. This study examines the accuracy, comprehensiveness, and reproducibility of responses by GPT-4 to patient nutrition questions related to IBD. Methods Questions were obtained from adult IBD clinic visits, Facebook, and Reddit. Two IBD-focused registered dietitians independently graded the accuracy and reproducibility of GPT-4's responses, while a third senior IBD-focused registered dietitian arbitrated. Each question was inputted twice into the model. Results In total, 88 questions were selected. The model correctly responded to 73/88 questions (83.0%), with 61 (69.0%) graded as comprehensive. Fifteen of the 88 responses (17%) were graded as a mixture of correct and incorrect/outdated data. The model comprehensively responded to 10 (62.5%) questions related to "Nutrition and diet needs for surgery," 12 (92.3%) "Tube feeding and parenteral nutrition," 11 (64.7%) "General diet questions," 10 (50%) "Diet for reducing symptoms/inflammation," and 18 (81.8%) "Micronutrients/supplementation needs." The model provided reproducible responses to 81/88 (92.0%) questions. Conclusions GPT-4 comprehensively answered most questions, demonstrating the promising potential of LLMs as supplementary tools for IBD patients seeking nutrition-related information. However, 17% of responses contained incorrect information, highlighting the need for continuous refinement prior to incorporation into clinical practice. Future studies should emphasize leveraging LLMs to enhance patient outcomes and promote patient and healthcare professional proficiency in using LLMs to maximize their efficacy.
Collapse
Affiliation(s)
- Jamil S Samaan
- Department of Medicine, Karsh Division of Digestive and Liver Diseases, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Kelly Issokson
- Department of Medicine, Karsh Division of Digestive and Liver Diseases, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Erin Feldman
- Department of Medicine, Karsh Division of Digestive and Liver Diseases, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Christina Fasulo
- Department of Medicine, Karsh Division of Digestive and Liver Diseases, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Nithya Rajeev
- Keck School of Medicine of USC, Los Angeles, CA, USA
| | - Wee Han Ng
- Bristol Medical School, University of Bristol, Bristol, UK
| | - Barbara Hollander
- Department of Medicine, Karsh Division of Digestive and Liver Diseases, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Yee Hui Yeo
- Department of Medicine, Karsh Division of Digestive and Liver Diseases, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Eric Vasiliauskas
- Department of Medicine, Karsh Division of Digestive and Liver Diseases, Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| |
Collapse
|
42
|
Zitu MM, Le TD, Duong T, Haddadan S, Garcia M, Amorrortu R, Zhao Y, Rollison DE, Thieu T. Large language models in cancer: potentials, risks, and safeguards. BJR ARTIFICIAL INTELLIGENCE 2025; 2:ubae019. [PMID: 39777117 PMCID: PMC11703354 DOI: 10.1093/bjrai/ubae019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 10/26/2024] [Accepted: 12/09/2024] [Indexed: 01/11/2025]
Abstract
This review examines the use of large language models (LLMs) in cancer, analysing articles sourced from PubMed, Embase, and Ovid Medline and published between 2017 and 2024. Our search strategy included terms related to LLMs, cancer research, risks, safeguards, and ethical issues, focusing on studies that utilized text-based data. Fifty-nine articles were included in the review, categorized into three segments: quantitative studies on LLMs, chatbot-focused studies, and qualitative discussions on LLMs in cancer. Quantitative studies highlight LLMs' advanced capabilities in natural language processing (NLP), while chatbot-focused articles demonstrate their potential in clinical support and data management. Qualitative research underscores the broader implications of LLMs, including the risks and ethical considerations. Our findings suggest that LLMs, notably ChatGPT, have potential in data analysis, patient interaction, and personalized treatment in cancer care. However, the review identifies critical risks, including data biases and ethical challenges. We emphasize the need for regulatory oversight, targeted model development, and continuous evaluation. In conclusion, integrating LLMs into cancer research offers promising prospects but necessitates a balanced approach focusing on accuracy, ethical integrity, and data privacy. This review underscores the need for further study, encouraging responsible exploration and application of artificial intelligence in oncology.
Collapse
Affiliation(s)
- Md Muntasir Zitu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Tuan Dung Le
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Duong
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Shohreh Haddadan
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Melany Garcia
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Rossybelle Amorrortu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Yayi Zhao
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Dana E Rollison
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Thieu
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| |
Collapse
|
43
|
Elhakim T, Brea AR, Fidelis W, Paravastu SS, Malavia M, Omer M, Mort A, Ramasamy SK, Tripathi S, Dezube M, Smolinski-Zhao S, Daye D. Enhanced PROcedural Information READability for Patient-Centered Care in Interventional Radiology With Large Language Models (PRO-READ IR). J Am Coll Radiol 2025; 22:84-97. [PMID: 39216782 DOI: 10.1016/j.jacr.2024.08.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2024] [Revised: 08/15/2024] [Accepted: 08/19/2024] [Indexed: 09/04/2024]
Abstract
PURPOSE To evaluate the extent to which Generative Pre-trained Transformer 4 (GPT-4) can educate patients by generating easily understandable information about the most common interventional radiology (IR) procedures. MATERIALS AND METHODS We reviewed 10 IR procedures and prepared prompts for GPT-4 to provide patient educational instructions about each procedure in layman's terms. The instructions were then evaluated by four clinical physicians and nine nonclinical assessors to determine their clinical appropriateness, understandability, and clarity using a survey. A grade-level readability assessment was performed using validated metrics to evaluate accessibility to a wide patient population. The same procedures were also evaluated using the patient instructions available at radiologyinfo.org and compared with the GPT-generated instructions using a paired t test. RESULTS Evaluation by four clinical physicians showed that nine GPT-generated instructions were fully appropriate, whereas the arterial embolization instructions were somewhat appropriate. Evaluation by nine nonclinical assessors showed that the paracentesis, dialysis catheter placement, thrombectomy, ultrasound-guided biopsy, and nephrostomy-tube instructions were rated excellent by 57% and good by 43%. The arterial embolization and biliary-drain instructions were rated excellent by 28.6% and good by 71.4%. In contrast, the thoracentesis, port placement, and CT-guided biopsy instructions received 43% excellent, 43% good, and 14% fair ratings. The readability assessment across all procedural instructions showed a better Flesch-Kincaid mean grade for GPT-4 instructions compared with radiologyinfo.org (7.8 ± 0.87 versus 9.6 ± 0.83; P = .007), indicating excellent readability at the 7th- to 8th-grade level compared with the 9th to 10th grade. Additionally, there was a lower Gunning Fog mean index (10.4 ± 1.2 versus 12.7 ± 0.93; P = .006) and a higher Flesch Reading Ease mean score (69.4 ± 4.8 versus 51.3 ± 3.9; P = .0001), indicating better readability. CONCLUSION IR procedural instructions generated by GPT-4 can aid in improving health literacy and patient-centered care in IR by generating easily understandable explanations.
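The readability comparison described above can be reproduced in outline with the textstat package and a paired t-test, as in the sketch below; the two text lists are placeholders standing in for the ten GPT-4 and radiologyinfo.org instruction sets.

```python
# Sketch: scoring readability of paired instruction texts and comparing them with a
# paired t-test, in the spirit of the Flesch-Kincaid / Gunning Fog / Reading Ease analysis.
# The texts are placeholders; in the study there would be one pair per procedure (n = 10).
import textstat
from scipy.stats import ttest_rel

gpt_texts = [
    "Plain-language GPT-4 instructions for the first procedure go here.",
    "Plain-language GPT-4 instructions for the second procedure go here.",
]
reference_texts = [
    "Reference patient instructions for the first procedure go here.",
    "Reference patient instructions for the second procedure go here.",
]

gpt_grades = [textstat.flesch_kincaid_grade(t) for t in gpt_texts]
ref_grades = [textstat.flesch_kincaid_grade(t) for t in reference_texts]

t_stat, p = ttest_rel(gpt_grades, ref_grades)  # paired across the same procedures
print(f"mean grade: GPT {sum(gpt_grades)/len(gpt_grades):.1f} "
      f"vs reference {sum(ref_grades)/len(ref_grades):.1f}, p = {p:.3f}")
```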
Collapse
Affiliation(s)
- Tarig Elhakim
- Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania; Massachusetts General Hospital, Boston, Massachusetts.
| | - Allison R Brea
- Tufts University School of Medicine, Boston, Massachusetts
| | - Wilton Fidelis
- Georgetown University School of Medicine, Washington, DC
| | - Sriram S Paravastu
- University of Missouri, Kansas City School of Medicine, Kansas City, Missouri
| | - Mira Malavia
- University of Missouri, Kansas City School of Medicine, Kansas City, Missouri
| | - Mustafa Omer
- The James Cook University Hospital, Middlesbrough, United Kingdom
| | - Ana Mort
- Saint Louis University School of Medicine, St Louis, Missouri
| | | | - Satvik Tripathi
- Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania
| | | | - Sara Smolinski-Zhao
- Associate Program Director of the Interventional Radiology Residency, Massachusetts General Hospital, Boston, Massachusetts; Harvard Medical School, Boston, Massachusetts
| | - Dania Daye
- Harvard Medical School, Boston, Massachusetts; Massachusetts General Hospital, Boston, Massachusetts; IR Division Quality Director and Co-Director of IR Research and also Director of Precision Interventional and Medical Imaging Lab at the Division of Vascular and Interventional Radiology, Massachusetts General Hospital
| |
Collapse
|
44
|
Kinikoglu O, Isik D. Evaluating the Performance of ChatGPT-4o Oncology Expert in Comparison to Standard Medical Oncology Knowledge: A Focus on Treatment-Related Clinical Questions. Cureus 2025; 17:e78076. [PMID: 39872919 PMCID: PMC11771770 DOI: 10.7759/cureus.78076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/27/2025] [Indexed: 01/30/2025] Open
Abstract
Integrating artificial intelligence (AI) into oncology can revolutionize decision-making by providing accurate information. This study evaluates the performance of ChatGPT-4o (OpenAI, San Francisco, CA) Oncology Expert in addressing open-ended clinical oncology questions. Thirty-seven treatment-related questions on solid organ tumors were selected from a hematology-oncology textbook. Responses from the ChatGPT-4o Oncology Expert and the textbook were anonymized and independently evaluated by two medical oncologists using a structured scoring system focused on accuracy and clinical justification. Statistical analysis, including paired t-tests, was conducted to compare scores, and interrater reliability was assessed using Cohen's kappa. The Oncology Expert achieved a significantly higher average score of 7.83 compared with the textbook's 7.0 (p < 0.01). In 10 cases, the Oncology Expert provided more accurate and updated answers, demonstrating its ability to integrate recent medical knowledge. In 26 cases, both sources provided equally relevant answers, but the Oncology Expert's responses were clearer and easier to understand. Cohen's kappa indicated almost perfect agreement (κ = 0.93). Both sources included outdated information on bladder cancer treatment, underscoring the need for regular updates. ChatGPT-4o Oncology Expert shows significant potential as a clinical tool in oncology by offering precise, up-to-date, and user-friendly responses. It could transform oncology practice by enhancing decision-making efficiency, improving educational tools, and serving as a reliable adjunct to clinical workflows. However, its integration requires regular updates, expert validation, and a collaborative approach to ensure reliability and relevance in the rapidly evolving field of oncology.
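A minimal sketch of the two statistical steps named above, the paired t-test on per-question scores and Cohen's kappa between the two oncologist raters, is given below with placeholder scores rather than the study's data.

```python
# Sketch: paired comparison of per-question scores (model vs. textbook) and rater agreement.
# All score lists are made-up placeholders; the study scored 37 questions.
from scipy.stats import ttest_rel
from sklearn.metrics import cohen_kappa_score

model_scores    = [8, 9, 7, 8, 8, 9, 7, 8]
textbook_scores = [7, 8, 7, 7, 8, 8, 6, 7]
t_stat, p = ttest_rel(model_scores, textbook_scores)
print(f"paired t-test: t = {t_stat:.2f}, p = {p:.3f}")

rater1 = [8, 9, 7, 8, 8, 9, 7, 8]
rater2 = [8, 9, 7, 7, 8, 9, 7, 8]
print(f"Cohen's kappa between raters: {cohen_kappa_score(rater1, rater2):.2f}")
```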
Collapse
Affiliation(s)
- Oguzcan Kinikoglu
- Medical Oncology, Kartal Dr. Lütfi Kirdar City Hospital, Health Science University, Istanbul, TUR
| | - Deniz Isik
- Medical Oncology, Kartal Dr. Lütfi Kirdar City Hospital, Health Science University, Istanbul, TUR
| |
Collapse
|
45
|
Xia S, Hua Q, Mei Z, Xu W, Lai L, Wei M, Qin Y, Luo L, Wang C, Huo S, Fu L, Zhou F, Wu J, Zhang L, Lv D, Li J, Wang X, Li N, Song Y, Zhou J. Clinical application potential of large language model: a study based on thyroid nodules. Endocrine 2025; 87:206-213. [PMID: 39080210 DOI: 10.1007/s12020-024-03981-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 03/21/2024] [Accepted: 07/23/2024] [Indexed: 01/19/2025]
Abstract
BACKGROUND Limited data exist on the performance of large language models (LLMs) taking on the role of doctors. We aimed to investigate the potential of ChatGPT-3.5 and New Bing Chat to act as doctors, using thyroid nodules as an example. METHODS A total of 145 patients with thyroid nodules were included for generating questions. Each question was entered into the ChatGPT-3.5 and New Bing Chat chatbots five times, and five responses were acquired from each. These responses were compared with answers given by five junior doctors, while responses from five senior doctors were regarded as the gold standard. The accuracy and reproducibility of responses from ChatGPT-3.5 and New Bing Chat were evaluated. RESULTS The accuracy of ChatGPT-3.5 and New Bing Chat in answering Q2, Q3 and Q5 was lower than that of junior doctors (all P < 0.05), while both LLMs were comparable to junior doctors when answering Q4 and Q6. In terms of "high reproducibility and accuracy", ChatGPT-3.5 outperformed New Bing Chat on Q1 and Q5 (P < 0.001 and P = 0.008, respectively), but showed no significant difference on Q2, Q3, Q4, and Q6 (P > 0.05 for all). New Bing Chat achieved higher accuracy than ChatGPT-3.5 in decision making for thyroid nodules (72.41% vs 58.62%; P = 0.003), and both were less accurate than junior doctors (89.66%; P < 0.001 for both). CONCLUSIONS This exploration of ChatGPT-3.5 and New Bing Chat in the diagnosis and management of thyroid nodules illustrates that LLMs currently demonstrate potential for medical applications but do not yet reach the clinical decision-making capacity of doctors.
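A sketch of how accuracy against the senior doctors' gold standard and reproducibility across the five repeated submissions could be tallied is shown below; reducing each response to a categorical management decision, and the example data themselves, are assumptions for illustration.

```python
# Sketch: tallying accuracy against a gold standard and reproducibility over five runs.
# Responses are assumed to be reduced to categorical decisions; data are placeholders.
from collections import Counter

def accuracy_and_reproducibility(responses_per_question, gold_answers, min_consistent=5):
    """responses_per_question: list of 5-response lists; gold_answers: one label per question."""
    accurate = reproducible = 0
    for responses, gold in zip(responses_per_question, gold_answers):
        top_answer, top_count = Counter(responses).most_common(1)[0]
        accurate += (top_answer == gold)                # majority answer matches gold standard
        reproducible += (top_count >= min_consistent)   # all five runs agree
    n = len(gold_answers)
    return accurate / n, reproducible / n

runs = [["surgery"] * 5, ["follow-up"] * 4 + ["biopsy"], ["biopsy"] * 5]
gold = ["surgery", "follow-up", "ablation"]
print(accuracy_and_reproducibility(runs, gold))  # e.g. (0.67, 0.67)
```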
Collapse
Affiliation(s)
- Shujun Xia
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Qing Hua
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Zihan Mei
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Wenwen Xu
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Limei Lai
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Minyan Wei
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Yu Qin
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Lin Luo
- Department of Endocrinology, Kongjiang Hospital, Yangpu District, Shanghai, China
| | - Changhua Wang
- Department of Thyroid and Breast Surgery, Xianning NO.1 People's Hospital, Xianning, China
| | - ShengNan Huo
- Department of Thyroid, Handan Hangang Hospital, Hanshan District, Handan City, Hebei, China
| | - Lijun Fu
- Department of Thyroid Surgery, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Feidu Zhou
- Thyroid and Breast Surgery, LiuYang People's Hospital, Changsha, China
| | - Jiang Wu
- Department of Thyroid, Breast and Vascular Surgery, Xijing Hospital, The Fourth Military Medical University, Xi'an, Shanxi, China
| | - Li Zhang
- Department of Head and Neck Surgery, Shanxi Province Cancer Hospital, Taiyuan, China
| | - De Lv
- Department of Endocrinology, Hospital of Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Jianxin Li
- Department of Surgery, Mazhanghuiwen Hospital, Ma Zhang District, Zhanjiang, Guangdong, China
| | - Xin Wang
- Endocrine Department, Lianshui People's Hospital, Huaian, Jiangsu, China
| | - Ning Li
- Department of Ultrasound, Anning First People's Hospital, Affiliated to Kunming University of Science and Technology, Anning, Yunnan Province, China
| | - Yanyan Song
- Department of Biostatistics, Institute of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Jianqiao Zhou
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
- College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
| |
Collapse
|
46
|
Pandya S, Bresler TE, Wilson T, Htway Z, Fujita M. Decoding the NCCN Guidelines With AI: A Comparative Evaluation of ChatGPT-4.0 and Llama 2 in the Management of Thyroid Carcinoma. Am Surg 2025; 91:94-98. [PMID: 39136578 DOI: 10.1177/00031348241269430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2024]
Abstract
INTRODUCTION Artificial Intelligence (AI) has emerged as a promising tool in the delivery of health care. ChatGPT-4.0 (OpenAI, San Francisco, California) and Llama 2 (Meta, Menlo Park, CA) have each gained attention for their use in various medical applications. OBJECTIVE This study aims to evaluate and compare the effectiveness of ChatGPT-4.0 and Llama 2 in assisting with complex clinical decision making in the diagnosis and treatment of thyroid carcinoma. PARTICIPANTS We reviewed the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for the management of thyroid carcinoma and formulated up to 3 complex clinical questions for each decision-making page. ChatGPT-4.0 and Llama 2 were queried in a reproducible manner. The answers were scored on a Likert scale: 5) correct; 4) correct, with missing information requiring clarification; 3) correct, but unable to complete the answer; 2) partially incorrect; 1) absolutely incorrect. Score frequencies were compared, and subgroup analysis was conducted on Correctness (defined as scores 1-2 vs 3-5) and Accuracy (scores 1-3 vs 4-5). RESULTS In total, 58 pages of the NCCN Guidelines® were analyzed, generating 167 unique questions. There was no statistically significant difference between ChatGPT-4.0 and Llama 2 in terms of overall score (Mann-Whitney U-test; mean rank = 160.53 vs 174.47, P = 0.123), Correctness (P = 0.177), or Accuracy (P = 0.891). CONCLUSION ChatGPT-4.0 and Llama 2 demonstrate a limited but substantial capacity to assist with complex clinical decision making relating to the management of thyroid carcinoma, with no significant difference in their effectiveness.
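The subgroup analysis described above dichotomizes the Likert scores into two-level Correctness and Accuracy groups and compares the models on the resulting counts. The abstract reports the subgroup p-values without naming the test; the sketch below uses Fisher's exact test on a two-by-two table of placeholder counts as one reasonable choice, not the study's actual method.

```python
# Sketch: dichotomizing 5-point Likert scores and testing a 2x2 table between models.
# The cutoff mirrors the abstract's "Accuracy" split (scores 4-5 vs 1-3); the score lists
# are placeholders and Fisher's exact test is an assumption about the method.
from scipy.stats import fisher_exact

def dichotomize(scores, cutoff=4):
    high = sum(s >= cutoff for s in scores)
    return high, len(scores) - high

chatgpt_scores = [5, 4, 3, 2, 5, 5, 4, 1, 3, 5]
llama_scores   = [4, 3, 3, 2, 5, 4, 2, 2, 3, 4]

table = [dichotomize(chatgpt_scores), dichotomize(llama_scores)]
odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.3f}")
```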
Collapse
Affiliation(s)
- Shivam Pandya
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Tamir E Bresler
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Tyler Wilson
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Zin Htway
- Department of Laboratory, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Manabu Fujita
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
- General Surgical Associates, Thousand Oaks, CA, USA
| |
Collapse
|
47
|
Yu Y, Zhao Q, Zhang Y, Lin J, Wang H. Assessing the Performance of ChatGPT's Responses to Questions Related to Atopic Dermatitis. Dermatitis 2025; 36:e105-e107. [PMID: 38783508 PMCID: PMC11840056 DOI: 10.1089/derm.2024.0098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/25/2024]
Affiliation(s)
- Yan Yu
- Department of Dermatovenereology, Tianjin Medical University General Hospital/Tianjin Institute of Sexually Transmitted Disease, Tianjin, China
| | - Qian Zhao
- Department of Dermatovenereology, Tianjin Medical University General Hospital/Tianjin Institute of Sexually Transmitted Disease, Tianjin, China
| | - Yiming Zhang
- Department of Dermatovenereology, Tianjin Medical University General Hospital/Tianjin Institute of Sexually Transmitted Disease, Tianjin, China
| | - JinRu Lin
- Department of Dermatovenereology, Tianjin Medical University General Hospital/Tianjin Institute of Sexually Transmitted Disease, Tianjin, China
| | - Huiping Wang
- Department of Dermatovenereology, Tianjin Medical University General Hospital/Tianjin Institute of Sexually Transmitted Disease, Tianjin, China
| |
Collapse
|
48
|
Anees M, Shaikh FA, Shaikh H, Siddiqui NA, Rehman ZU. Assessing the quality of ChatGPT's responses to questions related to radiofrequency ablation for varicose veins. J Vasc Surg Venous Lymphat Disord 2025; 13:101985. [PMID: 39332626 PMCID: PMC11764857 DOI: 10.1016/j.jvsv.2024.101985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2024] [Revised: 09/16/2024] [Accepted: 09/17/2024] [Indexed: 09/29/2024]
Abstract
OBJECTIVE This study aimed to evaluate the accuracy and reproducibility of information provided by ChatGPT in response to frequently asked questions about radiofrequency ablation (RFA) for varicose veins. METHODS This cross-sectional study was conducted at The Aga Khan University Hospital, Karachi, Pakistan. A set of 18 frequently asked questions regarding RFA for varicose veins was compiled from credible online sources and presented to ChatGPT twice, separately, using the new chat option. Twelve experienced vascular surgeons (with >2 years of experience and ≥20 RFA procedures performed annually) independently evaluated the accuracy of the responses using a 4-point Likert scale and assessed their reproducibility. RESULTS Most evaluators were male (n = 10/12 [83.3%]), with an average of 12.3 ± 6.2 years of experience as vascular surgeons. Six evaluators (50%) were from the UK, followed by three from Saudi Arabia (25.0%), two from Pakistan (16.7%), and one from the United States (8.3%). Among the 216 accuracy grades, most evaluators graded the responses as comprehensive (n = 87/216 [40.3%]) or accurate but insufficient (n = 70/216 [32.4%]), whereas only 17.1% (n = 37/216) were graded as a mixture of accurate and inaccurate information and 10.8% (n = 22/216) as entirely inaccurate. Overall, 89.8% of the responses (n = 194/216) were deemed reproducible. Of the total responses, 70.4% (n = 152/216) were classified as good quality and reproducible. The remaining responses were of poor quality, with 19.4% reproducible (n = 42/216) and 10.2% nonreproducible (n = 22/216). Inter-rater disagreement among the vascular surgeons for overall responses was not significant (Fleiss' kappa, -0.028; P = .131). CONCLUSIONS ChatGPT provided generally accurate and reproducible information on RFA for varicose veins; however, variability in response quality and limited inter-rater reliability highlight the need for further improvements. Although it has the potential to enhance patient education and support healthcare decision-making, improvements in its training, validation, transparency, and mechanisms to address inaccurate or incomplete information are essential.
Collapse
Affiliation(s)
- Muhammad Anees
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Fareed Ahmed Shaikh
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan.
| | | | - Nadeem Ahmed Siddiqui
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Zia Ur Rehman
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| |
Collapse
|
49
|
Ventresca HC, Davis HT, Gauthier CW, Kung J, Park JS, Strasser NL, Gonzalez TA, Jackson JB. ChatGPT-4 Effectively Responds to Common Patient Questions on Total Ankle Arthroplasty: A Surgeon-Based Assessment of AI in Patient Education. FOOT & ANKLE ORTHOPAEDICS 2025; 10:24730114251322784. [PMID: 40160854 PMCID: PMC11951880 DOI: 10.1177/24730114251322784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/02/2025] Open
Abstract
Background Patient reliance on internet resources for clinical information has steadily increased. The recent widespread accessibility of artificial intelligence (AI) tools like ChatGPT has increased patient reliance on these resources while also raising concerns about the accuracy, reliability, and appropriateness of the information they provide. Previous studies have evaluated ChatGPT and found that it could accurately respond to questions on common surgeries, such as total hip arthroplasty, but it remains untested for uncommon procedures like total ankle arthroplasty (TAA). This study evaluates ChatGPT-4's performance in answering patient questions on TAA and further explores the opportunity for physician involvement in guiding the implementation of this technology. Methods Twelve commonly asked patient questions regarding TAA were collated from established sources and posed to ChatGPT-4 without additional input. Four fellowship-trained surgeons independently rated the responses using a 1-4 scale, assessing accuracy and the need for clarification. Interrater reliability, divergence, and trends in response content were analyzed to evaluate consistency across responses. Results The mean score across all responses was 1.8, indicating overall satisfactory performance by ChatGPT-4. Ratings were consistently good on factual questions, such as infection risk and success rates, whereas questions requiring nuanced information, such as postoperative protocols and prognosis, received poorer ratings. Significant variability was observed among surgeons' ratings and between questions, reflecting differences in interpretation and expectations. Conclusion ChatGPT-4 demonstrates potential to reliably provide discrete information for uncommon procedures such as TAA, but it lacks the capability to effectively respond to questions requiring patient- or surgeon-specific insight. This limitation, paired with the growing reliance on AI, highlights the need for AI tools tailored to specific clinical practices to enhance accuracy and relevance in patient education.
Collapse
Affiliation(s)
- Heidi C. Ventresca
- Department of Orthopaedic Surgery, School of Medicine, University of South Carolina, Columbia, SC, USA
| | - Harley T. Davis
- Department of Orthopaedic Surgery, Prisma Health–Midlands, Columbia, SC, USA
| | - Chase W. Gauthier
- Department of Orthopaedic Surgery, School of Medicine, University of South Carolina, Columbia, SC, USA
| | - Justin Kung
- Department of Orthopaedic Surgery, School of Medicine, University of South Carolina, Columbia, SC, USA
| | - Joseph S. Park
- Department of Orthopaedic Surgery, School of Medicine, University of Virginia, Charlottesville, VA, USA
| | - Nicholas L. Strasser
- Department of Orthopaedic Surgery, School of Medicine, Vanderbilt University, Nashville, TN, USA
| | - Tyler A. Gonzalez
- Department of Orthopaedic Surgery, Prisma Health–Midlands, Columbia, SC, USA
| | - J. Benjamin Jackson
- Department of Orthopaedic Surgery, School of Medicine, University of South Carolina, Columbia, SC, USA
| |
Collapse
|
50
|
Li J, Gao X, Dou T, Gao Y, Li X, Zhu W. Quantitative evaluation of GPT-4's performance on US and Chinese osteoarthritis treatment guideline interpretation and orthopaedic case consultation. BMJ Open 2024; 14:e082344. [PMID: 39806703 PMCID: PMC11749315 DOI: 10.1136/bmjopen-2023-082344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 11/28/2024] [Indexed: 01/16/2025] Open
Abstract
OBJECTIVES To evaluate GPT-4's performance in interpreting osteoarthritis (OA) treatment guidelines from the USA and China, and to assess its ability to diagnose and manage orthopaedic cases. SETTING The study was conducted using publicly available OA treatment guidelines and simulated orthopaedic case scenarios. PARTICIPANTS No human participants were involved. The evaluation focused on GPT-4's responses to clinical guidelines and case questions, assessed by two orthopaedic specialists. OUTCOMES Primary outcomes included the accuracy and completeness of GPT-4's responses to guideline-based queries and case scenarios. Metrics included the correct match rate, completeness score and stratification of case responses into predefined tiers of correctness. RESULTS In interpreting the American Academy of Orthopaedic Surgeons and Chinese OA guidelines, GPT-4 achieved a correct match rate of 46.4% and complete agreement with all score-2 recommendations. The accuracy score for guideline interpretation was 4.3±1.6 (95% CI 3.9 to 4.7), and the completeness score was 2.8±0.6 (95% CI 2.5 to 3.1). For case-based questions, GPT-4 demonstrated high performance, with over 88% of responses rated as comprehensive. CONCLUSIONS GPT-4 demonstrates promising capabilities as an auxiliary tool in orthopaedic clinical practice and patient education, with high levels of accuracy and completeness in guideline interpretation and clinical case analysis. However, further validation is necessary to establish its utility in real-world clinical settings.
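Results such as "4.3 ± 1.6 (95% CI 3.9 to 4.7)" above can be computed from per-item ratings as in the brief sketch below; the score vector is a placeholder, not the study's data.

```python
# Sketch: mean, standard deviation, and 95% confidence interval for per-item ratings.
# The scores are made-up placeholders for illustration.
import numpy as np
from scipy import stats

scores = np.array([5, 4, 3, 5, 2, 4, 5, 1, 5, 4, 3, 5, 5, 2, 4])  # e.g., per-recommendation accuracy ratings
mean = scores.mean()
sem = stats.sem(scores)
ci_low, ci_high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"{mean:.1f} ± {scores.std(ddof=1):.1f} (95% CI {ci_low:.1f} to {ci_high:.1f})")
```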
Collapse
Affiliation(s)
- Juntan Li
- Jinzhou Medical University, Jinzhou, Liaoning, China
- The First Affiliated Hospital of China Medical University, Shenyang, Liaoning, China
| | - Xiang Gao
- Department of Orthopedics, Fourth Affiliated Hospital of China Medical University, Shenyang, Liaoning, China
| | - Tianxu Dou
- Department of Orthopedics, The First Hospital of China Medical University, Shenyang, China
| | - Yuyang Gao
- Department of Orthopedics, The First Hospital of China Medical University, Shenyang, China
| | - Xu Li
- Department of Orthopedics, Fourth Affiliated Hospital of China Medical University, Shenyang, Liaoning, China
| | - Wannan Zhu
- Jinzhou Medical University, Jinzhou, Liaoning, China
| |
Collapse
|