Editorial Open Access
Copyright ©The Author(s) 2025. Published by Baishideng Publishing Group Inc. All rights reserved.
World J Gastrointest Oncol. Dec 15, 2025; 17(12): 114341
Published online Dec 15, 2025. doi: 10.4251/wjgo.v17.i12.114341
Cost vs clinical utility on application of large language models in clinical practice: A double-edged sword
Sunny Chi Lik Au, School of Clinical Medicine, The University of Hong Kong, Hong Kong 999077, China
ORCID number: Sunny Chi Lik Au (0000-0002-5849-3317).
Author contributions: Au SCL drafted the manuscript, acquired and analyzed the data, and revised the manuscript.
Conflict-of-interest statement: The author reports no relevant conflicts of interest for this article.
Open Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/Licenses/by-nc/4.0/
Corresponding author: Sunny Chi Lik Au, Chief Physician, Clinical Assistant Professor (Honorary), Research Fellow, School of Clinical Medicine, The University of Hong Kong, 9/F, MO Office, Lo Ka Chow Memorial Ophthalmic Centre, No. 19 Eastern Hospital Road, Causeway Bay, Hong Kong 999077, China. kilihcua@gmail.com
Received: September 16, 2025
Revised: September 26, 2025
Accepted: October 27, 2025
Published online: December 15, 2025
Processing time: 86 Days and 0.2 Hours

Abstract

As large language models increasingly permeate medical workflows, a recent study evaluating ChatGPT 4.0’s performance in addressing patient queries about endoscopic submucosal dissection and endoscopic mucosal resection offers critical insights into three domains: Performance parity, cost democratization, and clinical readiness. The findings highlight ChatGPT’s high accuracy, completeness, and comprehensibility, suggesting potential cost efficiency in patient education. Yet, cost-effectiveness alone does not ensure clinical utility. Notably, the study relied exclusively on text-based prompts, omitting multimodal data such as photographs or endoscopic scans. This is a significant limitation in a visually driven field like endoscopy, where large language model performance may drop precipitously without image context. Without multimodal integration, artificial intelligence tools risk failing to capture key diagnostic signals, underscoring the need for cautious adoption and robust validation in clinical practice.

Key Words: Large language models; Artificial intelligence; ChatGPT; Cost; Patient education; Endoscopic submucosal dissection; Endoscopic mucosal resection

Core Tip: As large language models increasingly permeate medical workflows, this study offers insight into three areas: Performance parity, cost democratization, and clinical readiness. One compelling finding is the potential for cost efficiency, yet cost-effectiveness alone does not ensure clinical utility. Notably, the study relied exclusively on text-based prompts, omitting multimodal data such as photographs or scans. This is an important limitation in a visually driven domain like endoscopy, where large language model performance can drop precipitously when deprived of image context. Without multimodal integration, artificial intelligence tools risk failing to capture key diagnostic signals.



INTRODUCTION

The integration of large language models (LLMs) like ChatGPT into healthcare represents a transformative opportunity to enhance patient education and streamline clinical workflows. A recent study by Wang et al[1] evaluated ChatGPT 4.0’s ability to answer 30 patient-centered questions about endoscopic submucosal dissection (ESD) and endoscopic mucosal resection (EMR), procedures critical for managing colorectal polyps and early gastric cancer. The study’s findings - high accuracy, substantial completeness, and strong comprehensibility - position LLMs as potential tools for cost-effective patient support. However, the promise of cost democratization is tempered by limitations in clinical utility, particularly the absence of multimodal data integration. This editorial explores the dual nature of LLMs as a “double-edged sword” in endoscopy, balancing cost efficiency against clinical readiness, and calls for a measured approach to their deployment.

PERFORMANCE PARITY: A STEP TOWARD RELIABILITY

The study demonstrates that ChatGPT 4.0 outperforms traditional search engines like Google, achieving a mean accuracy score of 5.14/6 and high reproducibility across repeated queries. Questions on preoperative preparation, postoperative pathology, and family support scored particularly well, suggesting LLMs can deliver reliable, patient-friendly information for routine queries. This performance parity with expert knowledge, as validated by gastroenterologists with over 20 years of experience, indicates that LLMs could reduce the burden on healthcare providers by addressing common patient concerns. However, lower scores on complex topics, such as surgical indications or insurance policies, reveal gaps in the model’s ability to handle nuanced or region-specific medical contexts. These inconsistencies highlight that while LLMs may approach expert-level performance in some areas, they are not yet a substitute for clinical judgment.
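To make the aggregation concrete, the following is a minimal sketch (not the authors' analysis code; the ratings are invented placeholders) of how a mean accuracy score on a 1-6 scale and a crude reproducibility measure across repeated queries could be computed:

```python
# Illustrative aggregation of expert ratings of LLM answers.
# The ratings below are invented placeholders, not data from the study.
from statistics import mean

# Each question is rated on a 1-6 accuracy scale, once per repeated query run.
ratings_per_question = {
    "preoperative_preparation": [6, 6, 6],
    "postoperative_pathology": [5, 6, 5],
    "surgical_indications": [4, 5, 4],  # complex topics tend to score lower
}

# Mean accuracy across all ratings (cf. the reported 5.14/6).
all_ratings = [score for runs in ratings_per_question.values() for score in runs]
print(f"Mean accuracy: {mean(all_ratings):.2f}/6")

# Crude reproducibility: proportion of questions whose repeated runs agree exactly.
identical = sum(len(set(runs)) == 1 for runs in ratings_per_question.values())
print(f"Exact agreement across runs: {identical}/{len(ratings_per_question)} questions")
```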

COST DEMOCRATIZATION: OPPORTUNITIES AND HIDDEN COSTS

One of the most compelling aspects of LLMs is their potential to democratize access to medical information[2]. By providing instant, comprehensible responses, tools like ChatGPT could empower patients[3,4], particularly in resource-constrained settings, to better understand procedures like ESD/EMR, potentially improving treatment adherence and informed consent. The study notes that 70%-80% of internet users seek health information online, and LLMs could offer a cost-effective alternative to traditional consultations, reducing strain on healthcare systems. For providers, refining artificial intelligence (AI)-generated responses could enhance efficiency, lowering operational costs.

Yet, this cost efficiency comes with hidden expenses. Developing and maintaining LLMs requires significant computational resources, and healthcare integration demands investments in privacy-compliant systems to prevent data leakage. Moreover, inaccuracies or outdated information - such as those noted in the study for complex ESD/EMR queries - could lead to misinformed decisions, increasing downstream costs through complications or additional interventions. The study’s emphasis on human-AI collaboration underscores the need to balance initial cost savings with investments in validation and oversight to ensure long-term utility.

CLINICAL READINESS: THE MULTIMODAL CHALLENGE

The study’s reliance on text-based prompts reveals a critical barrier to clinical readiness: The lack of multimodal integration. Endoscopy is inherently visual, relying on images and scans for diagnosis and procedural planning. LLMs like ChatGPT, which process only textual input, cannot interpret endoscopic images, potentially missing critical diagnostic signals. For instance, identifying subtle mucosal abnormalities or assessing resection margins requires visual context that text alone cannot provide. This limitation risks reducing the clinical utility of LLMs, particularly in scenarios where visual data is paramount. While the study found high comprehensibility among patients, one patient rated responses to questions on surgical indications and recurrence as “low-level”, possibly reflecting the absence of visual aids to clarify complex concepts.

Recent advancements in multimodal models, such as GPT-4V and Gemini, have shown preliminary promise in gastrointestinal endoscopy by enabling the interpretation of endoscopic images for diagnostic support (Table 1). For instance, these models have been explored for analyzing images from procedures like colonoscopy and esophagogastroduodenoscopy, aiding in the identification of lesions, classification of abnormalities, and even educational simulations for medical training[5,6]. Despite these applications, persistent limitations hinder their clinical utility, including suboptimal diagnostic accuracy when compared to human endoscopists, a propensity for generating confident yet incorrect interpretations, and challenges in handling complex or subtle visual cues such as mucosal irregularities, which can lead to misdiagnoses in real-world scenarios[7,8].

Table 1 Strengths and weaknesses of text-based vs multimodal large language models.

Text-based LLMs
Strengths: Provide real-time textual guidance, differential diagnoses, and automated report generation from clinical notes and patient history; support patient education and reduce the health-education load on physicians; effective for processing textual data such as electronic health records and guidelines, aiding in treatment suggestions.
Weaknesses: Cannot interpret endoscopic images or videos, missing critical visual diagnostic cues such as mucosal abnormalities; limited real-time responsiveness and adaptability to new techniques due to reliance on pre-existing textual data; struggle with complex scenarios requiring visual context, leading to potentially incomplete assessments.

Multimodal LLMs
Strengths: Integrate images, videos, and text for comprehensive analysis, improving lesion detection, classification, and spatial localization in procedures like gastroscopy and colonoscopy; enhance diagnostic accuracy and real-time decision support through multi-scale feature fusion and domain-adaptive learning; support fine-grained visual understanding and task-specific improvements via fine-tuning, outperforming text-only models in visual tasks.
Weaknesses: Performance gaps compared with human experts, with lower sensitivity as task complexity increases; high computational demands, data-fusion challenges, and scalability issues for real-time processing of high-resolution endoscopic data; limited generalization across institutions, the need for large and diverse datasets, and interpretability concerns.

Furthermore, the study highlights the risk of “hallucinations” - AI-generated misinformation - and of performance drift due to model updates[9]. These issues, combined with ethical concerns such as bias and legal ambiguities[10], necessitate robust validation frameworks. AI systems have been documented to misinterpret endoscopic images, for example hallucinating malignant features in benign polyps during colonoscopy, leading to unwarranted invasive procedures, patient distress, and increased healthcare costs, or failing to detect subtle esophageal lesions[11,12]; meta-analyses likewise indicate that such systems do not consistently improve detection accuracy, leaving misdiagnosis rates elevated[13]. Ethically, these failures undermine patient autonomy and trust; legally, they heighten liability. A hypothetical yet plausible scenario involves an AI erroneously classifying a capsule endoscopy study as normal without endoscopist review, resulting in delayed cancer treatment, patient harm, and potential malpractice suits against physicians or AI developers for negligence[14-16]. Future multimodal LLMs, capable of processing images alongside text, could address these gaps[17], but such models are not yet standard in clinical settings.

A BALANCED PATH FORWARD

The promise of LLMs in endoscopy lies in their ability to enhance patient education and provider efficiency, but their clinical utility hinges on overcoming current limitations. To wield this double-edged sword effectively, we propose three priorities.

Multimodal integration

Develop LLMs that incorporate endoscopic images and scans to improve diagnostic accuracy and patient comprehension[18]. Implementation could begin with pilots of hybrid systems such as the multimodal AI models tested in randomized crossover studies for diagnosing pancreatic lesions on endoscopic ultrasound, where visual data from scans are fused with textual reports to boost accuracy beyond traditional methods.
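As a minimal sketch of what the front end of such a pilot might look like (not a clinically validated pipeline, and not drawn from the cited studies), an endoscopic image and a free-text question can be submitted together to a general-purpose multimodal chat model; the model name, file name, and prompt below are assumptions for illustration, using the OpenAI Python SDK.

```python
# Hypothetical image-plus-text query to a multimodal chat model (illustration only;
# a clinical deployment would require privacy-compliant infrastructure, prospective
# validation, and endoscopist oversight).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_with_image(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed multimodal model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(ask_with_image("lesion_frame.jpg", "Describe the mucosal pattern and list differentials."))
```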

Validation frameworks

Establish continuous monitoring and multicenter validation studies to track performance drift and ensure generalizability across diverse populations. Multicenter validation protocols, as demonstrated in studies assessing AI for cholangioscopy in detecting biliary malignancies, can incorporate continuous performance monitoring through real-time data aggregation from diverse patient cohorts to mitigate drift and preserve generalizability[19].
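One minimal form of such monitoring (the baseline, margin, and outcomes below are assumptions, not values from the cited cholangioscopy study) is to re-score a fixed audit set after each model update and flag any drop beyond a tolerated margin:

```python
# Illustrative drift check on a fixed audit set; baseline and margin are assumed values.
from statistics import mean

BASELINE_ACCURACY = 0.86   # assumed accuracy from initial multicenter validation
DRIFT_MARGIN = 0.05        # assumed tolerated drop before human review is triggered

def check_for_drift(recent_outcomes: list[int]) -> bool:
    """recent_outcomes: 1 if the model answer was judged correct on an audit item, else 0."""
    recent_accuracy = mean(recent_outcomes)
    drifted = recent_accuracy < BASELINE_ACCURACY - DRIFT_MARGIN
    print(f"Rolling accuracy: {recent_accuracy:.2f} "
          f"(baseline {BASELINE_ACCURACY:.2f}) -> {'REVIEW' if drifted else 'OK'}")
    return drifted

# Example with placeholder outcomes after a model update: 0.70 falls below 0.81 and flags review.
check_for_drift([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
```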

Human-AI collaboration

Promote hybrid models where AI supports, but does not replace, clinical expertise, ensuring empathy and personalized care remain central. Pilot examples such as the Food and Drug Administration-approved GI Genius system for colonoscopy integrate AI to highlight abnormalities in real time while relying on endoscopists for final decisions, fostering hybrid workflows that preserve clinical empathy and personalization in procedures like polyp detection[20].
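The division of labor in such systems can be sketched conceptually as an “AI flags, clinician decides” loop (this is not the GI Genius interface or API; the scores, threshold, and review step below are invented for illustration):

```python
# Conceptual "AI flags, clinician decides" triage loop; all values are placeholders.
from dataclasses import dataclass

@dataclass
class Finding:
    frame_id: int
    ai_confidence: float  # model's confidence that a lesion is present

ALERT_THRESHOLD = 0.60  # assumed threshold for surfacing a frame to the endoscopist

def endoscopist_confirms(finding: Finding) -> bool:
    # Placeholder for the human review step (e.g., an on-screen prompt during withdrawal).
    return finding.ai_confidence > 0.80  # stand-in logic for this sketch only

def triage(findings: list[Finding]) -> list[dict]:
    decisions = []
    for f in findings:
        if f.ai_confidence < ALERT_THRESHOLD:
            continue  # low-confidence frames are not surfaced at all
        # The AI only highlights the frame; the final decision remains human.
        decisions.append({"frame": f.frame_id,
                          "ai_confidence": f.ai_confidence,
                          "confirmed": endoscopist_confirms(f)})
    return decisions

print(triage([Finding(101, 0.92), Finding(215, 0.45), Finding(322, 0.71)]))
```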

CONCLUSION

While LLMs like ChatGPT offer cost-effective solutions for patient education in endoscopy, their clinical readiness is limited by text-only processing and potential inaccuracies. By addressing these challenges through multimodal integration and rigorous oversight, we can sharpen the blade of clinical utility while mitigating the risks of cost-driven compromises.

Footnotes

Provenance and peer review: Invited article; Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Oncology

Country of origin: China

Peer-review report’s classification

Scientific Quality: Grade B

Novelty: Grade B

Creativity or Innovation: Grade B

Scientific Significance: Grade C

P-Reviewer: Peng WL, MD, Lecturer, Researcher, China S-Editor: Wang JJ L-Editor: A P-Editor: Zhao S

References
1. Wang SS, Gao H, Lin PY, Qian TC, Du Y, Xu L. Evaluating chat generative pretrained transformer in answering questions on endoscopic mucosal resection and endoscopic submucosal dissection. World J Gastrointest Oncol. 2025;17:109792.
2. Feng Y, Chan TH, Yin G, Yu L. Democratizing large language model-based graph data augmentation via latent knowledge graphs. Neural Netw. 2025;191:107777.
3. Collin H, Keogh K, Basto M, Loeb S, Roberts MJ. ChatGPT can help guide and empower patients after prostate cancer diagnosis. Prostate Cancer Prostatic Dis. 2025;28:513-515.
4. Khetan SP, Suvarna SS. Empowering patients through AI: the role of ChatGPT in daily monitoring of blood pressure and blood glucose levels. Glob Cardiol Sci Pract. 2024;2024:e202456.
5. Wu Y, Ramai D, Smith ER, Mega PF, Qatomah A, Spadaccini M, Maida M, Papaefthymiou A. Applications of Artificial Intelligence in Gastrointestinal Endoscopic Ultrasound: Current Developments, Limitations and Future Directions. Cancers (Basel). 2024;16:4196.
6. Nie Z, Xu M, Wang Z, Lu X, Song W. A Review of Application of Deep Learning in Endoscopic Image Processing. J Imaging. 2024;10:275.
7. Habe TT, Haataja K, Toivanen P. Review of Deep Learning Performance in Wireless Capsule Endoscopy Images for GI Disease Classification. F1000Res. 2024;13:201.
8. Saraiva MM, Ribeiro T, Agudo B, Afonso J, Mendes F, Martins M, Cardoso P, Mota J, Almeida MJ, Costa A, Gonzalez Haba Ruiz M, Widmer J, Moura E, Javed A, Manzione T, Nadal S, Barroso LF, de Parades V, Ferreira J, Macedo G. Evaluating ChatGPT-4 for the Interpretation of Images from Several Diagnostic Techniques in Gastroenterology. J Clin Med. 2025;14:572.
9. Hwang Y, Jeong SH. Generative Artificial Intelligence and Misinformation Acceptance: An Experimental Test of the Effect of Forewarning About Artificial Intelligence Hallucination. Cyberpsychol Behav Soc Netw. 2025;28:284-289.
10. Bottomley D, Thaldar D. Liability for harm caused by AI in healthcare: an overview of the core legal concepts. Front Pharmacol. 2023;14:1297353.
11. Zha B, Cai A, Wang G. Diagnostic Accuracy of Artificial Intelligence in Endoscopy: Umbrella Review. JMIR Med Inform. 2024;12:e56361.
12. Buendgens L, Cifci D, Ghaffari Laleh N, van Treeck M, Koenen MT, Zimmermann HW, Herbold T, Lux TJ, Hann A, Trautwein C, Kather JN. Weakly supervised end-to-end artificial intelligence in gastrointestinal endoscopy. Sci Rep. 2022;12:4829.
13. Sibomana O, Saka SA, Grace Uwizeyimana M, Mwangi Kihunyu A, Obianke A, Oluwo Damilare S, Bueh LT, Agbelemoge BOG, Omoefe Oveh R. Artificial Intelligence-Assisted Endoscopy in Diagnosis of Gastrointestinal Tumors: A Review of Systematic Reviews and Meta-Analyses. Gastro Hep Adv. 2025;4:100754.
14. Jovanovic I. AI in endoscopy and medicolegal issues: the computer is guilty in case of missed cancer? Endosc Int Open. 2020;8:E1385-E1386.
15. Elamin S, Duffourc M, Berzin TM, Geissler ME, Gerke S. Artificial Intelligence and Medical Liability in Gastrointestinal Endoscopy. Clin Gastroenterol Hepatol. 2024;22:1165-1169.e1.
16. Ramoni D, Scuricini A, Carbone F, Liberale L, Montecucco F. Artificial intelligence in gastroenterology: Ethical and diagnostic challenges in clinical practice. World J Gastroenterol. 2025;31:102725.
17. Maruyama H, Toyama Y, Takanami K, Takase K, Kamei T. Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study. JMIR Med Educ. 2025;11:e69313.
18. Qin Y, Chang J, Li L, Wu M. Enhancing gastroenterology with multimodal learning: the role of large language model chatbots in digestive endoscopy. Front Med (Lausanne). 2025;12:1583514.
19. Marya NB, Powers PD, AbiMansour JP, Marcello M, Thiruvengadam NR, Nasser-Ghodsi N, Rau P, Zivny J, Mehta S, Marshall C, Leonor P, Che K, Abu Dayyeh B, Storm AC, Petersen BT, Law R, Martin JA, Vargas EJ, Chandrasekhara V. Multicenter validation of a cholangioscopy artificial intelligence system for the evaluation of biliary tract disease. Endoscopy. 2025.
20. Reverberi C, Rigon T, Solari A, Hassan C, Cherubini P; GI Genius CADx Study Group, Cherubini A. Experimental evidence of effective human-AI collaboration in medical decision-making. Sci Rep. 2022;12:14952.