Au SCL. Cost vs clinical utility on application of large language models in clinical practice: A double-edged sword. World J Gastrointest Oncol 2025; 17(12): 114341 [DOI: 10.4251/wjgo.v17.i12.114341]
Corresponding Author of This Article
Sunny Chi Lik Au, Chief Physician, Clinical Assistant Professor (Honorary), Research Fellow, School of Clinical Medicine, The University of Hong Kong, 9/F, MO Office, Lo Ka Chow Memorial Ophthalmic Centre, No. 19 Eastern Hospital Road, Causeway Bay, Hong Kong 999077, China. kilihcua@gmail.com
Research Domain of This Article
Medicine, General & Internal
Article-Type of This Article
Editorial
Open-Access Policy of This Article
This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Journal Information of This Article
Publication Name
World Journal of Gastrointestinal Oncology
ISSN
1948-5204
Publisher of This Article
Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA
Author contributions: Au SCL drafted the manuscript, acquired and analyzed the data, and revised the manuscript.
Conflict-of-interest statement: The author reports no relevant conflicts of interest for this article.
Received: September 16, 2025 Revised: September 26, 2025 Accepted: October 27, 2025 Published online: December 15, 2025 Processing time: 86 Days and 0.2 Hours
Abstract
As large language models increasingly permeate medical workflows, a recent study evaluating ChatGPT 4.0’s performance in addressing patient queries about endoscopic submucosal dissection and endoscopic mucosal resection offers critical insights into three domains: Performance parity, cost democratization, and clinical readiness. The findings highlight ChatGPT’s high accuracy, completeness, and comprehensibility, suggesting potential cost efficiency in patient education. Yet, cost-effectiveness alone does not ensure clinical utility. Notably, the study relied exclusively on text-based prompts, omitting multimodal data such as photographs or endoscopic scans. This is a significant limitation in a visually driven field like endoscopy, where large language model performance may drop precipitously without image context. Without multimodal integration, artificial intelligence tools risk failing to capture key diagnostic signals, underscoring the need for cautious adoption and robust validation in clinical practice.
Core Tip: As large language models increasingly permeate medical workflows, this study offers insight into three areas: Performance parity, cost democratization, and clinical readiness. One compelling finding was cost efficiency; yet cost-effectiveness alone does not ensure clinical utility. Notably, the study relied exclusively on text-based prompts, omitting multimodal data such as photographs or scans. This is an important limitation in a visually driven domain such as endoscopy, where large language model performance can drop precipitously when deprived of image context. Without multimodal integration, artificial intelligence tools risk failing to capture key diagnostic signals.
The integration of large language models (LLMs) like ChatGPT into healthcare represents a transformative opportunity to enhance patient education and streamline clinical workflows. A recent study by Wang et al[1] evaluated ChatGPT 4.0’s ability to answer 30 patient-centered questions about endoscopic submucosal dissection (ESD) and endoscopic mucosal resection (EMR), procedures critical for managing colorectal polyps and early gastric cancer. The study’s findings - high accuracy, substantial completeness, and strong comprehensibility - position LLMs as potential tools for cost-effective patient support. However, the promise of cost democratization is tempered by limitations in clinical utility, particularly the absence of multimodal data integration. This editorial explores the dual nature of LLMs as a “double-edged sword” in endoscopy, balancing cost efficiency against clinical readiness, and calls for a measured approach to their deployment.
PERFORMANCE PARITY: A STEP TOWARD RELIABILITY
The study demonstrates that ChatGPT 4.0 outperforms traditional search engines like Google, achieving a mean accuracy score of 5.14/6 and high reproducibility across repeated queries. Questions on preoperative preparation, postoperative pathology, and family support scored particularly well, suggesting LLMs can deliver reliable, patient-friendly information for routine queries. This performance parity with expert knowledge, as validated by gastroenterologists with over 20 years of experience, indicates that LLMs could reduce the burden on healthcare providers by addressing common patient concerns. However, lower scores on complex topics, such as surgical indications or insurance policies, reveal gaps in the model’s ability to handle nuanced or region-specific medical contexts. These inconsistencies highlight that while LLMs may approach expert-level performance in some areas, they are not yet a substitute for clinical judgment.
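The evaluation design described above - repeated queries of the same question, rated by experts on a 6-point scale - can be sketched as a simple aggregation of per-question means and spreads. The function name and sample ratings below are illustrative, not the study's actual data:

```python
from statistics import mean, pstdev

def summarize_scores(scores_by_question):
    """Aggregate expert ratings (a 1-6 accuracy scale, as in the study)
    collected over independent repetitions of the same prompt.

    scores_by_question: dict mapping question -> list of ratings.
    Returns per-question mean and a simple reproducibility measure
    (population standard deviation; lower = more reproducible).
    """
    summary = {}
    for question, ratings in scores_by_question.items():
        summary[question] = {
            "mean": mean(ratings),
            "spread": pstdev(ratings),
        }
    return summary

# Hypothetical ratings: a routine topic scores high and consistently,
# a region-specific topic scores lower and more variably.
ratings = {
    "preoperative preparation": [6, 6, 5],
    "insurance policies": [4, 3, 5],
}
print(summarize_scores(ratings))
```

Reporting spread alongside the mean captures the reproducibility dimension the study emphasizes: a model that averages 5/6 but swings between 3 and 6 on repeats is less clinically dependable than the mean alone suggests.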
COST DEMOCRATIZATION: OPPORTUNITIES AND HIDDEN COSTS
One of the most compelling aspects of LLMs is their potential to democratize access to medical information[2]. By providing instant, comprehensible responses, tools like ChatGPT could empower patients[3,4], particularly in resource-constrained settings, to better understand procedures like ESD/EMR, potentially improving treatment adherence and informed consent. The study notes that 70%-80% of internet users seek health information online, and LLMs could offer a cost-effective alternative to traditional consultations, reducing strain on healthcare systems. For providers, refining artificial intelligence (AI)-generated responses could enhance efficiency, lowering operational costs.
Yet, this cost efficiency comes with hidden expenses. Developing and maintaining LLMs requires significant computational resources, and healthcare integration demands investments in privacy-compliant systems to prevent data leakage. Moreover, inaccuracies or outdated information - such as those noted in the study for complex ESD/EMR queries - could lead to misinformed decisions, increasing downstream costs through complications or additional interventions. The study’s emphasis on human-AI collaboration underscores the need to balance initial cost savings with investments in validation and oversight to ensure long-term utility.
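The trade-off between up-front savings and hidden downstream expenses can be made explicit with a toy break-even model. Every parameter below is an illustrative assumption, not a figure from the study:

```python
def net_savings(queries_per_month, cost_per_consult, cost_per_llm_query,
                oversight_cost_per_month, error_rate, cost_per_error):
    """Toy model of LLM patient-education economics: gross savings from
    deflected consultations, minus fixed oversight/validation overhead,
    minus the expected downstream cost of erroneous answers.
    All inputs are hypothetical; units are arbitrary currency per month.
    """
    gross = queries_per_month * (cost_per_consult - cost_per_llm_query)
    downstream = queries_per_month * error_rate * cost_per_error
    return gross - oversight_cost_per_month - downstream
```

Even a low error rate can dominate the arithmetic when the cost of a single misinformed decision (complications, repeat interventions) is large, which is why initial per-query savings alone are a poor proxy for long-term utility.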
CLINICAL READINESS: THE MULTIMODAL CHALLENGE
The study’s reliance on text-based prompts reveals a critical barrier to clinical readiness: The lack of multimodal integration. Endoscopy is inherently visual, relying on images and scans for diagnosis and procedural planning. LLMs like ChatGPT, which process only textual input, cannot interpret endoscopic images, potentially missing critical diagnostic signals. For instance, identifying subtle mucosal abnormalities or assessing resection margins requires visual context that text alone cannot provide. This limitation risks reducing the clinical utility of LLMs, particularly in scenarios where visual data is paramount. While the study found high comprehensibility among patients, one patient rated responses to questions on surgical indications and recurrence as “low-level”, possibly reflecting the absence of visual aids to clarify complex concepts.
Recent advancements in multimodal models, such as GPT-4V and Gemini, have shown preliminary promise in gastrointestinal endoscopy by enabling the interpretation of endoscopic images for diagnostic support (Table 1). For instance, these models have been explored for analyzing images from procedures like colonoscopy and esophagogastroduodenoscopy, aiding in the identification of lesions, classification of abnormalities, and even educational simulations for medical training[5,6]. Despite these applications, persistent limitations hinder their clinical utility, including suboptimal diagnostic accuracy when compared to human endoscopists, a propensity for generating confident yet incorrect interpretations, and challenges in handling complex or subtle visual cues such as mucosal irregularities, which can lead to misdiagnoses in real-world scenarios[7,8].
Table 1 Strengths and weaknesses of text-based vs multimodal large language models

Text-based LLMs
Strengths: (1) Provide real-time textual guidance, differential diagnoses, and automated report generation from clinical notes and patient history; (2) support patient education and reduce the health education load on physicians; (3) effective for processing textual data such as electronic health records and guidelines, aiding in treatment suggestions.
Weaknesses: (1) Cannot interpret endoscopic images or videos, missing critical visual diagnostic cues such as mucosal abnormalities; (2) limited real-time responsiveness and adaptability to new techniques owing to reliance on pre-existing textual data; (3) struggle with complex scenarios requiring visual context, leading to potentially incomplete assessments.

Multimodal LLMs
Strengths: (1) Integrate images, videos, and text for comprehensive analysis, improving lesion detection, classification, and spatial localization in procedures such as gastroscopy and colonoscopy; (2) enhance diagnostic accuracy and real-time decision support through multi-scale feature fusion and domain-adaptive learning; (3) support fine-grained visual understanding and task-specific improvements via fine-tuning, outperforming text-only models in visual tasks.
Weaknesses: (1) Performance gaps compared with human experts, with lower sensitivity as task complexity increases; (2) high computational demands, data fusion challenges, and scalability issues for real-time processing of high-resolution endoscopic data; (3) limited generalization across institutions, need for large and diverse datasets, and interpretability concerns.
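The contrast in Table 1 ultimately comes down to what a single prompt can carry. A minimal Python sketch of the difference, using the content-parts message format common to multimodal chat APIs (GPT-4V-style endpoints); the function name, question text, and image bytes are illustrative assumptions:

```python
import base64

def build_message(question, image_bytes=None, mime="image/png"):
    """Build one user turn in the content-parts format used by
    multimodal LLM chat APIs: a text part, plus an optional
    base64-encoded image part when visual context is available."""
    parts = [{"type": "text", "text": question}]
    if image_bytes is not None:
        data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
        parts.append({"type": "image_url", "image_url": {"url": data_url}})
    return {"role": "user", "content": parts}

# Text-only query: the modality the study evaluated.
text_msg = build_message("What are the indications for ESD vs EMR?")

# The same class of query grounded in an endoscopic frame
# (stand-in bytes here; a real frame would be read from file).
frame = b"\x89PNG..."
multimodal_msg = build_message(
    "Does this lesion appear suitable for en bloc ESD?", image_bytes=frame
)
```

A text-only model can answer the first message from its training corpus; the second is unanswerable without the image part, which is precisely the diagnostic signal the study's prompts could not convey.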
Furthermore, the study highlights the risk of "hallucinations" - AI-generated misinformation - and performance drift due to model updates[9]. These issues, combined with ethical concerns such as bias and legal ambiguities[10], necessitate robust validation frameworks. AI systems have been documented to misinterpret endoscopic images - for example, hallucinating malignant features in benign polyps during colonoscopy, leading to unwarranted invasive procedures, patient distress, and increased healthcare costs, or failing to detect subtle esophageal lesions[11,12] - and meta-analyses show that detection accuracy is not uniformly improved, leaving misdiagnosis rates elevated[13]. Ethically, such errors undermine patient autonomy and trust; legally, they heighten liability. A hypothetical yet plausible scenario involves an AI erroneously classifying a capsule endoscopy study as normal without endoscopist review, resulting in delayed cancer treatment, patient harm, and potential malpractice suits against physicians or AI developers for negligence[14-16]. Future multimodal LLMs capable of processing images alongside text could address these gaps[17], but such models are not yet standard in clinical settings.
A BALANCED PATH FORWARD
The promise of LLMs in endoscopy lies in their ability to enhance patient education and provider efficiency, but their clinical utility hinges on overcoming current limitations. To wield this double-edged sword effectively, we propose three priorities.
Multimodal integration
Develop LLMs that incorporate endoscopic images and scans to improve diagnostic accuracy and patient comprehension[18]. Implementation could begin by piloting hybrid systems, such as the multimodal AI models tested in randomized crossover studies for diagnosing pancreatic lesions via endoscopic ultrasound, in which visual data from scans are fused with textual reports to boost accuracy beyond traditional methods.
Validation frameworks
Establish multicenter validation studies and continuous performance monitoring to track drift and ensure generalizability across diverse populations. Such protocols, as demonstrated in studies assessing AI for cholangioscopy in detecting biliary malignancies, aggregate real-time data from diverse patient cohorts to mitigate drift and confirm generalizability[19].
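One way to operationalize continuous monitoring is a rolling comparison of recent performance against a validated baseline. A minimal sketch, with illustrative window size and tolerance (no published protocol prescribes these particular values):

```python
from collections import deque

class DriftMonitor:
    """Flag performance drift by comparing a rolling window of recent
    outcome scores against a baseline established during validation.
    Window size and tolerance here are illustrative defaults."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.window = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, correct):
        """Record one adjudicated model output (True = acceptable)."""
        self.window.append(1.0 if correct else 0.0)

    def drifted(self):
        """True once a full window's rolling accuracy falls more than
        `tolerance` below the validated baseline."""
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data to judge
        rolling = sum(self.window) / len(self.window)
        return rolling < self.baseline - self.tolerance
```

In practice the "correct" signal would come from periodic expert adjudication of sampled outputs, and a drift flag would trigger revalidation before the updated model continues to face patients.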
Human-AI collaboration
Promote hybrid models in which AI supports, but does not replace, clinical expertise, ensuring empathy and personalized care remain central. Pilot examples such as the Food and Drug Administration-approved GI Genius system for colonoscopy integrate AI to highlight abnormalities in real time while relying on endoscopists for final decisions, fostering hybrid workflows that preserve clinical empathy and personalization in procedures like polyp detection[20].
CONCLUSION
While LLMs like ChatGPT offer cost-effective solutions for patient education in endoscopy, their clinical readiness is limited by text-only processing and potential inaccuracies. By addressing these challenges through multimodal integration and rigorous oversight, we can sharpen the blade of clinical utility while mitigating the risks of cost-driven compromises.
Footnotes
Provenance and peer review: Invited article; Externally peer reviewed.
Peer-review model: Single blind
Specialty type: Oncology
Country of origin: China
Peer-review report’s classification
Scientific Quality: Grade B
Novelty: Grade B
Creativity or Innovation: Grade B
Scientific Significance: Grade C
P-Reviewer: Peng WL, MD, Lecturer, Researcher, China; S-Editor: Wang JJ; L-Editor: A; P-Editor: Zhao S
Feng Y, Chan TH, Yin G, Yu L. Democratizing large language model-based graph data augmentation via latent knowledge graphs. Neural Netw. 2025;191:107777.
Saraiva MM, Ribeiro T, Agudo B, Afonso J, Mendes F, Martins M, Cardoso P, Mota J, Almeida MJ, Costa A, Gonzalez Haba Ruiz M, Widmer J, Moura E, Javed A, Manzione T, Nadal S, Barroso LF, de Parades V, Ferreira J, Macedo G. Evaluating ChatGPT-4 for the Interpretation of Images from Several Diagnostic Techniques in Gastroenterology. J Clin Med. 2025;14:572.
Sibomana O, Saka SA, Grace Uwizeyimana M, Mwangi Kihunyu A, Obianke A, Oluwo Damilare S, Bueh LT, Agbelemoge BOG, Omoefe Oveh R. Artificial Intelligence-Assisted Endoscopy in Diagnosis of Gastrointestinal Tumors: A Review of Systematic Reviews and Meta-Analyses. Gastro Hep Adv. 2025;4:100754.
Marya NB, Powers PD, AbiMansour JP, Marcello M, Thiruvengadam NR, Nasser-Ghodsi N, Rau P, Zivny J, Mehta S, Marshall C, Leonor P, Che K, Abu Dayyeh B, Storm AC, Petersen BT, Law R, Martin JA, Vargas EJ, Chandrasekhara V. Multicenter validation of a cholangioscopy artificial intelligence system for the evaluation of biliary tract disease. Endoscopy. 2025.