Editorial
Copyright ©The Author(s) 2025.
World J Gastrointest Oncol. Dec 15, 2025; 17(12): 114341
Published online Dec 15, 2025. doi: 10.4251/wjgo.v17.i12.114341
Table 1 Strengths and weaknesses of text-based vs multimodal large language models

| | Text-based LLMs | Multimodal LLMs |
|---|---|---|
| Strengths | Provide real-time textual guidance, differential diagnoses, and automated report generation from clinical notes and patient history | Integrate images, videos, and text for comprehensive analysis, improving lesion detection, classification, and spatial localization in procedures like gastroscopy and colonoscopy |
| | Support patient education and reduce health education load on physicians | Enhance diagnostic accuracy and real-time decision support through multi-scale feature fusion and domain-adaptive learning |
| | Effective for processing textual data like electronic health records and guidelines, aiding in treatment suggestions | Support fine-grained visual understanding and task-specific improvements via fine-tuning, outperforming text-only models in visual tasks |
| Weaknesses | Cannot interpret endoscopic images or videos, missing critical visual diagnostic cues such as mucosal abnormalities | Performance gaps compared to human experts, with lower sensitivity to increased task complexity |
| | Limited real-time responsiveness and adaptability to new techniques due to reliance on pre-existing textual data | High computational demands, data fusion challenges, and scalability issues for real-time processing of high-resolution endoscopic data |
| | Struggle with complex scenarios requiring visual context, leading to potential incomplete assessments | Limited generalization across institutions and need for large, diverse datasets, plus interpretability concerns |