Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis

doi:10.3748/wjg.118690

Journal Information of This Article

Publication Name: World Journal of Gastroenterology
ISSN: 1007-9327
Publisher: Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Retrospective Study

Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.

World J Gastroenterol. Jun 28, 2026; 32(24): 118690
Published online Jun 28, 2026. doi: 10.3748/wjg.118690

Figure 1 Overview of the study workflow, illustrating the collection of endoscopic images, expert consensus grading, model evaluation phases, and comparative performance analysis. UC: Ulcerative colitis; MES: Mayo Endoscopic Subscore; IBD: Inflammatory bowel disease; AI: Artificial intelligence.

Figure 2 Comprehensive performance analysis of five multimodal large language models. A: Overall ranking of five multimodal large language models based on composite performance scores; B: Comparison of core evaluation metrics (accuracy, F1 score, κ, precision, recall). Statistical tests: McNemar for accuracy, precision, recall, and F1 score; paired Wilcoxon for κ (P < 0.05 considered significant); C: Radar chart illustrating performance across five key dimensions; D: Heatmap summarizing six quantitative indicators (accuracy, precision, recall, F1 score, κ, and mean absolute error); E: Precision-recall scatter plot demonstrating the trade-off between precision and recall; F: Comparison of error metrics (mean absolute error and root mean squared error) where lower values indicate superior performance. MAE: Mean absolute error; RMSE: Root mean squared error.
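The metrics named in this caption can be illustrated with a minimal sketch. It assumes integer grades 0 to 3, macro-averaged precision, recall, and F1 score, and unweighted Cohen's κ; the caption does not say which averaging or κ variant the authors used, so these choices, like the function name `grading_metrics`, are illustrative assumptions only.

```python
from collections import Counter
from math import sqrt


def grading_metrics(true, pred, labels=(0, 1, 2, 3)):
    """Summarize agreement between reference and predicted ordinal grades.

    Assumptions (not taken from the article): macro averaging over the
    four grades and unweighted Cohen's kappa.
    """
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("true and pred must be non-empty and of equal length")
    n = len(true)
    pairs = list(zip(true, pred))

    accuracy = sum(t == p for t, p in pairs) / n

    # One-vs-rest precision, recall, and F1 for each grade.
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))
    k = len(labels)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    true_counts, pred_counts = Counter(true), Counter(pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 1.0

    # Ordinal error metrics: distance between predicted and reference grade.
    mae = sum(abs(t - p) for t, p in pairs) / n
    rmse = sqrt(sum((t - p) ** 2 for t, p in pairs) / n)

    return {
        "accuracy": accuracy,
        "precision": sum(x[0] for x in per_class) / k,
        "recall": sum(x[1] for x in per_class) / k,
        "f1": sum(x[2] for x in per_class) / k,
        "kappa": kappa,
        "mae": mae,
        "rmse": rmse,
    }
```

Unlike accuracy, the two error metrics distinguish a one-grade miss from a two-grade miss, which is why lower values indicate better ordinal grading.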

Figure 3 Segment-wise performance evaluation of multimodal large language models in phase B. A: Overall ranking of five multimodal large language models (Gemini-2.5-Pro, Grok-4, GPT-4o, GPT-5, and Qwen-VL-Max) based on weighted composite scores; B: Summary of key performance metrics (accuracy, F1 score, κ, precision, and recall); C and D: Heatmaps illustrating segment-specific accuracy and mean absolute error across six colonic regions; E: Bar chart showing average accuracy by segment with pairwise comparisons assessed using the McNemar test (P < 0.05); F: Variability analysis of accuracy across segments (lower values indicate greater stability); G and H: Radar plots depicting segment-level and overall performance profiles across the five models. MAE: Mean absolute error; ACC: Accuracy; Corr: Correlation.
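As a rough illustration of the McNemar test mentioned in these captions, the sketch below compares two raters on the same images using only the discordant pairs. It uses the exact binomial form of the test; whether the authors used the exact or the chi-square variant is not stated, and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    correct_a[i] and correct_b[i] say whether rater A and rater B graded
    image i correctly. Only discordant pairs carry information.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both raters must be scored on the same images")
    # b: A correct while B wrong; c: A wrong while B correct.
    b = sum(a and not o for a, o in zip(correct_a, correct_b))
    c = sum(o and not a for a, o in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis, each discordant pair favors A or B with
    # probability 0.5, so min(b, c) follows a Binomial(n, 0.5) tail.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A returned p-value below 0.05 would correspond to the significance threshold quoted in the captions.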

Figure 4 Confusion matrices of expert scoring. A-D: Confusion matrices depicting the Mayo Endoscopic Subscore classification results for the two senior inflammatory bowel disease physicians. The horizontal axis represents the true Mayo Endoscopic Subscore grades, and the vertical axis indicates the predicted grades assigned by each physician. Color intensity corresponds to the number or proportion of images in each classification cell, with deeper red tones denoting higher counts or frequencies.

Figure 5 Confusion matrices of multimodal large language models in phase A. A-J: Confusion matrices illustrating the Mayo Endoscopic Subscore classification outputs of five multimodal large language models: GPT-5 (A and B), GPT-4o (C and D), Gemini-2.5-Pro (E and F), Grok-4 (G and H), and Qwen-VL-Max (I and J). The horizontal axis denotes the true Mayo Endoscopic Subscore grades, and the vertical axis represents the predicted grades generated by each model. Color intensity corresponds to the number or proportion of images in each classification cell, with deeper red tones indicating higher counts or frequencies.

Figure 6 Confusion matrices of multimodal large language models in phase B. A-J: Confusion matrices illustrating the Mayo Endoscopic Subscore classification outputs of five multimodal large language models under the segment-informed condition: GPT-5 (A and B), GPT-4o (C and D), Gemini-2.5-Pro (E and F), Grok-4 (G and H), and Qwen-VL-Max (I and J). The horizontal axis denotes the true Mayo Endoscopic Subscore grades, and the vertical axis represents the predicted grades generated by each model. Color intensity corresponds to the number or proportion of images in each classification cell, with deeper red tones indicating higher counts or frequencies.
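The axis convention shared by the confusion-matrix captions (true grades on the horizontal axis, predicted grades on the vertical axis) can be sketched as follows. The helper names are hypothetical, and normalizing each column by the number of images with that true grade is an assumption about how the "proportion" panels were derived; the captions do not specify it.

```python
def confusion_counts(true, pred, n_grades=4):
    """Count matrix with rows = predicted grade, columns = true grade."""
    counts = [[0] * n_grades for _ in range(n_grades)]
    for t, p in zip(true, pred):
        if not (0 <= t < n_grades and 0 <= p < n_grades):
            raise ValueError("grades must lie in 0..n_grades-1")
        counts[p][t] += 1  # vertical axis: predicted; horizontal axis: true
    return counts


def column_proportions(counts):
    """Divide each cell by its column total (images with that true grade)."""
    n = len(counts)
    totals = [sum(counts[r][c] for r in range(n)) for c in range(n)]
    return [
        [counts[r][c] / totals[c] if totals[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]
```

With this orientation, the diagonal of the proportion matrix gives per-grade recall, and off-diagonal cells in a column show how images of one true grade were misassigned.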

Citation: Zhao XY, Shen Y, He JJ, Zhou QY, Jiang LS, Zhou ZR, An FM, Zhan Q, Sun J, Feng W. Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis. World J Gastroenterol 2026; 32(24): 118690
URL: https://www.wjgnet.com/1007-9327/full/v32/i24/118690.htm
DOI: https://dx.doi.org/10.3748/wjg.118690
