Copyright: ©Author(s) 2026.
World J Gastroenterol. Jun 28, 2026; 32(24): 118690
Published online Jun 28, 2026. doi: 10.3748/wjg.118690
Figure 1 Overview of the study workflow, illustrating the collection of endoscopic images, expert consensus grading, model evaluation phases, and comparative performance analysis.
UC: Ulcerative colitis; MES: Mayo Endoscopic Subscore; IBD: Inflammatory bowel disease; AI: Artificial intelligence.
Figure 2 Comprehensive performance analysis of five multimodal large language models.
A: Overall ranking of five multimodal large language models based on composite performance scores; B: Comparison of core evaluation metrics (accuracy, F1 score, κ, precision, recall). Statistical tests: McNemar for accuracy, precision, recall, and F1 score; paired Wilcoxon for κ (P < 0.05 considered significant); C: Radar chart illustrating performance across five key dimensions; D: Heatmap summarizing six quantitative indicators (accuracy, precision, recall, F1 score, κ, and mean absolute error); E: Precision-recall scatter plot demonstrating the trade-off between precision and recall; F: Comparison of error metrics (mean absolute error and root mean squared error) where lower values indicate superior performance. MAE: Mean absolute error; RMSE: Root mean squared error.
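The quantitative indicators named in this caption can be illustrated with a minimal Python sketch. It assumes Mayo Endoscopic Subscore grades are coded as the integers 0 to 3 and that κ is the unweighted Cohen's κ; the caption does not state whether a weighted variant was used. The function name `grading_metrics` and the toy grades are hypothetical and are not study data.

```python
import math
from collections import Counter


def grading_metrics(true_grades, predicted_grades):
    """Return accuracy, MAE, RMSE and unweighted Cohen's kappa for paired grades."""
    if len(true_grades) != len(predicted_grades) or not true_grades:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_grades)
    pairs = list(zip(true_grades, predicted_grades))

    # Share of images whose predicted grade equals the reference grade.
    accuracy = sum(t == p for t, p in pairs) / n
    # Ordinal error sizes: mean absolute error and root mean squared error.
    mae = sum(abs(t - p) for t, p in pairs) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n)

    # Chance agreement computed from the marginal grade frequencies.
    true_counts = Counter(true_grades)
    pred_counts = Counter(predicted_grades)
    expected = sum(true_counts[g] * pred_counts[g] for g in true_counts) / (n * n)
    # Assumption: if both raters use a single identical grade, report kappa as 1.0.
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 1.0

    return {"accuracy": accuracy, "mae": mae, "rmse": rmse, "kappa": kappa}


# Hypothetical toy grades, for illustration only.
true_grades = [0, 1, 2, 3, 2, 1, 0, 3]
predicted_grades = [0, 1, 3, 3, 2, 0, 0, 2]
metrics = grading_metrics(true_grades, predicted_grades)
print(metrics)
```

Because MAE and RMSE measure distance between ordinal grades, a model that is off by one grade is penalised less than one that is off by two, which accuracy alone does not capture.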
Figure 3 Segment-wise performance evaluation of multimodal large language models in phase B.
A: Overall ranking of five multimodal large language models (Gemini-2.5-Pro, Grok-4, GPT-4o, GPT-5, and Qwen-VL-Max) based on weighted composite scores; B: Summary of key performance metrics (accuracy, F1 score, κ, precision, and recall); C and D: Heatmaps illustrating segment-specific accuracy and mean absolute error across six colonic regions; E: Bar chart showing average accuracy by segment with pairwise comparisons assessed using the McNemar test (P < 0.05); F: Variability analysis of accuracy across segments (lower values indicate greater stability); G and H: Radar plots depicting segment-level and overall performance profiles across the five models. MAE: Mean absolute error; ACC: Accuracy; Corr: Correlation.
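A pairwise McNemar comparison of the kind mentioned in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: it uses the exact (binomial) two-sided form of the test, whereas the caption does not say whether the exact or the chi-squared form was applied. The function name `mcnemar_exact` and the toy grades are hypothetical.

```python
from math import comb


def mcnemar_exact(reference, model_a, model_b):
    """Exact two-sided McNemar test on the correctness of two graders.

    Returns (a_only, b_only, p_value), where a_only counts images graded
    correctly by model A but not by model B, and b_only the reverse.
    """
    if not (len(reference) == len(model_a) == len(model_b)):
        raise ValueError("all inputs must have the same length")

    # Only discordant pairs carry information about a difference in accuracy.
    a_only = sum(a == r and b != r for r, a, b in zip(reference, model_a, model_b))
    b_only = sum(a != r and b == r for r, a, b in zip(reference, model_a, model_b))
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0

    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical toy grades, for illustration only.
reference = [0, 1, 2, 3, 2, 1, 0, 3, 1, 2]
model_a = [0, 1, 2, 3, 2, 1, 0, 3, 0, 2]
model_b = [0, 1, 3, 2, 1, 1, 0, 3, 0, 3]
a_only, b_only, p_value = mcnemar_exact(reference, model_a, model_b)
print(a_only, b_only, p_value)
```

The test is paired: both models are scored on the same images, so images on which both are right or both are wrong do not affect the p-value.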
Figure 4 Confusion matrices of expert scoring.
A-D: Confusion matrices depicting the Mayo Endoscopic Subscore classification results for the two senior inflammatory bowel disease physicians. The horizontal axis represents the true Mayo Endoscopic Subscore grades, and the vertical axis indicates the predicted grades assigned by each physician. Color intensity corresponds to the number or proportion of images in each classification cell, with deeper red tones denoting higher counts or frequencies.
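The axis convention described in these confusion-matrix captions (true grade on the horizontal axis, predicted grade on the vertical axis) can be sketched in Python as below. The helper names and toy grades are hypothetical. Normalising each column by the number of images with that true grade is an assumption, since the captions do not specify how the proportions were computed.

```python
def confusion_matrix(true_grades, predicted_grades, n_grades=4):
    """Count matrix with rows = predicted grade and columns = true grade."""
    matrix = [[0] * n_grades for _ in range(n_grades)]
    for true, predicted in zip(true_grades, predicted_grades):
        matrix[predicted][true] += 1
    return matrix


def column_proportions(matrix):
    """Divide each cell by its column total (share within each true grade)."""
    n = len(matrix)
    totals = [sum(matrix[row][col] for row in range(n)) for col in range(n)]
    return [
        [matrix[row][col] / totals[col] if totals[col] else 0.0 for col in range(n)]
        for row in range(n)
    ]


# Hypothetical toy grades (0 to 3), for illustration only.
true_grades = [0, 1, 2, 3, 2, 1, 0, 3]
predicted_grades = [0, 1, 3, 3, 2, 0, 0, 2]
counts = confusion_matrix(true_grades, predicted_grades)
proportions = column_proportions(counts)
for row in counts:
    print(row)
```

With this orientation, cells on the main diagonal are correct grades, cells above it are under-grading (predicted lower than true), and cells below it are over-grading.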
Figure 5 Confusion matrices of multimodal large language models in phase A.
A-J: Confusion matrices illustrating the Mayo Endoscopic Subscore classification outputs of five multimodal large language models: GPT-5 (A and B), GPT-4o (C and D), Gemini-2.5-Pro (E and F), Grok-4 (G and H), and Qwen-VL-Max (I and J). The horizontal axis denotes the true Mayo Endoscopic Subscore grades, and the vertical axis represents the predicted grades generated by each model. Color intensity corresponds to the number or proportion of images in each classification cell, with deeper red tones indicating higher counts or frequencies.
Figure 6 Confusion matrices of multimodal large language models in phase B.
A-J: Confusion matrices illustrating the Mayo Endoscopic Subscore classification outputs of five multimodal large language models under the segment-informed condition: GPT-5 (A and B), GPT-4o (C and D), Gemini-2.5-Pro (E and F), Grok-4 (G and H), and Qwen-VL-Max (I and J). The horizontal axis denotes the true Mayo Endoscopic Subscore grades, and the vertical axis represents the predicted grades generated by each model. Color intensity corresponds to the number or proportion of images in each classification cell, with deeper red tones indicating higher counts or frequencies.
- Citation: Zhao XY, Shen Y, He JJ, Zhou QY, Jiang LS, Zhou ZR, An FM, Zhan Q, Sun J, Feng W. Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis. World J Gastroenterol 2026; 32(24): 118690
- URL: https://www.wjgnet.com/1007-9327/full/v32/i24/118690.htm
- DOI: https://dx.doi.org/10.3748/wjg.118690