Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis

doi:10.3748/wjg.118690

Advanced Search

BPG is committed to discovery and dissemination of knowledge

Home / Archive / Volume 32, Issue 24

This Article

(17)

(16)

(0)

(1)

(475)

Table of Contents

Peer-Review Report of This Article

CrossCheck and Google Search of This Article

Academic Rules and Norms of This Article

Supplementary Materials of This Article

Citation of this article

Corresponding Author of This Article

Research Domain of This Article

Article-Type of This Article

Open-Access Policy of This Article

Times Cited Counts in Google of This Article

Journal Information of This Article

Publication Name

World Journal of Gastroenterology

ISSN

1007-9327

Publisher of This Article

Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Retrospective Study Open Access

Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.

World J Gastroenterol. Jun 28, 2026; 32(24): 118690
Published online Jun 28, 2026. doi: 10.3748/wjg.118690

Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis

Xiao-Yi Zhao, Yue Shen, Jun-Jie He, Qun-Yan Zhou, Li-Sha Jiang, Zi-Ru Zhou, Fang-Mei An, Qiang Zhan, Jing Sun, Wei Feng

Xiao-Yi Zhao, Yue Shen, Qun-Yan Zhou, Li-Sha Jiang, Zi-Ru Zhou, Fang-Mei An, Qiang Zhan, Jing Sun, Department of Gastroenterology, The Affiliated Wuxi People’s Hospital of Nanjing Medical University, Wuxi 214023, Jiangsu Province, China

Jun-Jie He, State Key Laboratory of Epigenetic Regulation and Intervention, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China

Wei Feng, Department of Information, The Affiliated Wuxi People’s Hospital of Nanjing Medical University, Wuxi 214023, Jiangsu Province, China

ORCID number: Fang-Mei An (0000-0002-6116-1989); Qiang Zhan (0000-0001-5054-3028); Jing Sun (0009-0005-0302-693X); Wei Feng (0000-0002-6843-2067).

Co-first authors: Xiao-Yi Zhao and Yue Shen.

Co-corresponding authors: Jing Sun and Wei Feng.

Author contributions: Zhao XY and Shen Y performed the research and contributed equally to this study as co-first authors; He JJ analyzed the data and wrote the manuscript; Zhou QY, Jiang LS, Zhou ZR, An FM, and Zhan Q participated in parts of the research as a doctor or an expert and assisted in data collection; Sun J and Feng W designed the study and provided overall supervision throughout the research process and contributed equally to this study as co-corresponding authors; All authors read and approved the final manuscript.

AI contribution statement: We used AI tool ChatGPT/DeepL to polish the language during the preparation of this manuscript. All scientific content, ideas, and original drafting are entirely the work of the human authors. AI tools were used exclusively for preliminary language polishing to improve the readability and grammar of the initial draft. It was not used for data analysis, or substantive writing assistance. AI tools did not participate in the study design, data analysis, or the interpretation of the results. All intellectual and scientific contributions were independently made by the authors. None of the images or figures in the manuscript were generated by AI.

Supported by the Program of Jiangsu Branch of the National Clinical Research Center for Digestive Diseases, No. JSZX202301; Basic Research Program of Jiangsu, No. BK20231146; Key Program of Wuxi Medical Center, Nanjing Medical University, No. WMCM202305; Doctoral Talent Fund of Wuxi People’s Hospital, No. BSRC202408; and Scientific Research Project of Wuxi Municipal Health Commission, No. M202503.

Institutional review board statement: The study was reviewed and approved by the Ethics Committee of Wuxi People’s Hospital Affiliated to Nanjing Medical University. (Approval No. KY25207).

Informed consent statement: This retrospective study involved minimal risk to participants and did not compromise their health or rights. Personal privacy and identifying information were fully protected, and obtaining informed consent was practically unfeasible. Accordingly, the ethics committee granted a waiver of informed consent.

Conflict-of-interest statement: There is no conflict of interest associated with the senior author or other coauthors who contributed their efforts in this manuscript.

Data sharing statement: Statistical code and dataset available from the corresponding author at fengwei@njmu.edu.cn.

Corresponding author: Wei Feng, PhD, Department of Information, The Affiliated Wuxi People’s Hospital of Nanjing Medical University, No. 299 Qingyang Road, Wuxi 214023, Jiangsu Province, China. fengwei@njmu.edu.cn

Received: January 12, 2026
Revised: February 17, 2026
Accepted: February 28, 2026
Published online: June 28, 2026
Processing time: 154 Days and 4.5 Hours

Abstract

BACKGROUND

Ulcerative colitis (UC) is a chronic inflammatory disorder that typically affects adults aged 20-40 years and presents with hematochezia and diarrhea. Endoscopic evaluation using the Mayo Endoscopic Subscore (MES; 0-3) is standard for assessing disease severity, but grading depends heavily on endoscopist experience and is prone to interobserver variability. Recent advances in multimodal large language models (MLLMs), such as GPT and Gemini, enable automated interpretation of medical images and text, offering potential for objective and reproducible MES grading. However, their accuracy and consistency across intestinal segments remain largely untested.

AIM

To compare the diagnostic accuracy of five MLLMs in UC MES evaluation and to assess consistency across segments and grades.

METHODS

We collected 402 endoscopic images from patients with UC covering the entire colon. Three experienced experts independently graded all images according to MES criteria, and 283 images with a unanimous consensus were included as the reference standard. These images were evaluated by five MLLMs and two senior physicians under two conditions: Without segmental context and with anatomical segmental information. Model and physician performance were compared, and stratified analyses were conducted by intestinal segment and MES grade.

RESULTS

The diagnostic accuracies of the two inflammatory bowel disease physicians were 81.6% and 78.4%, respectively, with strong interobserver agreement (κ = 0.692). Among the MLLMs, GPT-5 achieved the highest overall performance (F1 = 0.720) with accuracy comparable with that of physician 2 (71.7% vs 78.4%, P = 0.068). Other models exhibited significantly lower performance (all P < 0.001). The sigmoid colon was the most accurately assessed region (mean F1 = 0.682), whereas the rectum and ileocecal region were the most challenging (0.447 and 0.493, respectively). The addition of segmental information improved the accuracy of lower-performing models. Both models and physicians showed the lowest accuracy for MES = 1, reflecting the subjective nature of mild disease activity.

CONCLUSION

This study suggested that GPT-5 holds potential for static-image-based MES grading, but performance varies and requires external validation and optimization. Future work will optimize artificial intelligence endoscopy through tuning and multimodality.

Key Words: Ulcerative colitis; Mayo Endoscopic Subscore; Endoscopic assessment; Artificial intelligence; Multimodal large language models; Diagnostic accuracy

Core Tip: This single-center proof-of-concept study was the first to systematically compare five multimodal large language models for static-image-based Mayo Endoscopic Subscore grading in ulcerative colitis. The key innovation was the direct benchmarking against experienced inflammatory bowel disease physicians and the incorporation of segment-specific analyses that uncovered anatomical heterogeneity in model performance. Notably, GPT-5 achieved near-expert-level accuracy without specialized tuning, highlighting its potential role in standardized endoscopic assessment.

Citation: Zhao XY, Shen Y, He JJ, Zhou QY, Jiang LS, Zhou ZR, An FM, Zhan Q, Sun J, Feng W. Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis. World J Gastroenterol 2026; 32(24): 118690
URL: https://www.wjgnet.com/1007-9327/full/v32/i24/118690.htm
DOI: https://dx.doi.org/10.3748/wjg.118690

INTRODUCTION

Ulcerative colitis (UC) is a chronic idiopathic inflammatory disorder that predominantly affects individuals aged 20-40 years and typically presents with hematochezia and diarrhea[1]. Endoscopic evaluation remains essential for assessing disease activity[2]. The Mayo Endoscopic Subscore (MES) grades the most inflamed mucosal segment on a four-point scale (0-3: Normal; mild; moderate; and severe)[3,4]. Owing to its simplicity and reproducibility, MES is endorsed by international guidelines and serves as a primary endpoint in clinical trials of UC[5,6]. However, MES assessment depends heavily on the endoscopist’s experience and remains prone to interobserver variability[7]. Reducing this subjectivity could improve the precision of MES-based endoscopic evaluation and support wider clinical adoption.

Recent advances in multimodal large language models (MLLMs), such as OpenAI’s GPT series and Google’s Gemini model[8,9], have enabled automated image and text interpretation across diverse medical domains. MLLMs have demonstrated potential in UC to deliver algorithm-driven, objective, and reproducible grading of disease severity[10-13]. Levartovsky et al[10] reported that ChatGPT-4 achieved diagnostic accuracy comparable with that of inflammatory bowel disease (IBD) specialists in MES grade evaluation[10]. However, the study was limited by a small sample size (n = 30) and the absence of segment-based subgroup analysis, restricting generalizability. Other emerging models, including Grok and Qwen, have shown promise in diagnostic and therapeutic applications across multiple diseases[13-15], but their performance in UC remains unexplored. Furthermore, anatomical heterogeneity among intestinal segments has been shown to affect lesion detection, such as for polyps and adenomas, yet its influence on MES grading and MLLM performance has not been systematically evaluated.

Artificial intelligence (AI), particularly deep learning algorithms, has been widely applied in gastrointestinal imaging[16]. Models based on convolutional neural networks (CNNs) and vision transformers have demonstrated diagnostic performance on par with experienced endoscopists across large-scale endoscopic datasets[17-19]. However, despite the emergence of deep learning models tailored for small-sample learning[20,21], clinical utility in UC remains limited by scarce high-quality annotated data, limited generalizability, and reliance on unimodal inputs[22-24]. Moreover, conventional CNNs primarily extract pixel-level features and lack the semantic reasoning required to interpret complex clinical guidelines, further constraining their role in real-world clinical decision-making.

In contrast, emerging MLLMs represent a paradigm shift by integrating visual encoders with extensive linguistic knowledge bases[25]. MLLMs enable joint processing of heterogeneous data by aligning cross-modal information within a shared representation space[26]. Consequently, they have been applied across diverse imaging modalities, including CT, magnetic resonance imaging, and endoscopy, in conjunction with textual data. Models such as GPT demonstrate robust reasoning and analytical capabilities, and in certain domains outperform human benchmarks[27]. Vision-language integration may, therefore, allow MLLMs to interpret endoscopic findings according to established clinical diagnostic criteria rather than relying solely on pattern recognition[28]. This capability is particularly relevant for distinguishing borderline disease activity states, such as between MES 0, MES 1, and MES 2. Accurate classification in these settings depends on subtle qualitative features, including mild friability and partially obscured vascular patterns, which require semantic understanding and cognitive-level integration rather than binary pixel-level classification[29].

This study was designed as a preliminary, single-center evaluation to compare the diagnostic capabilities, interobserver agreement, and error patterns of five leading MLLMs (GPT-5, Gemini-2.5-Pro, Grok-4, GPT-4o, and Qwen-VL-Max) with those of experienced IBD physicians in the context of static-image-based MES grading. Segment-based analyses were performed to assess the effect of anatomical variation on model performance. The objective was to provide proof-of-concept evidence regarding the utility of MLLMs in this domain, acknowledging that further multicenter validation is essential to establish generalizability. The findings may inform future strategies for model training, algorithm optimization, and clinical implementation of AI-assisted static-image endoscopic evaluation.

MATERIALS AND METHODS

Study design

This retrospective, real-world diagnostic accuracy study was conducted using clinical endoscopic data from patients with UC. The evaluation was conducted between September 1, 2025 and September 20, 2025. A schematic illustration of the study design is presented in Figure 1.

Open in New Tab Full Size Figure Download Figure

Figure 1 Overview of the study workflow, illustrating the collection of endoscopic images, expert consensus grading, model evaluation phases, and comparative performance analysis. UC: Ulcerative colitis; MES: Mayo Endoscopic Subscore; IBD: Inflammatory bowel disease; AI: Artificial intelligence.

Imaging dataset

A total of 402 high-resolution colonoscopy images were consecutively collected from patients diagnosed with UC at Wuxi People’s Hospital between January 2021 and December 2025. Inclusion criteria were as follows: (1) High-quality endoscopic images; and (2) Adequate bowel preparation, defined as a Boston Bowel Preparation Scale score > 5[30]. Exclusion criteria included: (1) Images with severe artifacts; (2) Inadequate visualization of the mucosal surface; or (3) Coexistence of colorectal cancer or other colonic diseases.

All patient data were de-identified prior to analysis in accordance with the Declaration of Helsinki. The study was reviewed and approved by the Ethics Committee of Wuxi People’s Hospital Affiliated to Nanjing Medical University. (Approval No. KY25207).

Three experienced IBD specialists, all members of the Chinese IBD Guidelines Development Committee, independently reviewed and graded each image according to the international MES criteria (Table 1). Only images for which all three experts assigned identical MES grades were included in the final dataset. Ultimately, 283 static high-resolution JPEG images (70.4% of the initial dataset) were retained. These images represented the following intestinal segments: Ileocecal region (47 images); ascending colon (49); transverse colon (51); descending colon (46); sigmoid colon (45); and rectum (45).

Table 1 Mayo Endoscopic Subscore.

Score	Description	Endoscopic findings
0	Normal or inactive disease	Normal mucosal appearance; intact vascular pattern; no friability, bleeding, or ulceration
1	Mild disease	Mild friability; decreased but visible vascular pattern; mild erythema; no erosions
2	Moderate disease	Marked erythema; absent vascular pattern; friability; erosions may be present
3	Severe disease	Spontaneous bleeding; ulceration; denuded mucosa; severe friability

Open in New Tab Full Size Table

To evaluate spectrum bias and assess model performance in ambiguous cases, we established a sensitivity analysis set derived from 119 images initially excluded due to lack of unanimous consensus. For these images a majority vote standard, defined as agreement by at least two of the three experts, was applied. Ten images without majority consensus (i.e. three different scores) were excluded, yielding a final set of 109 images for sensitivity analysis. Among these, 18 (16.5%) were MES 0, 42 (38.5%) were MES 1, 44 (40.4%) were MES 2, and 5 (4.6%) were MES 3.

Group composition

Physician group: This group comprised two independent attending physicians specializing in IBD. Each had more than 3 years of clinical experience in endoscopic evaluation and UC management.

Model group: The model group included five MLLMs: GPT-5 (OpenAI, United States; version 2025-0807), Gemini-2.5-Pro (Google, United States; version 2025-0605), Grok-4 (xAI, United States; version 2025-0709), GPT-4o (OpenAI, United States; version 2025-0326), and Qwen-VL-Max (Alibaba Cloud, China; version 2025-0526). The five MLLMs were selected to represent state-of-the-art approaches across distinct architectural lineages and development origins as of mid-2025. This cohort included global proprietary flagship models (GPT-5, GPT-4o, Gemini-2.5-Pro, and Grok-4) recognized for advanced multimodal reasoning along with a linguistically diverse model (Qwen-VL-Max) to evaluate performance across different training data distributions. Although specific architectures are proprietary, comparing these models enables evaluation of diagnostic accuracy in MES grading indirectly by depending on detecting subtle mucosal features such as vascular pattern obscuration. It is primarily driven by visual encoder resolution (fine-grained feature extraction) or by instruction-tuning effectiveness in adhering to standardized clinical scoring criteria. To ensure reproducibility we used fixed snapshot versions of each model accessed through their respective application programming interfaces (APIs). These versions are static and do not undergo continuous updating, thereby eliminating temporal performance drift.

Scoring procedure

The scoring process consisted of two sequential phases designed to evaluate the diagnostic performance of MLLMs under distinct informational conditions. To simulate a realistic clinical workflow and evaluate robustness to raw inputs, no manual preprocessing was performed. Instead, input standardization was ensured by submitting the same set of original high-resolution JPEG files to all five MLLMs.

Phase A represented a “random” condition without contextual information. All 283 endoscopic images were randomly and independently evaluated by the two physicians and the five MLLMs. For the model group prompt parameters were standardized. The system content was defined as “I need you to be a professional gastroenterologist specializing in IBD and endoscopic examination.” The task prompt was defined as “You will analyze colonoscopy images of patients with UC and classify them according to the internationally accepted MES. For each image, please assign a score from 0 to 3, following the original file order.” Each model independently assessed all 283 images with a new chat session initialized for each evaluation to minimize potential memory bias.

Phase B represented a “segment” condition with contextual information. Using the same image set as in phase A, each model was provided with additional contextual information specifying the intestinal segment. Specifically, the anatomical location was appended to the user prompt as a declarative statement [this image was captured from the (insert segment name)] before the grading instruction. This ensured that the visual features were interpreted within the specific anatomical context of that segment.

All models were accessed through their respective official APIs using the OpenAI-compatible Python 3.11 environment. Operational parameters were standardized (temperature = 1.00, max_tokens = 1500). Internet access was disabled, and no illustrative examples or reasoning traces (“chains of thought”) were provided to simulate a realistic zero-shot, task-framed inference environment. Each model was instructed to output only a single numerical MES score for each image to facilitate subsequent data aggregation and analysis.

Regulatory compliance and workflow integration

While this study evaluated static-image uploads via APIs, real-world clinical implementation requires on-premise model deployment to ensure patient data privacy and mandates formal regulatory approval as software as a medical device. Furthermore, the optimal clinical workflow will transition from the retrospective static-image analysis utilized in this benchmarking study to real-time video overlay integrated directly into active endoscopic procedures.

Evaluation metrics

All evaluations were benchmarked against the expert consensus, which served as the reference standard. The following quantitative metrics were calculated: Accuracy; precision; recall; macro-averaged F1 score (macro-F1); Cohen’s κ statistic (linearly weighted); mean absolute error (MAE); mean squared error (MSE); root mean squared error; standard deviation; coefficient of variation; and Pearson correlation coefficient (r). Stratified analyses were performed according to intestinal segment and MES grade. Interobserver agreement between the two physicians was quantified using Cohen’s κ statistic.

Statistical analysis

All analyses were conducted using the 283 endoscopic images for which consensus was reached among the three IBD experts. There were no missing data, and all interpretations represented paired observations.

In phase A, comparisons between models were conducted using the McNemar test for Accuracy and the Wilcoxon signed-rank test for MAE. To assess performance stability 95% confidence interval for macro-averaged F1 scores were estimated using bootstrap resampling with 1000 iterations. To account for multiple comparisons, the Bonferroni correction was applied. For comparisons between GPT-5 and the two human experts (physician 1 and physician 2), the significance threshold was set at α = 0.025 (0.05/2). For comparisons among the five MLLMs, the threshold was set at α = 0.01 (0.05/5). A two-tailed P value below these adjusted thresholds was considered indicative of statistical significance. Comparisons between model and physician performance were also assessed using the McNemar test for accuracy. A two-tailed P value < 0.05 was considered indicative of statistical significance. Analyses in phase B were primarily descriptive.

RESULTS

Overall performance comparison

The diagnostic accuracies of the two senior IBD physicians were 81.6% and 78.4%, respectively, demonstrating substantial interobserver agreement (κ = 0.692). As summarized in Table 2, among the five MLLMs, GPT-5 achieved the best overall performance (F1 = 0.720; κ = 0.609, 95% confidence interval not crossing 0; MAE = 0.297; P < 0.001; Figure 2). GPT-5 demonstrated lower diagnostic accuracy than physician 1 (71.7% vs 81.6%; P = 0.005), whereas no significant difference was observed compared with Physician 2 (78.4%; P = 0.068). These findings suggest that GPT-5 approaches expert-level performance but has not yet fully matched the diagnostic precision of the most experienced specialists.

Open in New Tab Full Size Figure Download Figure

Figure 2 Comprehensive performance analysis of five multimodal large language models. A: Overall ranking of five multimodal large language models based on composite performance scores; B: Comparison of core evaluation metrics (accuracy, F1 score, κ, precision, recall). Statistical tests: McNemar for accuracy, precision, recall, and F1 score; paired Wilcoxon for κ (P < 0.05 considered significant); C: Radar chart illustrating performance across five key dimensions; D: Heatmap summarizing six quantitative indicators (accuracy, precision, recall, F1 score, κ, and mean absolute error); E: Precision-recall scatter plot demonstrating the trade-off between precision and recall; F: Comparison of error metrics (mean absolute error and root mean squared error) where lower values indicate superior performance. MAE: Mean absolute error; RMSE: Root mean squared error.

Table 2 Overall performance comparison of five multimodal large language models in phase A.

	Gemini-2.5-Pro	Grok-4	GPT-4o	GPT-5	Qwen-VL-Max
Accuracy	0.502	0.417	0.594	0.717	0.353
Precision	0.639	0.552	0.635	0.731	0.488
Recall	0.502	0.417	0.594	0.717	0.353
F1 score (95%CI)	0.480 (0.452-0.574)	0.415 (0.369-0.481)	0.602 (0.522-0.648)	0.720 (0.665-0.773)	0.338 (0.263-0.382)
Cohen’s κ	0.343	0.239	0.449	0.608	0.133
MAE	0.611	0.770	0.452	0.297	0.767
MSE	0.866	1.194	0.544	0.325	1.021
RMSE	0.930	1.093	0.738	0.570	1.011
SD	0.781	0.918	0.697	0.564	0.953
CV	0.713	0.838	0.637	0.515	0.870
r value¹	0.681	0.590	0.777	0.852	0.478

¹Pearson correlation coefficient. MAE: Mean absolute error; MSE: Mean squared error; RMSE: Root mean squared error; CV: Coefficient of variation; SD: Standard deviation; CI: Confidence interval.

Open in New Tab Full Size Table

Performance across intestinal segments

Table 3 summarizes the mean F1 scores averaged across all five MLLMs for each intestinal segment: Sigmoid colon (0.631) > descending colon (0.574) approximately ascending colon (0.571) approximately transverse colon (0.567) > ileocecal region (0.493) > rectum (0.447). Overall, the models demonstrated the highest diagnostic consistency in the sigmoid colon, whereas the ileocecal region and rectum were the most challenging segments to classify accurately. As summarized in Table 4, GPT-5 consistently achieved the highest F1 scores across all intestinal segments, maintaining superior and stable performance throughout (Figure 3).

Open in New Tab Full Size Figure Download Figure

Figure 3 Segment-wise performance evaluation of multimodal large language models in phase B. A: Overall ranking of five multimodal large language models (Gemini-2.5-Pro, Grok-4, GPT-4o, GPT-5, and Qwen-VL-Max) based on weighted composite scores; B: Summary of key performance metrics (accuracy, F1 score, κ, precision, and recall); C and D: Heatmaps illustrating segment-specific accuracy and mean absolute error across six colonic regions; E: Bar chart showing average accuracy by segment with pairwise comparisons assessed using the McNemar test (P < 0.05); F: Variability analysis of accuracy across segments (lower values indicate greater stability; G and H: Radar plots depicting segment-level and overall performance profiles across the five models. MAE: Mean absolute error; ACC: Accuracy; Corr: Correlation.

Table 3 Overall performance comparison of five multimodal large language models in phase B.

	Gemini-2.5-Pro	Grok-4	GPT-4o	GPT-5	Qwen-VL-Max
Accuracy	0.541	0.420	0.583	0.710	0.406
Precision	0.621	0.480	0.590	0.724	0.515
Recall	0.541	0.420	0.583	0.710	0.406
F1 score (95%CI)	0.546 (0.490-0.611)	0.426 (0.363-0.481)	0.584 (0.502-0.618)	0.715 (0.664-0.768)	0.408 (0.313-0.439)
Cohen’s κ	0.381	0.229	0.425	0.596	0.190
MAE	0.565	0.753	0.473	0.300	0.707
MSE	0.799	1.141	0.587	0.322	0.940
RMSE	0.894	1.068	0.766	0.567	0.969
SD	0.843	0.973	0.755	0.566	0.951
CV	0.769	0.888	0.689	0.517	0.868
r value¹	0.644	0.573	0.749	0.850	0.499

¹Pearson correlation coefficient. MAE: Mean absolute error; MSE: Mean squared error; RMSE: Root mean squared error; CV: Coefficient of variation; SD: Standard deviation; CI: Confidence interval.

Open in New Tab Full Size Table

Table 4 Segment-wise performance comparison of five multimodal large language models in phase B.

Segments	Models	Accuracy	Precision	Recall	F1 score	Cohen’s κ	MAE	MSE	RMSE	SD	CV	r value¹
Ileocecal region	Gemini-2.5-Pro	0.426	0.752	0.426	0.478	0.226	0.787	1.213	1.101	0.77	1.508	0.606
	Grok-4	0.362	0.727	0.362	0.456	0.144	1.043	1.979	1.407	1.031	2.018	0.453
	GPT-4o	0.532	0.643	0.532	0.566	0.25	0.574	0.787	0.887	0.79	1.547	0.639
	GPT-5	0.596	0.718	0.596	0.625	0.356	0.426	0.468	0.684	0.593	1.162	0.751
	Qwen-VL-Max	0.298	0.547	0.298	0.339	0.058	0.957	1.468	1.212	0.956	1.872	0.315
Ascending colon	Gemini-2.5-Pro	0.49	0.739	0.49	0.514	0.309	0.673	1.041	1.02	0.766	1.211	0.665
	Grok-4	0.388	0.641	0.388	0.434	0.175	0.939	1.673	1.294	1.004	1.586	0.5
	GPT-4o	0.673	0.711	0.673	0.679	0.477	0.367	0.449	0.67	0.624	0.986	0.793
	GPT-5	0.796	0.812	0.796	0.798	0.657	0.204	0.204	0.452	0.435	0.687	0.894
	Qwen-VL-Max	0.367	0.727	0.367	0.432	0.144	0.776	1.061	1.03	0.857	1.355	0.513
Transverse colon	Gemini-2.5-Pro	0.549	0.72	0.549	0.573	0.394	0.471	0.51	0.714	0.621	0.621	0.803
	Grok-4	0.333	0.527	0.333	0.368	0.12	0.804	1.078	1.038	0.893	0.893	0.612
	GPT-4o	0.647	0.71	0.647	0.665	0.504	0.373	0.412	0.642	0.604	0.604	0.839
	GPT-5	0.784	0.818	0.784	0.796	0.688	0.216	0.216	0.464	0.454	0.454	0.9
	Qwen-VL-Max	0.392	0.543	0.392	0.431	0.155	0.725	0.961	0.98	0.956	0.956	0.495
Descending colon	Gemini-2.5-Pro	0.63	0.723	0.63	0.628	0.503	0.435	0.565	0.752	0.74	0.587	0.738
	Grok-4	0.413	0.432	0.413	0.416	0.211	0.739	1.087	1.043	1.034	0.82	0.518
	GPT-4o	0.63	0.626	0.63	0.615	0.491	0.435	0.565	0.752	0.72	0.571	0.777
	GPT-5	0.696	0.732	0.696	0.689	0.583	0.326	0.37	0.608	0.598	0.474	0.842
	Qwen-VL-Max	0.565	0.532	0.565	0.524	0.408	0.5	0.63	0.794	0.791	0.628	0.689
Sigmoid colon	Gemini-2.5-Pro	0.667	0.715	0.667	0.682	0.532	0.333	0.333	0.577	0.556	0.313	0.828
	Grok-4	0.689	0.696	0.689	0.683	0.57	0.311	0.311	0.558	0.558	0.314	0.855
	GPT-4o	0.556	0.571	0.556	0.541	0.378	0.489	0.578	0.76	0.739	0.416	0.686
	GPT-5	0.778	0.796	0.778	0.783	0.688	0.222	0.222	0.471	0.452	0.254	0.891
	Qwen-VL-Max	0.467	0.653	0.467	0.434	0.222	0.556	0.6	0.775	0.748	0.421	0.63
Rectum	Gemini-2.5-Pro	0.489	0.5	0.489	0.491	0.253	0.689	1.133	1.065	1.062	0.724	0.332
	Grok-4	0.356	0.387	0.356	0.35	0.125	0.644	0.644	0.803	0.788	0.537	0.665
	GPT-4o	0.444	0.502	0.444	0.453	0.238	0.622	0.756	0.869	0.865	0.59	0.606
	GPT-5	0.6	0.611	0.6	0.594	0.433	0.422	0.467	0.683	0.665	0.454	0.752
	Qwen-VL-Max	0.356	0.352	0.356	0.348	0.06	0.711	0.889	0.943	0.926	0.631	0.403

¹Pearson correlation coefficient. MAE: Mean absolute error; MSE: Mean squared error; RMSE: Root mean squared error; CV: Coefficient of variation; SD: Standard deviation.

Open in New Tab Full Size Table

Effect of segmental information on model performance

Inclusion of segmental information led to an overall increase in diagnostic accuracy by 1.55% across all models. The lower-performing models, Gemini-2.5-Pro and Qwen-VL-Max, demonstrated the largest relative gains (+3.89% and +5.30%, respectively), whereas Grok-4 demonstrated only a marginal gain (+0.354%). GPT-4o and GPT-5 exhibited negligible change, indicating performance stability regardless of contextual input.

Performance across MES grades

Model and physician performance across different MES grades is summarized in Table 5. Both senior IBD physicians achieved their highest diagnostic accuracy when identifying MES = 0 (mean accuracy = 99.1%) and their lowest when classifying MES = 1 (mean accuracy = 60.3%). Similarly, the MLLMs performed best on MES = 0 images (mean accuracy = 86.1%) and worst on MES = 1 (mean accuracy = 39.4%). The confusion-matrix distributions (Figures 4, 5, and 6) revealed that most misclassifications for both physicians and models occurred between adjacent grades (MES 0-1, 1-2, and 2-3). Notably, the models tended to overestimate disease severity (e.g., 0-1), whereas the physicians were more likely to underestimate it (e.g., 1-0). GPT-5 demonstrated accuracy comparable with Physician 1 in detecting moderate-to-severe inflammation (MES = 2/3, P = 0.710) but significantly outperformed physician 2 (P = 0.040).

Open in New Tab Full Size Figure Download Figure

Figure 4 Confusion matrices of expert scorings. A-D: Confusion matrices depicting the Mayo Endoscopic Subscore classification results for the two senior inflammatory bowel disease physicians. The horizontal axis represents the true Mayo Endoscopic Subscore grades, and the vertical axis indicates the predicted grades assigned by each physician. Color intensity corresponds to the number or proportion of images in each classification cell with deeper red tones denoting higher counts or frequencies.

Open in New Tab Full Size Figure Download Figure

Figure 5 Confusion matrices of multimodal large language models in phase A. A-J: Confusion matrices illustrating the Mayo Endoscopic Subscore classification outputs of five multimodal large language models: GPT-5 (A and B), GPT-4o (C and D), Gemini-2.5-Pro (E and F), Grok-4 (G and H), and Qwen-VL-Max (I and J). The horizontal axis denotes the true Mayo Endoscopic Subscore grades, and the vertical axis represents the predicted grades generated by each model. Color intensity corresponds to the number or proportion of images in each classification cell with deeper red tones indicating higher counts or frequencies.

Open in New Tab Full Size Figure Download Figure

Figure 6 Confusion matrices of multimodal large language models in phase B. A-J: Confusion matrices illustrating the Mayo Endoscopic Subscore classification outputs of five multimodal large language models under the segment-informed condition: GPT-5 (A and B), GPT-4o (C and D), Gemini-2.5-Pro (E and F), Grok-4 (G and H), and Qwen-VL-Max (I and J). The horizontal axis denotes the true Mayo Endoscopic Subscore grades, and the vertical axis represents the predicted grades generated by each model. Color intensity corresponds to the number or proportion of images in each classification cell, with deeper red tones indicating higher counts or frequencies.

Table 5 Diagnostic accuracy by Mayo Endoscopic Subscore grade among physicians and multimodal large language models.

MES grade	0	1	2	3
Expert 1	100	53.4	81.8	82.9
Expert 2	98.2	67.1	66.7	62.9
GPT-5	86.5	56.3	62.4	87.1
GPT-4o	85.3	45.2	53.8	52.2
Gemini-2.5-Pro	93.1	38.9	45.4	60.0
Grok-4	88.9	30.1	35.1	40.3
Qwen-VL-Max	76.7	26.3	33.3	38.5

MES: Mayo Endoscopic Subscore.

Open in New Tab Full Size Table

Sensitivity analysis on non-consensus images

Performance across all five MLLMs declined significantly when evaluated on the 109 non-consensus images in the sensitivity analysis set, confirming that these cases represent a more challenging disease spectrum. GPT-5 achieved an accuracy of 56.9% and a macro-F1 score of 0.529 (Supplementary Table 1), representing a marked decline from its performance on the consensus set (accuracy 71.7%, F1 0.720). Grok-4 showed a similar deterioration with accuracy decreasing to 31.2%. Notably, a large proportion of these ambiguous images were classified as MES 1 (38.5%) or MES 2 (40.4%), contributing to the lower agreement. This analysis quantified spectrum bias in the primary dataset and highlighted the difficulty AI models face when interpreting equivocal endoscopic features.

DISCUSSION

This study presented the first systematic evaluation of multiple state-of-the-art MLLMs for the specialized clinical task of MES grading directly compared with experienced IBD physicians. GPT-5 demonstrated the highest overall performance (Figure 2), achieving diagnostic accuracy comparable with that of experienced physicians. Model-level analyses revealed that the rectum and ileocecal region were the most challenging intestinal segments to interpret (Table 4 and Figure 3), and MES = 1 was the most difficult grade to classify accurately (Table 5). Providing segmental information markedly improved diagnostic accuracy for lower-performing models (Tables 3 and 4). To the best of our knowledge, this is the first study to evaluate the latest GPT-5 model in MES grading for UC, and the results demonstrated its superior performance compared with other contemporary MLLMs.

Substantial diagnostic agreement was observed among human physicians (κ = 0.692) although some interobserver variability persisted, consistent with previous reports[10,31]. This finding underscores the continued need for objective, reproducible tools to standardize endoscopic evaluation[7]. Compared with earlier CNN architectures, the MLLMs in this study achieved robust accuracy without task-specific fine-tuning. GPT-5 exhibited the highest overall performance (accuracy = 71.7%), confirming the potential of next-generation foundation models in endoscopic image interpretation. While these results remain slightly below the accuracy of some highly optimized CNN models[32-35], they highlight the key advantages of MLLMs, including strong generalization ability and minimal deployment requirements for clinical integration.

A previous study by Levartovsky et al[10] reported a diagnostic accuracy of 78.9% using GPT-4, a finding significantly higher than that observed in the present study. This discrepancy may partly reflect differences in dataset composition and evaluation difficulty. The images included in our dataset were likely more complex and diagnostically challenging, providing a more stringent evaluation of model robustness under real-world clinical conditions. In addition, their study[10] analyzed only 30 images, potentially introducing random variation or sampling bias. In contrast, our dataset comprised 283 images (nearly ten times larger), substantially reducing stochastic effects and improving the statistical reliability and generalizability of model performance comparisons.

Segment-wise analysis revealed that GPT-5 achieved scoring accuracy comparable with that of senior IBD physicians in the ascending and transverse colon, whereas the rectum and ileocecal region remained common interpretive bottlenecks for both human experts and models. The underlying causes likely relate to the distinctive anatomical and physiological characteristics of these regions[7]. In the rectum nonspecific erythema may arise from fecal irritation, prior enema procedures, or retroflexed observation, confounding grading accuracy. Similarly, the complex architecture of the ileocecal region and adjacent lymphoid tissue can obscure mucosal visualization and hinder reliable endoscopic interpretation.

Further stratified analysis demonstrated that incorporating additional anatomical contextual information generally enhanced model performance with the greatest improvements observed in lower-tier models. For instance, lymphoid hyperplasia in the ileocecal region can mimic mild inflammation (MES = 1); providing explicit segmental context allowed models to reference prior knowledge of benign regional features, thereby reducing misclassification[17]. In contrast, advanced models such as GPT-5 exhibited minimal benefit from additional contextual input, likely because extensive multimodal pre-training had already internalized segment-specific anatomical patterns. These findings suggest that the interpretative accuracy of GPT-5 is largely invariant to intestinal segment variability, underscoring its potential for robust deployment in dynamic, real-time endoscopic settings.

Stratified analysis by MES grade revealed that both GPT-5 and senior IBD physicians exhibited stable performance in recognizing the two extremes of disease activity (MES = 0 and MES = 3), whereas accuracy declined markedly for intermediate grades (MES = 1). This challenge likely reflects the inherent subjectivity of MES = 1, which is defined by ambiguous descriptors such as “mild erythema” and “blurred vascular pattern”. Notably, patients classified as MES = 1 have been shown to experience significantly worse clinical outcomes than those with MES = 0, influencing relapse risk and guiding therapeutic decisions[36]. These findings emphasize that future AI systems will require extensive training on large, diverse, and meticulously annotated datasets to more effectively capture and interpret these subtle, borderline endoscopic features.

Despite the strengths and clinical relevance of this study, several limitations should be considered. First, the data were derived from a single center with a relatively small sample size, potentially limiting statistical power and restricting generalizability. Multicenter studies with larger cohorts are required for external validation. Second, the analysis was confined to high-quality endoscopic images (Boston bowel preparation score > 5), introducing potential image-quality bias and possibly overestimating performance relative to real-world clinical practice where motion artifacts, residual stool, and mucus are frequently encountered. Future studies should include endoscopic images spanning a broad range of quality levels and perform stratified analyses to better reflect real-world clinical conditions. Third, all models were evaluated using a default temperature of 1.0 to assess performance under a more challenging, non-deterministic environment. Achieving accuracy comparable with that of senior clinicians under this condition (in the presence of random interference) suggests intrinsic robustness in handling complex medical classification tasks. However, this approach may introduce unnecessary stochasticity. Accordingly, future work will involve systematic hyperparameter tuning to identify optimal configurations and enhance diagnostic consistency for clinical deployment. Finally, limited model interpretability represents an important constraint. The commercial MLLMs evaluated do not currently provide visual explanatory outputs, such as attention maps or heatmaps, thereby limiting insight into the underlying rationale of model predictions and potentially hindering clinical trust and adoption.

All models evaluated in this study were general-purpose MLLMs, each offering distinct advantages in accessibility and ease of implementation. However, their performance remains constrained by several inherent limitations, including the relatively small size of available image datasets and reliance on static image inputs[37]. These limitations represent key targets for future model optimization and methodological advancement. First, targeted fine-tuning on large, high-quality, and expertly annotated endoscopic datasets is essential to enhance model capability in analyzing diagnostically challenging regions (such as the ileocecal region and rectum) and borderline inflammatory states (e.g., MES = 1). Second, because key indicators of disease activity, including mucosal friability and spontaneous hemorrhage, are often dynamic and episodic, static-image-based grading may underestimate or misclassify inflammatory severity. Incorporating full-length endoscopic video analysis would allow models to account for these transient features, thereby improving diagnostic accuracy and interobserver consistency. Third, true multimodal integration of endoscopic imagery with clinical data, including symptoms, laboratory markers, and histopathological findings, will enable a more comprehensive and clinically meaningful framework for disease activity assessment.

Optimized MLLMs hold substantial potential for clinical application. They could serve as educational tools helping physicians rapidly familiarize themselves with the MES scoring system. In clinical practice MLLMs may function as decision-support systems, offering real-time “second-opinion” guidance by highlighting suspicious mucosal regions during endoscopic examinations, thereby reducing missed lesions, minimizing interobserver variability, and improving procedural efficiency. Furthermore, in the context of large-scale clinical trials, MLLMs could act as centralized reviewers to enhance the objectivity, reproducibility, and consistency of endoscopic activity assessments.

CONCLUSION

MLLMs demonstrated promising potential for clinical translation in static-image-based MES grading. Incorporation of anatomical segment information served as an effective auxiliary strategy, particularly enhancing the accuracy of lower-performing models. Mild inflammation (MES = 1) and anatomically complex regions such as the rectum and ileocecal area remained common interpretive challenges for both human physicians and AI systems. Looking forward, targeted fine-tuning on large annotated datasets and comprehensive multimodal data integration are expected to further enhance MLLM performance, supporting their evolution into reliable tools for precision endoscopic diagnosis and physician training in UC.

References

1.	Voelker R. What Is Ulcerative Colitis? JAMA. 2024;331:716. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 133] [Cited by in RCA: 126] [Article Influence: 63.0] [Reference Citation Analysis (0)]

Mohammed Vashist N, Samaan M, Mosli MH, Parker CE, MacDonald JK, Nelson SA, Zou GY, Feagan BG, Khanna R, Jairath V. Endoscopic scoring indices for evaluation of disease activity in ulcerative colitis. Cochrane Database Syst Rev. 2018;1:CD011450. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 60] [Cited by in RCA: 98] [Article Influence: 12.3] [Reference Citation Analysis (0)]

Inflammatory Bowel Disease Group; Chinese Society of Gastroenterology; Chinese Medical Association; Inflammatory Bowel Disease Quality Control Center of China. 2023 Chinese national clinical practice guideline on diagnosis and management of ulcerative colitis. Chin Med J (Engl). 2024;137:1642-1646. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 21] [Reference Citation Analysis (1)]

Schroeder KW, Tremaine WJ, Ilstrup DM. Coated oral 5-aminosalicylic acid therapy for mildly to moderately active ulcerative colitis. A randomized study. N Engl J Med. 1987;317:1625-1629. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 2583] [Cited by in RCA: 2350] [Article Influence: 60.3] [Reference Citation Analysis (11)]

Yoon H, Jangi S, Dulai PS, Boland BS, Prokop LJ, Jairath V, Feagan BG, Sandborn WJ, Singh S. Incremental Benefit of Achieving Endoscopic and Histologic Remission in Patients With Ulcerative Colitis: A Systematic Review and Meta-Analysis. Gastroenterology. 2020;159:1262-1275.e7. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 194] [Cited by in RCA: 187] [Article Influence: 31.2] [Reference Citation Analysis (1)]

Shehab M, Alrashed F, Alsayegh A, Aldallal U, Ma C, Narula N, Jairath V, Singh S, Bessissow T. Comparative Efficacy of Biologics and Small Molecule in Ulcerative Colitis: A Systematic Review and Network Meta-analysis. Clin Gastroenterol Hepatol. 2025;23:250-262. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 5] [Cited by in RCA: 41] [Article Influence: 41.0] [Reference Citation Analysis (1)]

Fernandes SR, Pinto JSLD, Marques da Costa P, Correia L; GEDII. Disagreement Among Gastroenterologists Using the Mayo and Rutgeerts Endoscopic Scores. Inflamm Bowel Dis. 2018;24:254-260. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 14] [Cited by in RCA: 33] [Article Influence: 4.1] [Reference Citation Analysis (0)]

OpenAI; Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R, Babuschkin I, Balaji S, Balcom V, Baltescu P, Bao H, Bavarian M, Belgum J, Bello I, Berdine J, Bernadett-Shapiro G, Berner C, Bogdonoff L, Boiko O, Boyd M, Brakman AL, Brockman G, Brooks T, Brundage M, Button K, Cai T, Campbell R, Cann A, Carey B, Carlson C, Carmichael R, Chan B, Chang C, Chantzis F, Chen D, Chen S, Chen R, Chen J, Chen M, Chess B, Cho C, Chu C, Chung HW, Cummings D, Currier J, Dai Y, Decareaux C, Degry T, Deutsch N, Deville D, Dhar A, Dohan D, Dowling S, Dunning S, Ecoffet A, Eleti A, Eloundou T, Farhi D, Fedus L, Felix N, Fishman SP, Forte J, Fulford I, Gao L, Georges E, Gibson C, Goel V, Gogineni T, Goh G, Gontijo-Lopes R, Gordon J, Grafstein M, Gray S, Greene R, Gross J, Gu SS, Guo Y, Hallacy C, Han J, Harris J, He Y, Heaton M, Heidecke J, Hesse C, Hickey A, Hickey W, Hoeschele P, Houghton B, Hsu K, Hu S, Hu X, Huizinga J, Jain S, Jain S, Jang J, Jiang A, Jiang R, Jin H, Jin D, Jomoto S, Jonn B, Jun H, Kaftan T, Kaiser Ł, Kamali A, Kanitscheider I, Keskar NS, Khan T, Kilpatrick L, Kim JW, Kim C, Kim Y, Kirchner JH, Kiros J, Knight M, Kokotajlo D, Kondraciuk Ł, Kondrich A, Konstantinidis A, Kosic K, Krueger G, Kuo V, Lampe M, Lan I, Lee T, Leike J, Leung J, Levy D, Li CM, Lim R, Lin M, Lin S, Litwin M, Lopez T, Lowe R, Lue P, Makanju A, Malfacini K, Manning S, Markov T, Markovski Y, Martin B, Mayer K, Mayne A, McGrew B, McKinney SM, McLeavey C, McMillan P, McNeil J, Medina D, Mehta A, Menick J, Metz L, Mishchenko A, Mishkin P, Monaco V, Morikawa E, Mossing D, Mu T, Murati M, Murk O, Mély D, Nair A, Nakano R, Nayak R, Neelakantan A, Ngo R, Noh H, Ouyang L, O'Keefe C, Pachocki J, Paino A, Palermo J, Pantuliano A, Parascandolo G, Parish J, Parparita E, Passos A, Pavlov M, Peng A, Perelman A, Peres FDAB, Petrov M, Pinto HPDO, (Rai)Pokorny M, Pokrass M, Pong VH, Powell T, Power A, Power B, Proehl E, Puri R, Radford A, Rae J, Ramesh A, Raymond C, Real F, Rimbach K, Ross C, Rotsted B, Roussez H, Ryder N, Saltarelli M, Sanders T, Santurkar S, Sastry G, Schmidt H, Schnurr D, Schulman J, Selsam D, Sheppard K, Sherbakov T, Shieh J, Shoker S, Shyam P, Sidor S, Sigler E, Simens M, Sitkin J, Slama K, Sohl I, Sokolowsky B, Song Y, Staudacher N, Such FP, Summers N, Sutskever I, Tang J, Tezak N, Thompson MB, Tillet P, Tootoonchian A, Tseng E, Tuggle P, Turley N, Tworek J, Uribe JFC, Vallone A, Vijayvergiya A, Voss C, Wainwright C, Wang JJ, Wang A, Wang B, Ward J, Wei J, Weinmann C, Welihinda A, Welinder P, Weng J, Weng L, Wiethoff M, Willner D, Winter C, Wolrich S, Wong H, Workman L, Wu S, Wu J, Wu M, Xiao K, Xu T, Yoo S, Yu K, Yuan Q, Zaremba W, Zellers R, Zhang C, Zhang M, Zhao S, Zheng T, Zhuang J, Zhuk W, ZophB. GPT-4 Technical Report. 2023 Preprint. Available from: arXiv:2303.08774.

9.	Gemini Team Google. Gemini: A Family of Highly Capable Multimodal Models. 2023 Preprint. Available from: arXiv:2312.11805. [PubMed] [DOI]

10.

Levartovsky A, Albshesh A, Grinman A, Shachar E, Lahat A, Eliakim R, Kopylov U. Enhancing diagnostics: ChatGPT-4 performance in ulcerative colitis endoscopic assessment. Endosc Int Open. 2025;13:a25420943. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 2] [Cited by in RCA: 3] [Article Influence: 3.0] [Reference Citation Analysis (0)]

11.

Levartovsky A, Ben-Horin S, Kopylov U, Klang E, Barash Y. Towards AI-Augmented Clinical Decision-Making: An Examination of ChatGPT's Utility in Acute Ulcerative Colitis Presentations. Am J Gastroenterol. 2023;118:2283-2289. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 35] [Reference Citation Analysis (0)]

12.

Sciberras M, Farrugia Y, Gordon H, Furfaro F, Allocca M, Torres J, Arebi N, Fiorino G, Iacucci M, Verstockt B, Magro F, Katsanos K, Busuttil J, De Giovanni K, Fenech VA, Chetcuti Zammit S, Ellul P. Accuracy of Information given by ChatGPT for Patients with Inflammatory Bowel Disease in Relation to ECCO Guidelines. J Crohns Colitis. 2024;18:1215-1221. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 3] [Cited by in RCA: 39] [Article Influence: 19.5] [Reference Citation Analysis (0)]

13.

Shean R, Shah T, Pandiarajan A, Tang A, Bolo K, Nguyen V, Xu B. A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions. Sci Rep. 2025;15:23101. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 13] [Reference Citation Analysis (0)]

14.

Wu X, Cai G, Guo B, Ma L, Shao S, Yu J, Zheng Y, Wang L, Yang F. A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios. BMC Oral Health. 2025;25:1272. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 10] [Cited by in RCA: 15] [Article Influence: 15.0] [Reference Citation Analysis (0)]

15.

Sozer A, Sahin MC, Sozer B, Erol G, Tufek OY, Nernekli K, Demirtas Z, Celtikci E. Do LLMs Have 'the Eye' for MRI? Evaluating GPT-4o, Grok, and Gemini on Brain MRI Performance: First Evaluation of Grok in Medical Imaging and a Comparative Analysis. Diagnostics (Basel). 2025;15:1320. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 7] [Reference Citation Analysis (0)]

16.

Zhang Y, Han J, Chen H, Hu F, Huang Y, Tian G, Zhong D, Yang J. Deep learning-based fusion of nuclear segmentation features for microsatellite instability and tumor mutational burden prediction in digestive tract cancers: a multicenter validation study. Brief Bioinform. 2025;26:bbaf580. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 7] [Reference Citation Analysis (0)]

17.

Osada T, Ohkusa T, Yokoyama T, Shibuya T, Sakamoto N, Beppu K, Nagahara A, Otaka M, Ogihara T, Watanabe S. Comparison of several activity indices for the evaluation of endoscopic activity in UC: inter- and intraobserver consistency. Inflamm Bowel Dis. 2010;16:192-197. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 80] [Cited by in RCA: 73] [Article Influence: 4.6] [Reference Citation Analysis (3)]

18.

Stidham RW, Liu W, Bishu S, Rice MD, Higgins PDR, Zhu J, Nallamothu BK, Waljee AK. Performance of a Deep Learning Model vs Human Reviewers in Grading Endoscopic Disease Severity of Patients With Ulcerative Colitis. JAMA Netw Open. 2019;2:e193963. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 256] [Cited by in RCA: 205] [Article Influence: 29.3] [Reference Citation Analysis (6)]

19.

Shiku K, Nishimura K, Suehiro D, Tanaka K, Bise R. Ordinal Multiple-instance Learning for Ulcerative Colitis Severity Estimation with Selective Aggregated Transformer. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2025; Tucson, United States. IEEE Xplore, United States. [DOI] [Full Text]

20.	Dheivya I, Kumar GS. VSegNet – A Variant SegNet for Improving Segmentation Accuracy in Medical Images with Class Imbalance and Limited Data. Medinform. 2024;2:36-48. [PubMed] [DOI] [Full Text]

21.	Yang B, Xu S, Yin L, Liu C, Zheng W. Disparity estimation of stereo-endoscopic images using deep generative network. ICT Express. 2025;11:74-79. [PubMed] [DOI] [Full Text]

22.

Qin Y, Chang J, Li L, Wu M. Enhancing gastroenterology with multimodal learning: the role of large language model chatbots in digestive endoscopy. Front Med (Lausanne). 2025;12:1583514. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 13] [Cited by in RCA: 8] [Article Influence: 8.0] [Reference Citation Analysis (0)]

23.

Patel M, Gulati S, Iqbal F, Hayee B. Rapid development of accurate artificial intelligence scoring for colitis disease activity using applied data science techniques. Endosc Int Open. 2022;10:E539-E543. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 8] [Reference Citation Analysis (0)]

24.

Testoni SGG, Albertini Petroni G, Annunziata ML, Dell'Anna G, Puricelli M, Delogu C, Annese V. Artificial Intelligence in Inflammatory Bowel Disease Endoscopy. Diagnostics (Basel). 2025;15:905. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 7] [Reference Citation Analysis (1)]

25.

Nakase H, Hirano T, Wagatsuma K, Ichimiya T, Yamakawa T, Yokoyama Y, Hayashi Y, Hirayama D, Kazama T, Yoshii S, Yamano HO. Artificial intelligence-assisted endoscopy changes the definition of mucosal healing in ulcerative colitis. Dig Endosc. 2021;33:903-911. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 22] [Cited by in RCA: 20] [Article Influence: 4.0] [Reference Citation Analysis (0)]

26.	Yin S, Fu C, Zhao S, Li K, Sun X, Xu T, Chen E. A survey on multimodal large language models. Natl Sci Rev. 2024;11:nwae403. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 82] [Reference Citation Analysis (0)]

27.	Zhang J, Li Y, Fukuda T, Wang B. Urban safety perception assessments via integrating multimodal large language models with street view images. Cities. 2025;165:106122. [PubMed] [DOI] [Full Text]

28.	Qiu J, Yuan W, Lam K. The application of multimodal large language models in medicine. Lancet Reg Health West Pac. 2024;45:101048. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 10] [Reference Citation Analysis (0)]

29.

AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R, Sheikh J. Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. J Med Internet Res. 2024;26:e59505. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 111] [Reference Citation Analysis (0)]

30.

Lai EJ, Calderwood AH, Doros G, Fix OK, Jacobson BC. The Boston bowel preparation scale: a valid and reliable instrument for colonoscopy-oriented research. Gastrointest Endosc. 2009;69:620-625. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 1033] [Cited by in RCA: 1009] [Article Influence: 59.4] [Reference Citation Analysis (9)]

31.

Chang YY, Yang HP, Chen YY, Yen HH. Comparison of the performance between an AI-based vision transformer and human endoscopists in predicting the endoscopic and histologic activities of ulcerative colitis. Digit Health. 2026;12:20552076251412694. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 3] [Reference Citation Analysis (0)]

32.

Kuroki T, Maeda Y, Kudo SE, Ogata N, Iacucci M, Takishima K, Ide Y, Shibuya T, Semba S, Kawashima J, Kato S, Ogawa Y, Ichimasa K, Nakamura H, Hayashi T, Wakamura K, Miyachi H, Baba T, Nemoto T, Ohtsuka K, Misawa M. A novel artificial intelligence-assisted "vascular healing" diagnosis for prediction of future clinical relapse in patients with ulcerative colitis: a prospective cohort study (with video). Gastrointest Endosc. 2024;100:97-108. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 23] [Cited by in RCA: 23] [Article Influence: 11.5] [Reference Citation Analysis (1)]

33.

Ogata N, Maeda Y, Misawa M, Takenaka K, Takabayashi K, Iacucci M, Kuroki T, Takishima K, Sasabe K, Niimura Y, Kawashima J, Ogawa Y, Ichimasa K, Nakamura H, Matsudaira S, Sasanuma S, Hayashi T, Wakamura K, Miyachi H, Baba T, Mori Y, Ohtsuka K, Ogata H, Kudo SE. Artificial Intelligence-assisted Video Colonoscopy for Disease Monitoring of Ulcerative Colitis: A Prospective Study. J Crohns Colitis. 2025;19:jjae080. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 19] [Cited by in RCA: 22] [Article Influence: 22.0] [Reference Citation Analysis (0)]

34.

Takabayashi K, Kobayashi T, Matsuoka K, Levesque BG, Kawamura T, Tanaka K, Kadota T, Bise R, Uchida S, Kanai T, Ogata H. Artificial intelligence quantifying endoscopic severity of ulcerative colitis in gradation scale. Dig Endosc. 2024;36:582-590. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 2] [Cited by in RCA: 20] [Article Influence: 10.0] [Reference Citation Analysis (0)]

35.

Kuroki T, Maeda Y, Kudo SE, Ogata N, Takabayashi K, Takenaka K, Kawashima J, Kawabata Y, Iwasaki S, Shiina O, Morita Y, Kouyama Y, Sakurai T, Ogawa Y, Baba T, Mori Y, Iacucci M, Ogata H, Ohtsuka K, Misawa M. Combination of white-light imaging-based and narrow-band imaging-based artificial intelligence models during colonoscopy in patients with ulcerative colitis. J Crohns Colitis. 2025;19:jjaf014. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 1] [Cited by in RCA: 6] [Article Influence: 6.0] [Reference Citation Analysis (0)]

36.	Jin X, You Y, Ruan G, Zhou W, Li J, Li J. Deep mucosal healing in ulcerative colitis: how deep is better? Front Med (Lausanne). 2024;11:1429427. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 5] [Reference Citation Analysis (0)]

37.	Moradi M, Samwald M. Explaining Black-Box Models for Biomedical Text Classification. IEEE J Biomed Health Inform. 2021;25:3112-3120. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 3] [Cited by in RCA: 7] [Article Influence: 1.4] [Reference Citation Analysis (0)]

Footnotes

Peer review: Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Gastroenterology and hepatology

Country of origin: China

Peer-review report’s classification

Scientific quality: Grade A, Grade A, Grade B, Grade C, Grade C

Novelty: Grade A, Grade A, Grade B, Grade C, Grade C

Creativity or innovation: Grade A, Grade A, Grade B, Grade C, Grade C

Scientific significance: Grade A, Grade A, Grade B, Grade C, Grade C

P-Reviewer: Gugulothu D, PhD, Assistant Professor, India; Rizwan M, PhD, Pakistan; Zhou S, PhD, Postdoctoral Fellow, United States S-Editor: Lin C L-Editor: Filipodia P-Editor: Zhang YL