Retrospective Study
Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. Published by Baishideng Publishing Group Inc.
World J Gastroenterol. Jun 28, 2026; 32(24): 118690
Published online Jun 28, 2026. doi: 10.3748/wjg.v32.i24.118690
Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis
Xiao-Yi Zhao, Yue Shen, Jun-Jie He, Qun-Yan Zhou, Li-Sha Jiang, Zi-Ru Zhou, Fang-Mei An, Qiang Zhan, Jing Sun, Wei Feng
Xiao-Yi Zhao, Yue Shen, Qun-Yan Zhou, Li-Sha Jiang, Zi-Ru Zhou, Fang-Mei An, Qiang Zhan, Jing Sun, Department of Gastroenterology, The Affiliated Wuxi People’s Hospital of Nanjing Medical University, Wuxi 214023, Jiangsu Province, China
Jun-Jie He, State Key Laboratory of Epigenetic Regulation and Intervention, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
Wei Feng, Department of Information, The Affiliated Wuxi People’s Hospital of Nanjing Medical University, Wuxi 214023, Jiangsu Province, China
Co-first authors: Xiao-Yi Zhao and Yue Shen.
Co-corresponding authors: Jing Sun and Wei Feng.
Author contributions: Zhao XY and Shen Y performed the research and contributed equally to this study as co-first authors; He JJ analyzed the data and wrote the manuscript; Zhou QY, Jiang LS, Zhou ZR, An FM, and Zhan Q participated in parts of the research as physicians or experts and assisted in data collection; Sun J and Feng W designed the study, provided overall supervision throughout the research process, and contributed equally to this study as co-corresponding authors; all authors read and approved the final manuscript.
AI contribution statement: We used the AI tools ChatGPT and DeepL to polish the language during the preparation of this manuscript. All scientific content, ideas, and original drafting are entirely the work of the human authors. The AI tools were used exclusively for preliminary language polishing to improve the readability and grammar of the initial draft; they were not used for substantive writing assistance and did not participate in the study design, data analysis, or interpretation of the results. All intellectual and scientific contributions were made independently by the authors. None of the images or figures in the manuscript were generated by AI.
Supported by the Program of Jiangsu Branch of the National Clinical Research Center for Digestive Diseases, No. JSZX202301; Basic Research Program of Jiangsu, No. BK20231146; Key Program of Wuxi Medical Center, Nanjing Medical University, No. WMCM202305; Doctoral Talent Fund of Wuxi People’s Hospital, No. BSRC202408; and Scientific Research Project of Wuxi Municipal Health Commission, No. M202503.
Institutional review board statement: The study was reviewed and approved by the Ethics Committee of Wuxi People’s Hospital Affiliated to Nanjing Medical University (Approval No. KY25207).
Informed consent statement: This retrospective study involved minimal risk to participants and did not compromise their health or rights. Personal privacy and identifying information were fully protected, and obtaining informed consent was practically unfeasible. Accordingly, the ethics committee granted a waiver of informed consent.
Conflict-of-interest statement: There is no conflict of interest associated with the senior author or the other coauthors who contributed to this manuscript.
Data sharing statement: The statistical code and dataset are available from the corresponding author at fengwei@njmu.edu.cn.
Corresponding author: Wei Feng, PhD, Department of Information, The Affiliated Wuxi People’s Hospital of Nanjing Medical University, No. 299 Qingyang Road, Wuxi 214023, Jiangsu Province, China. fengwei@njmu.edu.cn
Received: January 12, 2026
Revised: February 17, 2026
Accepted: February 28, 2026
Published online: June 28, 2026
Processing time: 149 Days and 13.5 Hours
Abstract
BACKGROUND

Ulcerative colitis (UC) is a chronic inflammatory disorder that typically affects adults aged 20-40 years and presents with hematochezia and diarrhea. Endoscopic evaluation using the Mayo Endoscopic Subscore (MES; 0-3) is standard for assessing disease severity, but grading depends heavily on endoscopist experience and is prone to interobserver variability. Recent advances in multimodal large language models (MLLMs), such as GPT and Gemini, enable automated interpretation of medical images and text, offering potential for objective and reproducible MES grading. However, their accuracy and consistency across intestinal segments remain largely untested.

AIM

To compare the diagnostic accuracy of five MLLMs in UC MES evaluation and to assess consistency across segments and grades.

METHODS

We collected 402 endoscopic images from patients with UC covering the entire colon. Three experienced experts independently graded all images according to MES criteria, and the 283 images on which all three experts agreed were included as the reference standard. These images were evaluated by five MLLMs and two senior physicians under two conditions: without segmental context and with anatomical segmental information. Model and physician performance were compared, and stratified analyses were conducted by intestinal segment and MES grade.
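
As an illustration of the reference-standard step described above, the following minimal Python sketch keeps only images on which every expert assigned the same MES grade. It is not the authors' code: the function name `unanimous_reference`, the image identifiers, and the toy grades are all hypothetical and chosen only to show the filtering logic.

```python
def unanimous_reference(grades_by_image):
    """Return {image_id: grade} for images where all experts agree.

    grades_by_image maps an image id to a list of MES grades (0-3),
    one grade per expert. Images with any disagreement are dropped.
    """
    reference = {}
    for image_id, grades in grades_by_image.items():
        # A set with one element means every expert gave the same grade.
        if len(set(grades)) == 1:
            reference[image_id] = grades[0]
    return reference


# Hypothetical toy data: three expert grades per image (not study data).
toy_grades = {
    "img_001": [2, 2, 2],
    "img_002": [1, 2, 1],
    "img_003": [0, 0, 0],
    "img_004": [3, 3, 2],
}

reference = unanimous_reference(toy_grades)
print(reference)  # {'img_001': 2, 'img_003': 0}
```

Requiring unanimity trades sample size for label certainty: images with ambiguous appearance are excluded, which may make the retained set easier to grade than unselected clinical images.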

RESULTS

The diagnostic accuracies of the two inflammatory bowel disease physicians were 81.6% and 78.4%, respectively, with strong interobserver agreement (κ = 0.692). Among the MLLMs, GPT-5 achieved the highest overall performance (F1 = 0.720) with accuracy comparable with that of physician 2 (71.7% vs 78.4%, P = 0.068). Other models exhibited significantly lower performance (all P < 0.001). The sigmoid colon was the most accurately assessed region (mean F1 = 0.682), whereas the rectum and ileocecal region were the most challenging (mean F1 = 0.447 and 0.493, respectively). The addition of segmental information improved the accuracy of lower-performing models. Both models and physicians showed the lowest accuracy for MES = 1, reflecting the subjective nature of mild disease activity.
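
The metrics reported above (accuracy, F1, and Cohen's κ) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' analysis code: the abstract does not say how F1 was averaged across MES grades, so macro-averaging is assumed here, and the function names and toy grade lists are hypothetical.

```python
MES_GRADES = (0, 1, 2, 3)


def accuracy(y_true, y_pred):
    """Fraction of images whose predicted grade matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=MES_GRADES):
    """Unweighted mean of per-grade F1 scores (macro-averaging is assumed)."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the grade never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(rater_a, rater_b, labels=MES_GRADES):
    """Unweighted Cohen's kappa: agreement between two raters beyond chance."""
    a, b = list(rater_a), list(rater_b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal grade frequencies.
    expected = sum((a.count(g) / n) * (b.count(g) / n) for g in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy grades (not study data).
reference = [0, 0, 1, 1, 2, 2, 3, 3]
rater_a = [0, 0, 1, 2, 2, 2, 3, 3]
rater_b = [0, 1, 1, 1, 2, 3, 3, 3]

print(accuracy(reference, rater_a))               # 0.875
print(round(macro_f1(reference, rater_a), 3))
print(round(cohen_kappa(rater_a, rater_b), 2))    # 0.52
```

Because MES is an ordinal scale, a weighted κ that penalizes two-grade disagreements more than one-grade disagreements is a common alternative; the unweighted form is shown here only because it is the simplest.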

CONCLUSION

This study suggests that GPT-5 holds potential for static-image-based MES grading, but its performance varies across intestinal segments and MES grades and requires external validation and optimization. Future work will aim to improve artificial intelligence-assisted endoscopic assessment through model tuning and multimodal approaches.

Keywords: Ulcerative colitis; Mayo Endoscopic Subscore; Endoscopic assessment; Artificial intelligence; Multimodal large language models; Diagnostic accuracy

Core Tip: This single-center proof-of-concept study was the first to systematically compare five multimodal large language models for static-image-based Mayo Endoscopic Subscore grading in ulcerative colitis. The key innovation was the direct benchmarking against experienced inflammatory bowel disease physicians and the incorporation of segment-specific analyses that uncovered anatomical heterogeneity in model performance. Notably, GPT-5 achieved near-expert-level accuracy without specialized tuning, highlighting its potential role in standardized endoscopic assessment.
