Published online Jun 28, 2026. doi: 10.3748/wjg.v32.i24.118690
Revised: February 17, 2026
Accepted: February 28, 2026
Published online: June 28, 2026
Processing time: 149 Days and 13.5 Hours
Ulcerative colitis (UC) is a chronic inflammatory disorder that typically affects adults aged 20-40 years and presents with hematochezia and diarrhea. Endoscopic evaluation using the Mayo Endoscopic Subscore (MES; 0-3) is standard for asse
To compare the diagnostic accuracy of five multimodal large language models (MLLMs) in MES grading of UC and to assess the consistency of their performance across colonic segments and MES grades.
We collected 402 endoscopic images covering the entire colon from patients with UC. Three experts independently graded all images according to MES criteria, and the 283 images on which all three agreed were used as the reference standard. Five MLLMs and two senior physicians then evaluated these images under two conditions: without segmental context and with anatomical segment information. Model and physician performance were compared, and stratified analyses were conducted by intestinal segment and MES grade.
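The unanimous-consensus step described above can be illustrated with a minimal sketch. The function name `unanimous_reference`, the image identifiers, and the toy grades are hypothetical and are not taken from the study; the sketch assumes only that each image carries three independent MES grades and is kept when all three are identical.

```python
from typing import Dict, Tuple


def unanimous_reference(grades: Dict[str, Tuple[int, int, int]]) -> Dict[str, int]:
    """Return the images on which all three raters assigned the same MES grade.

    `grades` maps a (hypothetical) image ID to the three independent MES
    grades (0-3). Images with any disagreement are excluded.
    """
    reference = {}
    for image_id, (a, b, c) in grades.items():
        if a == b == c:
            reference[image_id] = a
    return reference


# Toy data for illustration only; not study data.
toy_grades = {
    "img_001": (0, 0, 0),  # unanimous, kept
    "img_002": (1, 2, 1),  # disagreement, excluded
    "img_003": (3, 3, 3),  # unanimous, kept
}
reference = unanimous_reference(toy_grades)
```

Restricting the reference standard to unanimously graded images reduces label noise, at the cost of excluding the ambiguous cases on which raters disagreed.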
The diagnostic accuracies of the two inflammatory bowel disease physicians were 81.6% and 78.4%, respectively, with strong interobserver agreement (κ = 0.692). Among the MLLMs, GPT-5 achieved the highest overall performance (F1 = 0.720), with accuracy comparable to that of physician 2 (71.7% vs 78.4%, P = 0.068). The other four models performed significantly worse (all P < 0.001). The sigmoid colon was the most accurately assessed region (mean F1 = 0.682), whereas the rectum and ileocecal region were the most challenging (mean F1 = 0.447 and 0.493, respectively). Adding segmental information improved the accuracy of the lower-performing models. Both models and physicians showed their lowest accuracy for MES = 1, reflecting the subjective nature of grading mild disease activity.
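For readers less familiar with the metrics reported above, the following is a minimal sketch of how accuracy, F1, and Cohen's κ can be computed from MES grades. It rests on assumptions the abstract does not confirm: F1 is macro-averaged over the four MES grades and κ is unweighted. The function names and toy grades are hypothetical and do not reproduce the study's data or code.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of images whose predicted grade matches the reference grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int] = (0, 1, 2, 3)) -> float:
    """Unweighted mean of per-grade F1 scores (macro averaging is an assumption)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A grade with no true or predicted cases contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(rater_1: Sequence[int], rater_2: Sequence[int]) -> float:
    """Unweighted Cohen's kappa: agreement between two raters beyond chance."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical grade throughout
    return (observed - expected) / (1 - expected)


# Toy grades for illustration only; not study data.
reference = [0, 0, 1, 1, 2, 2, 3, 3]
predicted = [0, 0, 1, 2, 2, 2, 3, 1]

acc = accuracy(reference, predicted)       # 6 of 8 grades match
f1 = macro_f1(reference, predicted)
kappa = cohen_kappa(reference, predicted)
```

Because MES is an ordinal scale, a weighted κ that penalizes larger grade discrepancies more heavily is a common alternative; the unweighted form is used here only for simplicity.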
This study suggests that GPT-5 holds potential for static-image-based MES grading, but its performance varies across segments and grades and requires external validation and further optimization. Future work should aim to improve artificial intelligence-assisted endoscopic assessment through model tuning and multimodal approaches.
Core Tip: This single-center proof-of-concept study was the first to systematically compare five multimodal large language models for static-image-based Mayo Endoscopic Subscore grading in ulcerative colitis. The key innovation was the direct benchmarking against experienced inflammatory bowel disease physicians and the incorporation of segment-specific analyses that uncovered anatomical heterogeneity in model performance. Notably, GPT-5 achieved near-expert-level accuracy without specialized tuning, highlighting its potential role in standardized endoscopic assessment.