Copyright: ©Author(s) 2026.
World J Gastroenterol. Jun 28, 2026; 32(24): 118690
Published online Jun 28, 2026. doi: 10.3748/wjg.118690
Table 1 Mayo Endoscopic Subscore
| Score | Description | Endoscopic findings |
| 0 | Normal or inactive disease | Normal mucosal appearance; intact vascular pattern; no friability, bleeding, or ulceration |
| 1 | Mild disease | Mild friability; decreased but visible vascular pattern; mild erythema; no erosions |
| 2 | Moderate disease | Marked erythema; absent vascular pattern; friability; erosions may be present |
| 3 | Severe disease | Spontaneous bleeding; ulceration; denuded mucosa; severe friability |
Table 2 Overall performance comparison of five multimodal large language models in phase A
| Metric | Gemini-2.5-Pro | Grok-4 | GPT-4o | GPT-5 | Qwen-VL-Max |
| Accuracy | 0.502 | 0.417 | 0.594 | 0.717 | 0.353 |
| Precision | 0.639 | 0.552 | 0.635 | 0.731 | 0.488 |
| Recall | 0.502 | 0.417 | 0.594 | 0.717 | 0.353 |
| F1 score (95%CI) | 0.480 (0.452-0.574) | 0.415 (0.369-0.481) | 0.602 (0.522-0.648) | 0.720 (0.665-0.773) | 0.338 (0.263-0.382) |
| Cohen’s κ | 0.343 | 0.239 | 0.449 | 0.608 | 0.133 |
| MAE | 0.611 | 0.770 | 0.452 | 0.297 | 0.767 |
| MSE | 0.866 | 1.194 | 0.544 | 0.325 | 1.021 |
| RMSE | 0.930 | 1.093 | 0.738 | 0.570 | 1.011 |
| SD | 0.781 | 0.918 | 0.697 | 0.564 | 0.953 |
| CV | 0.713 | 0.838 | 0.637 | 0.515 | 0.870 |
| r value¹ | 0.681 | 0.590 | 0.777 | 0.852 | 0.478 |
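The MSE and RMSE rows in Table 2 should be related by RMSE = √MSE. The following minimal sketch checks that relationship using the values copied from the table. The tolerance of 0.001 is an assumption, chosen to absorb rounding of both rows to three decimals, and the names `table2` and `rmse_gap` are illustrative only.

```python
import math

# (MSE, RMSE) pairs copied from Table 2 (phase A).
table2 = {
    "Gemini-2.5-Pro": (0.866, 0.930),
    "Grok-4": (1.194, 1.093),
    "GPT-4o": (0.544, 0.738),
    "GPT-5": (0.325, 0.570),
    "Qwen-VL-Max": (1.021, 1.011),
}


def rmse_gap(mse: float, rmse: float) -> float:
    """Absolute difference between sqrt(MSE) and the reported RMSE."""
    return abs(math.sqrt(mse) - rmse)


# Gap per model; values below the assumed tolerance are consistent
# with rounding to three decimal places.
gaps = {model: rmse_gap(mse, rmse) for model, (mse, rmse) in table2.items()}

for model, gap in gaps.items():
    print(f"{model}: |sqrt(MSE) - RMSE| = {gap:.4f}")
```

The same check can be applied to the phase B values by swapping in the MSE and RMSE rows of Table 3.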
Table 3 Overall performance comparison of five multimodal large language models in phase B
| Metric | Gemini-2.5-Pro | Grok-4 | GPT-4o | GPT-5 | Qwen-VL-Max |
| Accuracy | 0.541 | 0.420 | 0.583 | 0.710 | 0.406 |
| Precision | 0.621 | 0.480 | 0.590 | 0.724 | 0.515 |
| Recall | 0.541 | 0.420 | 0.583 | 0.710 | 0.406 |
| F1 score (95%CI) | 0.546 (0.490-0.611) | 0.426 (0.363-0.481) | 0.584 (0.502-0.618) | 0.715 (0.664-0.768) | 0.408 (0.313-0.439) |
| Cohen’s κ | 0.381 | 0.229 | 0.425 | 0.596 | 0.190 |
| MAE | 0.565 | 0.753 | 0.473 | 0.300 | 0.707 |
| MSE | 0.799 | 1.141 | 0.587 | 0.322 | 0.940 |
| RMSE | 0.894 | 1.068 | 0.766 | 0.567 | 0.969 |
| SD | 0.843 | 0.973 | 0.755 | 0.566 | 0.951 |
| CV | 0.769 | 0.888 | 0.689 | 0.517 | 0.868 |
| r value¹ | 0.644 | 0.573 | 0.749 | 0.850 | 0.499 |
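In Tables 2 and 3 the Recall row is identical to the Accuracy row for every model. One explanation, which the source does not confirm, is that recall was averaged across MES grades with weights proportional to each grade's support. Under that weighting, recall reduces algebraically to overall accuracy. The sketch below illustrates the identity. The labels in `y_true` and `y_pred` are hypothetical toy data, not study data, and the function names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-grade recall, averaged with weights equal to each grade's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        total += (n / len(y_true)) * (hits / n)
    return total


# Hypothetical MES grades (0-3) for eight images; NOT taken from the study.
y_true = [0, 0, 1, 1, 1, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 3]

print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

Because each grade's weight n/N cancels the denominator of its recall, the sum collapses to (total correct)/N, so the two functions agree up to floating-point error on any input.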
Table 4 Segment-wise performance comparison of five multimodal large language models in phase B
| Segments | Models | Accuracy | Precision | Recall | F1 score | Cohen’s κ | MAE | MSE | RMSE | SD | CV | r value¹ |
| Ileocecal region | Gemini-2.5-Pro | 0.426 | 0.752 | 0.426 | 0.478 | 0.226 | 0.787 | 1.213 | 1.101 | 0.77 | 1.508 | 0.606 |
| Ileocecal region | Grok-4 | 0.362 | 0.727 | 0.362 | 0.456 | 0.144 | 1.043 | 1.979 | 1.407 | 1.031 | 2.018 | 0.453 |
| Ileocecal region | GPT-4o | 0.532 | 0.643 | 0.532 | 0.566 | 0.25 | 0.574 | 0.787 | 0.887 | 0.79 | 1.547 | 0.639 |
| Ileocecal region | GPT-5 | 0.596 | 0.718 | 0.596 | 0.625 | 0.356 | 0.426 | 0.468 | 0.684 | 0.593 | 1.162 | 0.751 |
| Ileocecal region | Qwen-VL-Max | 0.298 | 0.547 | 0.298 | 0.339 | 0.058 | 0.957 | 1.468 | 1.212 | 0.956 | 1.872 | 0.315 |
| Ascending colon | Gemini-2.5-Pro | 0.49 | 0.739 | 0.49 | 0.514 | 0.309 | 0.673 | 1.041 | 1.02 | 0.766 | 1.211 | 0.665 |
| Ascending colon | Grok-4 | 0.388 | 0.641 | 0.388 | 0.434 | 0.175 | 0.939 | 1.673 | 1.294 | 1.004 | 1.586 | 0.5 |
| Ascending colon | GPT-4o | 0.673 | 0.711 | 0.673 | 0.679 | 0.477 | 0.367 | 0.449 | 0.67 | 0.624 | 0.986 | 0.793 |
| Ascending colon | GPT-5 | 0.796 | 0.812 | 0.796 | 0.798 | 0.657 | 0.204 | 0.204 | 0.452 | 0.435 | 0.687 | 0.894 |
| Ascending colon | Qwen-VL-Max | 0.367 | 0.727 | 0.367 | 0.432 | 0.144 | 0.776 | 1.061 | 1.03 | 0.857 | 1.355 | 0.513 |
| Transverse colon | Gemini-2.5-Pro | 0.549 | 0.72 | 0.549 | 0.573 | 0.394 | 0.471 | 0.51 | 0.714 | 0.621 | 0.621 | 0.803 |
| Transverse colon | Grok-4 | 0.333 | 0.527 | 0.333 | 0.368 | 0.12 | 0.804 | 1.078 | 1.038 | 0.893 | 0.893 | 0.612 |
| Transverse colon | GPT-4o | 0.647 | 0.71 | 0.647 | 0.665 | 0.504 | 0.373 | 0.412 | 0.642 | 0.604 | 0.604 | 0.839 |
| Transverse colon | GPT-5 | 0.784 | 0.818 | 0.784 | 0.796 | 0.688 | 0.216 | 0.216 | 0.464 | 0.454 | 0.454 | 0.9 |
| Transverse colon | Qwen-VL-Max | 0.392 | 0.543 | 0.392 | 0.431 | 0.155 | 0.725 | 0.961 | 0.98 | 0.956 | 0.956 | 0.495 |
| Descending colon | Gemini-2.5-Pro | 0.63 | 0.723 | 0.63 | 0.628 | 0.503 | 0.435 | 0.565 | 0.752 | 0.74 | 0.587 | 0.738 |
| Descending colon | Grok-4 | 0.413 | 0.432 | 0.413 | 0.416 | 0.211 | 0.739 | 1.087 | 1.043 | 1.034 | 0.82 | 0.518 |
| Descending colon | GPT-4o | 0.63 | 0.626 | 0.63 | 0.615 | 0.491 | 0.435 | 0.565 | 0.752 | 0.72 | 0.571 | 0.777 |
| Descending colon | GPT-5 | 0.696 | 0.732 | 0.696 | 0.689 | 0.583 | 0.326 | 0.37 | 0.608 | 0.598 | 0.474 | 0.842 |
| Descending colon | Qwen-VL-Max | 0.565 | 0.532 | 0.565 | 0.524 | 0.408 | 0.5 | 0.63 | 0.794 | 0.791 | 0.628 | 0.689 |
| Sigmoid colon | Gemini-2.5-Pro | 0.667 | 0.715 | 0.667 | 0.682 | 0.532 | 0.333 | 0.333 | 0.577 | 0.556 | 0.313 | 0.828 |
| Sigmoid colon | Grok-4 | 0.689 | 0.696 | 0.689 | 0.683 | 0.57 | 0.311 | 0.311 | 0.558 | 0.558 | 0.314 | 0.855 |
| Sigmoid colon | GPT-4o | 0.556 | 0.571 | 0.556 | 0.541 | 0.378 | 0.489 | 0.578 | 0.76 | 0.739 | 0.416 | 0.686 |
| Sigmoid colon | GPT-5 | 0.778 | 0.796 | 0.778 | 0.783 | 0.688 | 0.222 | 0.222 | 0.471 | 0.452 | 0.254 | 0.891 |
| Sigmoid colon | Qwen-VL-Max | 0.467 | 0.653 | 0.467 | 0.434 | 0.222 | 0.556 | 0.6 | 0.775 | 0.748 | 0.421 | 0.63 |
| Rectum | Gemini-2.5-Pro | 0.489 | 0.5 | 0.489 | 0.491 | 0.253 | 0.689 | 1.133 | 1.065 | 1.062 | 0.724 | 0.332 |
| Rectum | Grok-4 | 0.356 | 0.387 | 0.356 | 0.35 | 0.125 | 0.644 | 0.644 | 0.803 | 0.788 | 0.537 | 0.665 |
| Rectum | GPT-4o | 0.444 | 0.502 | 0.444 | 0.453 | 0.238 | 0.622 | 0.756 | 0.869 | 0.865 | 0.59 | 0.606 |
| Rectum | GPT-5 | 0.6 | 0.611 | 0.6 | 0.594 | 0.433 | 0.422 | 0.467 | 0.683 | 0.665 | 0.454 | 0.752 |
| Rectum | Qwen-VL-Max | 0.356 | 0.352 | 0.356 | 0.348 | 0.06 | 0.711 | 0.889 | 0.943 | 0.926 | 0.631 | 0.403 |
Table 5 Diagnostic accuracy (%) by Mayo Endoscopic Subscore grade among physicians and multimodal large language models
| MES grade | 0 | 1 | 2 | 3 |
| Expert 1 | 100 | 53.4 | 81.8 | 82.9 |
| Expert 2 | 98.2 | 67.1 | 66.7 | 62.9 |
| GPT-5 | 86.5 | 56.3 | 62.4 | 87.1 |
| GPT-4o | 85.3 | 45.2 | 53.8 | 52.2 |
| Gemini-2.5-Pro | 93.1 | 38.9 | 45.4 | 60.0 |
| Grok-4 | 88.9 | 30.1 | 35.1 | 40.3 |
| Qwen-VL-Max | 76.7 | 26.3 | 33.3 | 38.5 |
- Citation: Zhao XY, Shen Y, He JJ, Zhou QY, Jiang LS, Zhou ZR, An FM, Zhan Q, Sun J, Feng W. Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis. World J Gastroenterol 2026; 32(24): 118690
- URL: https://www.wjgnet.com/1007-9327/full/v32/i24/118690.htm
- DOI: https://dx.doi.org/10.3748/wjg.118690