Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis

doi:10.3748/wjg.118690

Publication Name: World Journal of Gastroenterology
ISSN: 1007-9327
Publisher: Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Retrospective Study

Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.

World J Gastroenterol. Jun 28, 2026; 32(24): 118690
Published online Jun 28, 2026. doi: 10.3748/wjg.118690

Table 1 Mayo Endoscopic Subscore

Score	Description	Endoscopic findings
0	Normal or inactive disease	Normal mucosal appearance; intact vascular pattern; no friability, bleeding, or ulceration
1	Mild disease	Mild friability; decreased but visible vascular pattern; mild erythema; no erosions
2	Moderate disease	Marked erythema; absent vascular pattern; friability; erosions may be present
3	Severe disease	Spontaneous bleeding; ulceration; denuded mucosa; severe friability

Table 2 Overall performance comparison of five multimodal large language models in phase A

Metric	Gemini-2.5-Pro	Grok-4	GPT-4o	GPT-5	Qwen-VL-Max
Accuracy	0.502	0.417	0.594	0.717	0.353
Precision	0.639	0.552	0.635	0.731	0.488
Recall	0.502	0.417	0.594	0.717	0.353
F1 score (95%CI)	0.480 (0.452-0.574)	0.415 (0.369-0.481)	0.602 (0.522-0.648)	0.720 (0.665-0.773)	0.338 (0.263-0.382)
Cohen’s κ	0.343	0.239	0.449	0.608	0.133
MAE	0.611	0.770	0.452	0.297	0.767
MSE	0.866	1.194	0.544	0.325	1.021
RMSE	0.930	1.093	0.738	0.570	1.011
SD	0.781	0.918	0.697	0.564	0.953
CV	0.713	0.838	0.637	0.515	0.870
r value¹	0.681	0.590	0.777	0.852	0.478

¹Pearson correlation coefficient. MAE: Mean absolute error; MSE: Mean squared error; RMSE: Root mean squared error; CV: Coefficient of variation; SD: Standard deviation; CI: Confidence interval.
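
The MSE and RMSE rows of Table 2 should agree up to rounding, since RMSE is by definition the square root of MSE. The following is a minimal sketch of that consistency check using the values printed in the table; the names `table2` and `rmse_gaps` are illustrative and do not come from the article.

```python
from math import sqrt

# (MSE, RMSE) pairs copied from Table 2 (phase A).
table2 = {
    "Gemini-2.5-Pro": (0.866, 0.930),
    "Grok-4": (1.194, 1.093),
    "GPT-4o": (0.544, 0.738),
    "GPT-5": (0.325, 0.570),
    "Qwen-VL-Max": (1.021, 1.011),
}

def rmse_gaps(rows):
    """Absolute difference between sqrt(MSE) and the reported RMSE per model."""
    return {model: abs(sqrt(mse) - rmse) for model, (mse, rmse) in rows.items()}

gaps = rmse_gaps(table2)
for model, gap in gaps.items():
    print(f"{model}: |sqrt(MSE) - RMSE| = {gap:.4f}")
```

Every gap is below 0.001, which is what three-decimal rounding of both rows would allow.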

Table 3 Overall performance comparison of five multimodal large language models in phase B

Metric	Gemini-2.5-Pro	Grok-4	GPT-4o	GPT-5	Qwen-VL-Max
Accuracy	0.541	0.420	0.583	0.710	0.406
Precision	0.621	0.480	0.590	0.724	0.515
Recall	0.541	0.420	0.583	0.710	0.406
F1 score (95%CI)	0.546 (0.490-0.611)	0.426 (0.363-0.481)	0.584 (0.502-0.618)	0.715 (0.664-0.768)	0.408 (0.313-0.439)
Cohen’s κ	0.381	0.229	0.425	0.596	0.190
MAE	0.565	0.753	0.473	0.300	0.707
MSE	0.799	1.141	0.587	0.322	0.940
RMSE	0.894	1.068	0.766	0.567	0.969
SD	0.843	0.973	0.755	0.566	0.951
CV	0.769	0.888	0.689	0.517	0.868
r value¹	0.644	0.573	0.749	0.850	0.499

¹Pearson correlation coefficient. MAE: Mean absolute error; MSE: Mean squared error; RMSE: Root mean squared error; CV: Coefficient of variation; SD: Standard deviation; CI: Confidence interval.
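
For readers who want to see how agreement and error metrics of this kind are typically derived from paired grades, here is a small self-contained sketch. It is not the authors' pipeline: the function name `agreement_metrics` and the toy labels are hypothetical, and the kappa shown is the unweighted form, which is an assumption because the tables do not say whether a weighted variant was used.

```python
from collections import Counter
from math import sqrt

def agreement_metrics(truth, pred):
    """Accuracy, unweighted Cohen's kappa, MAE, MSE and RMSE for integer grades."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("truth and pred must be non-empty and of equal length")
    n = len(truth)
    accuracy = sum(t == p for t, p in zip(truth, pred)) / n
    # Chance agreement from the marginal grade frequencies of each rater.
    t_counts, p_counts = Counter(truth), Counter(pred)
    expected = sum(t_counts[g] * p_counts[g] for g in t_counts) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 1.0
    # Signed grade errors; MES is ordinal, so distance between grades is meaningful.
    errors = [p - t for t, p in zip(truth, pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"accuracy": accuracy, "kappa": kappa, "mae": mae, "mse": mse, "rmse": sqrt(mse)}

# Hypothetical toy grades (0-3), not data from the study.
truth = [0, 0, 1, 1, 2, 2, 3, 3]
pred = [0, 1, 1, 1, 2, 3, 3, 3]
metrics = agreement_metrics(truth, pred)
print(metrics)
```

On the toy grades, six of eight predictions match, and both misses are off by one grade, so MAE and MSE coincide.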

Table 4 Segment-wise performance comparison of five multimodal large language models in phase B

Segments	Models	Accuracy	Precision	Recall	F1 score	Cohen’s κ	MAE	MSE	RMSE	SD	CV	r value¹
Ileocecal region	Gemini-2.5-Pro	0.426	0.752	0.426	0.478	0.226	0.787	1.213	1.101	0.77	1.508	0.606
	Grok-4	0.362	0.727	0.362	0.456	0.144	1.043	1.979	1.407	1.031	2.018	0.453
	GPT-4o	0.532	0.643	0.532	0.566	0.25	0.574	0.787	0.887	0.79	1.547	0.639
	GPT-5	0.596	0.718	0.596	0.625	0.356	0.426	0.468	0.684	0.593	1.162	0.751
	Qwen-VL-Max	0.298	0.547	0.298	0.339	0.058	0.957	1.468	1.212	0.956	1.872	0.315
Ascending colon	Gemini-2.5-Pro	0.49	0.739	0.49	0.514	0.309	0.673	1.041	1.02	0.766	1.211	0.665
	Grok-4	0.388	0.641	0.388	0.434	0.175	0.939	1.673	1.294	1.004	1.586	0.5
	GPT-4o	0.673	0.711	0.673	0.679	0.477	0.367	0.449	0.67	0.624	0.986	0.793
	GPT-5	0.796	0.812	0.796	0.798	0.657	0.204	0.204	0.452	0.435	0.687	0.894
	Qwen-VL-Max	0.367	0.727	0.367	0.432	0.144	0.776	1.061	1.03	0.857	1.355	0.513
Transverse colon	Gemini-2.5-Pro	0.549	0.72	0.549	0.573	0.394	0.471	0.51	0.714	0.621	0.621	0.803
	Grok-4	0.333	0.527	0.333	0.368	0.12	0.804	1.078	1.038	0.893	0.893	0.612
	GPT-4o	0.647	0.71	0.647	0.665	0.504	0.373	0.412	0.642	0.604	0.604	0.839
	GPT-5	0.784	0.818	0.784	0.796	0.688	0.216	0.216	0.464	0.454	0.454	0.9
	Qwen-VL-Max	0.392	0.543	0.392	0.431	0.155	0.725	0.961	0.98	0.956	0.956	0.495
Descending colon	Gemini-2.5-Pro	0.63	0.723	0.63	0.628	0.503	0.435	0.565	0.752	0.74	0.587	0.738
	Grok-4	0.413	0.432	0.413	0.416	0.211	0.739	1.087	1.043	1.034	0.82	0.518
	GPT-4o	0.63	0.626	0.63	0.615	0.491	0.435	0.565	0.752	0.72	0.571	0.777
	GPT-5	0.696	0.732	0.696	0.689	0.583	0.326	0.37	0.608	0.598	0.474	0.842
	Qwen-VL-Max	0.565	0.532	0.565	0.524	0.408	0.5	0.63	0.794	0.791	0.628	0.689
Sigmoid colon	Gemini-2.5-Pro	0.667	0.715	0.667	0.682	0.532	0.333	0.333	0.577	0.556	0.313	0.828
	Grok-4	0.689	0.696	0.689	0.683	0.57	0.311	0.311	0.558	0.558	0.314	0.855
	GPT-4o	0.556	0.571	0.556	0.541	0.378	0.489	0.578	0.76	0.739	0.416	0.686
	GPT-5	0.778	0.796	0.778	0.783	0.688	0.222	0.222	0.471	0.452	0.254	0.891
	Qwen-VL-Max	0.467	0.653	0.467	0.434	0.222	0.556	0.6	0.775	0.748	0.421	0.63
Rectum	Gemini-2.5-Pro	0.489	0.5	0.489	0.491	0.253	0.689	1.133	1.065	1.062	0.724	0.332
	Grok-4	0.356	0.387	0.356	0.35	0.125	0.644	0.644	0.803	0.788	0.537	0.665
	GPT-4o	0.444	0.502	0.444	0.453	0.238	0.622	0.756	0.869	0.865	0.59	0.606
	GPT-5	0.6	0.611	0.6	0.594	0.433	0.422	0.467	0.683	0.665	0.454	0.752
	Qwen-VL-Max	0.356	0.352	0.356	0.348	0.06	0.711	0.889	0.943	0.926	0.631	0.403

¹Pearson correlation coefficient. MAE: Mean absolute error; MSE: Mean squared error; RMSE: Root mean squared error; CV: Coefficient of variation; SD: Standard deviation.
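
The footnote expands CV only as "coefficient of variation" and does not say which mean serves as the denominator. One way to probe this is to divide SD by CV for each model within a segment: if the ratio is nearly constant, that is consistent with CV being SD divided by a mean shared across models in that segment. The sketch below does this for the ileocecal rows; the interpretation is an inference from the printed numbers, not something the source states, and `implied_means` is an illustrative name.

```python
# (SD, CV) pairs copied from the ileocecal rows of Table 4.
ileocecal = {
    "Gemini-2.5-Pro": (0.77, 1.508),
    "Grok-4": (1.031, 2.018),
    "GPT-4o": (0.79, 1.547),
    "GPT-5": (0.593, 1.162),
    "Qwen-VL-Max": (0.956, 1.872),
}

def implied_means(rows):
    """Denominator implied by CV = SD / mean, i.e. SD / CV, per model."""
    return {model: sd / cv for model, (sd, cv) in rows.items()}

means = implied_means(ileocecal)
spread = max(means.values()) - min(means.values())
for model, m in means.items():
    print(f"{model}: SD / CV = {m:.4f}")
print(f"spread across models = {spread:.4f}")
```

The ratios cluster tightly around 0.51, so the CV values in this segment appear to share one denominator regardless of model.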

Table 5 Diagnostic accuracy (%) by Mayo Endoscopic Subscore grade among physicians and multimodal large language models

MES grade	0	1	2	3
Expert 1	100	53.4	81.8	82.9
Expert 2	98.2	67.1	66.7	62.9
GPT-5	86.5	56.3	62.4	87.1
GPT-4o	85.3	45.2	53.8	52.2
Gemini-2.5-Pro	93.1	38.9	45.4	60.0
Grok-4	88.9	30.1	35.1	40.3
Qwen-VL-Max	76.7	26.3	33.3	38.5

MES: Mayo Endoscopic Subscore.

Citation: Zhao XY, Shen Y, He JJ, Zhou QY, Jiang LS, Zhou ZR, An FM, Zhan Q, Sun J, Feng W. Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis. World J Gastroenterol 2026; 32(24): 118690
URL: https://www.wjgnet.com/1007-9327/full/v32/i24/118690.htm
DOI: https://dx.doi.org/10.3748/wjg.118690
