Published online Jun 28, 2026. doi: 10.3748/wjg.118690
Revised: February 17, 2026
Accepted: February 28, 2026
Published online: June 28, 2026
Processing time: 154 Days and 4.5 Hours
Ulcerative colitis (UC) is a chronic inflammatory disorder that typically affects adults aged 20-40 years and presents with hematochezia and diarrhea. Endoscopic evaluation using the Mayo Endoscopic Subscore (MES; 0-3) is standard for asse
To compare the diagnostic accuracy of five MLLMs in UC MES evaluation and to assess consistency across segments and grades.
We collected 402 endoscopic images from patients with UC covering the entire colon. Three experienced experts independently graded all images according to MES criteria, and 283 images with a unanimous consensus were included as the reference standard. These images were evaluated by five MLLMs and two senior physicians under two conditions: Without segmental context and with anatomical segmental information. Model and physician performance were compared, and stratified analyses were conducted by intestinal segment and MES grade.
The diagnostic accuracies of the two inflammatory bowel disease physicians were 81.6% and 78.4%, respectively, with strong interobserver agreement (κ = 0.692). Among the MLLMs, GPT-5 achieved the highest overall performance (F1 = 0.720) with accuracy comparable with that of physician 2 (71.7% vs 78.4%, P = 0.068). Other models exhibited significantly lower performance (all P < 0.001). The sigmoid colon was the most accurately assessed region (mean F1 = 0.682), whereas the rectum and ileocecal region were the most challenging (0.447 and 0.493, respectively). The addition of segmental information improved the accuracy of lower-performing models. Both models and physicians showed the lowest accuracy for MES = 1, reflecting the subjective nature of mild disease activity.
This study suggested that GPT-5 holds potential for static-image-based MES grading, but performance varies and requires external validation and optimization. Future work will optimize artificial intelligence endoscopy through tuning and multimodality.
Core Tip: This single-center proof-of-concept study was the first to systematically compare five multimodal large language models for static-image-based Mayo Endoscopic Subscore grading in ulcerative colitis. The key innovation was the direct benchmarking against experienced inflammatory bowel disease physicians and the incorporation of segment-specific analyses that uncovered anatomical heterogeneity in model performance. Notably, GPT-5 achieved near-expert-level accuracy without specialized tuning, highlighting its potential role in standardized endoscopic assessment.
- Citation: Zhao XY, Shen Y, He JJ, Zhou QY, Jiang LS, Zhou ZR, An FM, Zhan Q, Sun J, Feng W. Comparative evaluation of multimodal large language models for Mayo Endoscopic Subscore grading in ulcerative colitis. World J Gastroenterol 2026; 32(24): 118690
- URL: https://www.wjgnet.com/1007-9327/full/v32/i24/118690.htm
- DOI: https://dx.doi.org/10.3748/wjg.118690
Ulcerative colitis (UC) is a chronic idiopathic inflammatory disorder that predominantly affects individuals aged 20-40 years and typically presents with hematochezia and diarrhea[1]. Endoscopic evaluation remains essential for assessing disease activity[2]. The Mayo Endoscopic Subscore (MES) grades the most inflamed mucosal segment on a four-point scale (0-3: Normal; mild; moderate; and severe)[3,4]. Owing to its simplicity and reproducibility, MES is endorsed by international guidelines and serves as a primary endpoint in clinical trials of UC[5,6]. However, MES assessment depends heavily on the endoscopist’s experience and remains prone to interobserver variability[7]. Reducing this subjectivity could improve the precision of MES-based endoscopic evaluation and support wider clinical adoption.
Recent advances in multimodal large language models (MLLMs), such as OpenAI’s GPT series and Google’s Gemini model[8,9], have enabled automated image and text interpretation across diverse medical domains. MLLMs have demonstrated potential in UC to deliver algorithm-driven, objective, and reproducible grading of disease severity[10-13]. Levartovsky et al[10] reported that ChatGPT-4 achieved diagnostic accuracy comparable with that of inflammatory bowel disease (IBD) specialists in MES grade evaluation[10]. However, the study was limited by a small sample size (n = 30) and the absence of segment-based subgroup analysis, restricting generalizability. Other emerging models, including Grok and Qwen, have shown promise in diagnostic and therapeutic applications across multiple diseases[13-15], but their performance in UC remains unexplored. Furthermore, anatomical heterogeneity among intestinal segments has been shown to affect lesion detection, such as for polyps and adenomas, yet its influence on MES grading and MLLM performance has not been systematically evaluated.
Artificial intelligence (AI), particularly deep learning algorithms, has been widely applied in gastrointestinal imaging[16]. Models based on convolutional neural networks (CNNs) and vision transformers have demonstrated diagnostic performance on par with experienced endoscopists across large-scale endoscopic datasets[17-19]. However, despite the emergence of deep learning models tailored for small-sample learning[20,21], clinical utility in UC remains limited by scarce high-quality annotated data, limited generalizability, and reliance on unimodal inputs[22-24]. Moreover, conventional CNNs primarily extract pixel-level features and lack the semantic reasoning required to interpret complex clinical guidelines, further constraining their role in real-world clinical decision-making.
In contrast, emerging MLLMs represent a paradigm shift by integrating visual encoders with extensive linguistic knowledge bases[25]. MLLMs enable joint processing of heterogeneous data by aligning cross-modal information within a shared representation space[26]. Consequently, they have been applied across diverse imaging modalities, including CT, magnetic resonance imaging, and endoscopy, in conjunction with textual data. Models such as GPT demonstrate robust reasoning and analytical capabilities, and in certain domains outperform human benchmarks[27]. Vision-language integration may, therefore, allow MLLMs to interpret endoscopic findings according to established clinical diagnostic criteria rather than relying solely on pattern recognition[28]. This capability is particularly relevant for distinguishing borderline disease activity states, such as between MES 0, MES 1, and MES 2. Accurate classification in these settings depends on subtle qualitative features, including mild friability and partially obscured vascular patterns, which require semantic understanding and cognitive-level integration rather than binary pixel-level classification[29].
This study was designed as a preliminary, single-center evaluation to compare the diagnostic capabilities, interobserver agreement, and error patterns of five leading MLLMs (GPT-5, Gemini-2.5-Pro, Grok-4, GPT-4o, and Qwen-VL-Max) with those of experienced IBD physicians in the context of static-image-based MES grading. Segment-based analyses were performed to assess the effect of anatomical variation on model performance. The objective was to provide proof-of-concept evidence regarding the utility of MLLMs in this domain, acknowledging that further multicenter validation is essential to establish generalizability. The findings may inform future strategies for model training, algorithm optimization, and clinical implementation of AI-assisted static-image endoscopic evaluation.
This retrospective, real-world diagnostic accuracy study was conducted using clinical endoscopic data from patients with UC. The evaluation was conducted between September 1, 2025 and September 20, 2025. A schematic illustration of the study design is presented in Figure 1.
A total of 402 high-resolution colonoscopy images were consecutively collected from patients diagnosed with UC at Wuxi People’s Hospital between January 2021 and December 2025. Inclusion criteria were as follows: (1) High-quality endoscopic images; and (2) Adequate bowel preparation, defined as a Boston Bowel Preparation Scale score > 5[30]. Exclusion criteria included: (1) Images with severe artifacts; (2) Inadequate visualization of the mucosal surface; or (3) Coexistence of colorectal cancer or other colonic diseases.
All patient data were de-identified prior to analysis in accordance with the Declaration of Helsinki. The study was reviewed and approved by the Ethics Committee of Wuxi People’s Hospital Affiliated to Nanjing Medical University. (Approval No. KY25207).
Three experienced IBD specialists, all members of the Chinese IBD Guidelines Development Committee, independently reviewed and graded each image according to the international MES criteria (Table 1). Only images for which all three experts assigned identical MES grades were included in the final dataset. Ultimately, 283 static high-resolution JPEG images (70.4% of the initial dataset) were retained. These images represented the following intestinal segments: Ileocecal region (47 images); ascending colon (49); transverse colon (51); descending colon (46); sigmoid colon (45); and rectum (45).
| Score | Description | Endoscopic findings |
| 0 | Normal or inactive disease | Normal mucosal appearance; intact vascular pattern; no friability, bleeding, or ulceration |
| 1 | Mild disease | Mild friability; decreased but visible vascular pattern; mild erythema; no erosions |
| 2 | Moderate disease | Marked erythema; absent vascular pattern; friability; erosions may be present |
| 3 | Severe disease | Spontaneous bleeding; ulceration; denuded mucosa; severe friability |
To evaluate spectrum bias and assess model performance in ambiguous cases, we established a sensitivity analysis set derived from 119 images initially excluded due to lack of unanimous consensus. For these images a majority vote standard, defined as agreement by at least two of the three experts, was applied. Ten images without majority consensus (i.e. three different scores) were excluded, yielding a final set of 109 images for sensitivity analysis. Among these, 18 (16.5%) were MES 0, 42 (38.5%) were MES 1, 44 (40.4%) were MES 2, and 5 (4.6%) were MES 3.
Physician group: This group comprised two independent attending physicians specializing in IBD. Each had more than 3 years of clinical experience in endoscopic evaluation and UC management.
Model group: The model group included five MLLMs: GPT-5 (OpenAI, United States; version 2025-0807), Gemini-2.5-Pro (Google, United States; version 2025-0605), Grok-4 (xAI, United States; version 2025-0709), GPT-4o (OpenAI, United States; version 2025-0326), and Qwen-VL-Max (Alibaba Cloud, China; version 2025-0526). The five MLLMs were selected to represent state-of-the-art approaches across distinct architectural lineages and development origins as of mid-2025. This cohort included global proprietary flagship models (GPT-5, GPT-4o, Gemini-2.5-Pro, and Grok-4) recognized for advanced multimodal reasoning along with a linguistically diverse model (Qwen-VL-Max) to evaluate performance across different training data distributions. Although specific architectures are proprietary, comparing these models enables evaluation of diagnostic accuracy in MES grading indirectly by depending on detecting subtle mucosal features such as vascular pattern obscuration. It is primarily driven by visual encoder resolution (fine-grained feature extraction) or by instruction-tuning effectiveness in adhering to standardized clinical scoring criteria. To ensure reproducibility we used fixed snapshot versions of each model accessed through their respective application programming interfaces (APIs). These versions are static and do not undergo continuous updating, thereby eliminating temporal performance drift.
The scoring process consisted of two sequential phases designed to evaluate the diagnostic performance of MLLMs under distinct informational conditions. To simulate a realistic clinical workflow and evaluate robustness to raw inputs, no manual preprocessing was performed. Instead, input standardization was ensured by submitting the same set of original high-resolution JPEG files to all five MLLMs.
Phase A represented a “random” condition without contextual information. All 283 endoscopic images were randomly and independently evaluated by the two physicians and the five MLLMs. For the model group prompt parameters were standardized. The system content was defined as “I need you to be a professional gastroenterologist specializing in IBD and endoscopic examination.” The task prompt was defined as “You will analyze colonoscopy images of patients with UC and classify them according to the internationally accepted MES. For each image, please assign a score from 0 to 3, following the original file order.” Each model independently assessed all 283 images with a new chat session initialized for each evaluation to minimize potential memory bias.
Phase B represented a “segment” condition with contextual information. Using the same image set as in phase A, each model was provided with additional contextual information specifying the intestinal segment. Specifically, the ana
All models were accessed through their respective official APIs using the OpenAI-compatible Python 3.11 envi
While this study evaluated static-image uploads via APIs, real-world clinical implementation requires on-premise model deployment to ensure patient data privacy and mandates formal regulatory approval as software as a medical device. Furthermore, the optimal clinical workflow will transition from the retrospective static-image analysis utilized in this benchmarking study to real-time video overlay integrated directly into active endoscopic procedures.
All evaluations were benchmarked against the expert consensus, which served as the reference standard. The following quantitative metrics were calculated: Accuracy; precision; recall; macro-averaged F1 score (macro-F1); Cohen’s κ statistic (linearly weighted); mean absolute error (MAE); mean squared error (MSE); root mean squared error; standard deviation; coefficient of variation; and Pearson correlation coefficient (r). Stratified analyses were performed according to intestinal segment and MES grade. Interobserver agreement between the two physicians was quantified using Cohen’s κ statistic.
All analyses were conducted using the 283 endoscopic images for which consensus was reached among the three IBD experts. There were no missing data, and all interpretations represented paired observations.
In phase A, comparisons between models were conducted using the McNemar test for Accuracy and the Wilcoxon signed-rank test for MAE. To assess performance stability 95% confidence interval for macro-averaged F1 scores were estimated using bootstrap resampling with 1000 iterations. To account for multiple comparisons, the Bonferroni correction was applied. For comparisons between GPT-5 and the two human experts (physician 1 and physician 2), the significance threshold was set at α = 0.025 (0.05/2). For comparisons among the five MLLMs, the threshold was set at α = 0.01 (0.05/5). A two-tailed P value below these adjusted thresholds was considered indicative of statistical significance. Comparisons between model and physician performance were also assessed using the McNemar test for accuracy. A two-tailed P value < 0.05 was considered indicative of statistical significance. Analyses in phase B were primarily descriptive.
The diagnostic accuracies of the two senior IBD physicians were 81.6% and 78.4%, respectively, demonstrating substantial interobserver agreement (κ = 0.692). As summarized in Table 2, among the five MLLMs, GPT-5 achieved the best overall performance (F1 = 0.720; κ = 0.609, 95% confidence interval not crossing 0; MAE = 0.297; P < 0.001; Figure 2). GPT-5 demonstrated lower diagnostic accuracy than physician 1 (71.7% vs 81.6%; P = 0.005), whereas no significant difference was observed compared with Physician 2 (78.4%; P = 0.068). These findings suggest that GPT-5 approaches expert-level performance but has not yet fully matched the diagnostic precision of the most experienced specialists.
| Gemini-2.5-Pro | Grok-4 | GPT-4o | GPT-5 | Qwen-VL-Max | |
| Accuracy | 0.502 | 0.417 | 0.594 | 0.717 | 0.353 |
| Precision | 0.639 | 0.552 | 0.635 | 0.731 | 0.488 |
| Recall | 0.502 | 0.417 | 0.594 | 0.717 | 0.353 |
| F1 score (95%CI) | 0.480 (0.452-0.574) | 0.415 (0.369-0.481) | 0.602 (0.522-0.648) | 0.720 (0.665-0.773) | 0.338 (0.263-0.382) |
| Cohen’s κ | 0.343 | 0.239 | 0.449 | 0.608 | 0.133 |
| MAE | 0.611 | 0.770 | 0.452 | 0.297 | 0.767 |
| MSE | 0.866 | 1.194 | 0.544 | 0.325 | 1.021 |
| RMSE | 0.930 | 1.093 | 0.738 | 0.570 | 1.011 |
| SD | 0.781 | 0.918 | 0.697 | 0.564 | 0.953 |
| CV | 0.713 | 0.838 | 0.637 | 0.515 | 0.870 |
| r value1 | 0.681 | 0.590 | 0.777 | 0.852 | 0.478 |
Table 3 summarizes the mean F1 scores averaged across all five MLLMs for each intestinal segment: Sigmoid colon (0.631) > descending colon (0.574) approximately ascending colon (0.571) approximately transverse colon (0.567) > ileocecal region (0.493) > rectum (0.447). Overall, the models demonstrated the highest diagnostic consistency in the sigmoid colon, whereas the ileocecal region and rectum were the most challenging segments to classify accurately. As summarized in Table 4, GPT-5 consistently achieved the highest F1 scores across all intestinal segments, maintaining superior and stable performance throughout (Figure 3).
| Gemini-2.5-Pro | Grok-4 | GPT-4o | GPT-5 | Qwen-VL-Max | |
| Accuracy | 0.541 | 0.420 | 0.583 | 0.710 | 0.406 |
| Precision | 0.621 | 0.480 | 0.590 | 0.724 | 0.515 |
| Recall | 0.541 | 0.420 | 0.583 | 0.710 | 0.406 |
| F1 score (95%CI) | 0.546 (0.490-0.611) | 0.426 (0.363-0.481) | 0.584 (0.502-0.618) | 0.715 (0.664-0.768) | 0.408 (0.313-0.439) |
| Cohen’s κ | 0.381 | 0.229 | 0.425 | 0.596 | 0.190 |
| MAE | 0.565 | 0.753 | 0.473 | 0.300 | 0.707 |
| MSE | 0.799 | 1.141 | 0.587 | 0.322 | 0.940 |
| RMSE | 0.894 | 1.068 | 0.766 | 0.567 | 0.969 |
| SD | 0.843 | 0.973 | 0.755 | 0.566 | 0.951 |
| CV | 0.769 | 0.888 | 0.689 | 0.517 | 0.868 |
| r value1 | 0.644 | 0.573 | 0.749 | 0.850 | 0.499 |
| Segments | Models | Accuracy | Precision | Recall | F1 score | Cohen’s κ | MAE | MSE | RMSE | SD | CV | r value1 |
| Ileocecal region | Gemini-2.5-Pro | 0.426 | 0.752 | 0.426 | 0.478 | 0.226 | 0.787 | 1.213 | 1.101 | 0.77 | 1.508 | 0.606 |
| Grok-4 | 0.362 | 0.727 | 0.362 | 0.456 | 0.144 | 1.043 | 1.979 | 1.407 | 1.031 | 2.018 | 0.453 | |
| GPT-4o | 0.532 | 0.643 | 0.532 | 0.566 | 0.25 | 0.574 | 0.787 | 0.887 | 0.79 | 1.547 | 0.639 | |
| GPT-5 | 0.596 | 0.718 | 0.596 | 0.625 | 0.356 | 0.426 | 0.468 | 0.684 | 0.593 | 1.162 | 0.751 | |
| Qwen-VL-Max | 0.298 | 0.547 | 0.298 | 0.339 | 0.058 | 0.957 | 1.468 | 1.212 | 0.956 | 1.872 | 0.315 | |
| Ascending colon | Gemini-2.5-Pro | 0.49 | 0.739 | 0.49 | 0.514 | 0.309 | 0.673 | 1.041 | 1.02 | 0.766 | 1.211 | 0.665 |
| Grok-4 | 0.388 | 0.641 | 0.388 | 0.434 | 0.175 | 0.939 | 1.673 | 1.294 | 1.004 | 1.586 | 0.5 | |
| GPT-4o | 0.673 | 0.711 | 0.673 | 0.679 | 0.477 | 0.367 | 0.449 | 0.67 | 0.624 | 0.986 | 0.793 | |
| GPT-5 | 0.796 | 0.812 | 0.796 | 0.798 | 0.657 | 0.204 | 0.204 | 0.452 | 0.435 | 0.687 | 0.894 | |
| Qwen-VL-Max | 0.367 | 0.727 | 0.367 | 0.432 | 0.144 | 0.776 | 1.061 | 1.03 | 0.857 | 1.355 | 0.513 | |
| Transverse colon | Gemini-2.5-Pro | 0.549 | 0.72 | 0.549 | 0.573 | 0.394 | 0.471 | 0.51 | 0.714 | 0.621 | 0.621 | 0.803 |
| Grok-4 | 0.333 | 0.527 | 0.333 | 0.368 | 0.12 | 0.804 | 1.078 | 1.038 | 0.893 | 0.893 | 0.612 | |
| GPT-4o | 0.647 | 0.71 | 0.647 | 0.665 | 0.504 | 0.373 | 0.412 | 0.642 | 0.604 | 0.604 | 0.839 | |
| GPT-5 | 0.784 | 0.818 | 0.784 | 0.796 | 0.688 | 0.216 | 0.216 | 0.464 | 0.454 | 0.454 | 0.9 | |
| Qwen-VL-Max | 0.392 | 0.543 | 0.392 | 0.431 | 0.155 | 0.725 | 0.961 | 0.98 | 0.956 | 0.956 | 0.495 | |
| Descending colon | Gemini-2.5-Pro | 0.63 | 0.723 | 0.63 | 0.628 | 0.503 | 0.435 | 0.565 | 0.752 | 0.74 | 0.587 | 0.738 |
| Grok-4 | 0.413 | 0.432 | 0.413 | 0.416 | 0.211 | 0.739 | 1.087 | 1.043 | 1.034 | 0.82 | 0.518 | |
| GPT-4o | 0.63 | 0.626 | 0.63 | 0.615 | 0.491 | 0.435 | 0.565 | 0.752 | 0.72 | 0.571 | 0.777 | |
| GPT-5 | 0.696 | 0.732 | 0.696 | 0.689 | 0.583 | 0.326 | 0.37 | 0.608 | 0.598 | 0.474 | 0.842 | |
| Qwen-VL-Max | 0.565 | 0.532 | 0.565 | 0.524 | 0.408 | 0.5 | 0.63 | 0.794 | 0.791 | 0.628 | 0.689 | |
| Sigmoid colon | Gemini-2.5-Pro | 0.667 | 0.715 | 0.667 | 0.682 | 0.532 | 0.333 | 0.333 | 0.577 | 0.556 | 0.313 | 0.828 |
| Grok-4 | 0.689 | 0.696 | 0.689 | 0.683 | 0.57 | 0.311 | 0.311 | 0.558 | 0.558 | 0.314 | 0.855 | |
| GPT-4o | 0.556 | 0.571 | 0.556 | 0.541 | 0.378 | 0.489 | 0.578 | 0.76 | 0.739 | 0.416 | 0.686 | |
| GPT-5 | 0.778 | 0.796 | 0.778 | 0.783 | 0.688 | 0.222 | 0.222 | 0.471 | 0.452 | 0.254 | 0.891 | |
| Qwen-VL-Max | 0.467 | 0.653 | 0.467 | 0.434 | 0.222 | 0.556 | 0.6 | 0.775 | 0.748 | 0.421 | 0.63 | |
| Rectum | Gemini-2.5-Pro | 0.489 | 0.5 | 0.489 | 0.491 | 0.253 | 0.689 | 1.133 | 1.065 | 1.062 | 0.724 | 0.332 |
| Grok-4 | 0.356 | 0.387 | 0.356 | 0.35 | 0.125 | 0.644 | 0.644 | 0.803 | 0.788 | 0.537 | 0.665 | |
| GPT-4o | 0.444 | 0.502 | 0.444 | 0.453 | 0.238 | 0.622 | 0.756 | 0.869 | 0.865 | 0.59 | 0.606 | |
| GPT-5 | 0.6 | 0.611 | 0.6 | 0.594 | 0.433 | 0.422 | 0.467 | 0.683 | 0.665 | 0.454 | 0.752 | |
| Qwen-VL-Max | 0.356 | 0.352 | 0.356 | 0.348 | 0.06 | 0.711 | 0.889 | 0.943 | 0.926 | 0.631 | 0.403 |
Inclusion of segmental information led to an overall increase in diagnostic accuracy by 1.55% across all models. The lower-performing models, Gemini-2.5-Pro and Qwen-VL-Max, demonstrated the largest relative gains (+3.89% and +5.30%, respectively), whereas Grok-4 demonstrated only a marginal gain (+0.354%). GPT-4o and GPT-5 exhibited negligible change, indicating performance stability regardless of contextual input.
Model and physician performance across different MES grades is summarized in Table 5. Both senior IBD physicians achieved their highest diagnostic accuracy when identifying MES = 0 (mean accuracy = 99.1%) and their lowest when classifying MES = 1 (mean accuracy = 60.3%). Similarly, the MLLMs performed best on MES = 0 images (mean accuracy = 86.1%) and worst on MES = 1 (mean accuracy = 39.4%). The confusion-matrix distributions (Figures 4, 5, and 6) revealed that most misclassifications for both physicians and models occurred between adjacent grades (MES 0-1, 1-2, and 2-3). Notably, the models tended to overestimate disease severity (e.g., 0-1), whereas the physicians were more likely to underestimate it (e.g., 1-0). GPT-5 demonstrated accuracy comparable with Physician 1 in detecting moderate-to-severe inflammation (MES = 2/3, P = 0.710) but significantly outperformed physician 2 (P = 0.040).
| MES grade | 0 | 1 | 2 | 3 |
| Expert 1 | 100 | 53.4 | 81.8 | 82.9 |
| Expert 2 | 98.2 | 67.1 | 66.7 | 62.9 |
| GPT-5 | 86.5 | 56.3 | 62.4 | 87.1 |
| GPT-4o | 85.3 | 45.2 | 53.8 | 52.2 |
| Gemini-2.5-Pro | 93.1 | 38.9 | 45.4 | 60.0 |
| Grok-4 | 88.9 | 30.1 | 35.1 | 40.3 |
| Qwen-VL-Max | 76.7 | 26.3 | 33.3 | 38.5 |
Performance across all five MLLMs declined significantly when evaluated on the 109 non-consensus images in the sensitivity analysis set, confirming that these cases represent a more challenging disease spectrum. GPT-5 achieved an accuracy of 56.9% and a macro-F1 score of 0.529 (Supplementary Table 1), representing a marked decline from its performance on the consensus set (accuracy 71.7%, F1 0.720). Grok-4 showed a similar deterioration with accuracy decreasing to 31.2%. Notably, a large proportion of these ambiguous images were classified as MES 1 (38.5%) or MES 2 (40.4%), contributing to the lower agreement. This analysis quantified spectrum bias in the primary dataset and highlighted the difficulty AI models face when interpreting equivocal endoscopic features.
This study presented the first systematic evaluation of multiple state-of-the-art MLLMs for the specialized clinical task of MES grading directly compared with experienced IBD physicians. GPT-5 demonstrated the highest overall performance (Figure 2), achieving diagnostic accuracy comparable with that of experienced physicians. Model-level analyses revealed that the rectum and ileocecal region were the most challenging intestinal segments to interpret (Table 4 and Figure 3), and MES = 1 was the most difficult grade to classify accurately (Table 5). Providing segmental information markedly improved diagnostic accuracy for lower-performing models (Tables 3 and 4). To the best of our knowledge, this is the first study to evaluate the latest GPT-5 model in MES grading for UC, and the results demonstrated its superior per
Substantial diagnostic agreement was observed among human physicians (κ = 0.692) although some interobserver variability persisted, consistent with previous reports[10,31]. This finding underscores the continued need for objective, reproducible tools to standardize endoscopic evaluation[7]. Compared with earlier CNN architectures, the MLLMs in this study achieved robust accuracy without task-specific fine-tuning. GPT-5 exhibited the highest overall performance (accuracy = 71.7%), confirming the potential of next-generation foundation models in endoscopic image interpretation. While these results remain slightly below the accuracy of some highly optimized CNN models[32-35], they highlight the key advantages of MLLMs, including strong generalization ability and minimal deployment requirements for clinical integration.
A previous study by Levartovsky et al[10] reported a diagnostic accuracy of 78.9% using GPT-4, a finding significantly higher than that observed in the present study. This discrepancy may partly reflect differences in dataset composition and evaluation difficulty. The images included in our dataset were likely more complex and diagnostically challenging, providing a more stringent evaluation of model robustness under real-world clinical conditions. In addition, their study[10] analyzed only 30 images, potentially introducing random variation or sampling bias. In contrast, our dataset com
Segment-wise analysis revealed that GPT-5 achieved scoring accuracy comparable with that of senior IBD physicians in the ascending and transverse colon, whereas the rectum and ileocecal region remained common interpretive bottlenecks for both human experts and models. The underlying causes likely relate to the distinctive anatomical and physiological characteristics of these regions[7]. In the rectum nonspecific erythema may arise from fecal irritation, prior enema procedures, or retroflexed observation, confounding grading accuracy. Similarly, the complex architecture of the ileocecal region and adjacent lymphoid tissue can obscure mucosal visualization and hinder reliable endoscopic interpretation.
Further stratified analysis demonstrated that incorporating additional anatomical contextual information generally enhanced model performance with the greatest improvements observed in lower-tier models. For instance, lymphoid hyperplasia in the ileocecal region can mimic mild inflammation (MES = 1); providing explicit segmental context allowed models to reference prior knowledge of benign regional features, thereby reducing misclassification[17]. In contrast, advanced models such as GPT-5 exhibited minimal benefit from additional contextual input, likely because extensive multimodal pre-training had already internalized segment-specific anatomical patterns. These findings suggest that the interpretative accuracy of GPT-5 is largely invariant to intestinal segment variability, underscoring its potential for robust deployment in dynamic, real-time endoscopic settings.
Stratified analysis by MES grade revealed that both GPT-5 and senior IBD physicians exhibited stable performance in recognizing the two extremes of disease activity (MES = 0 and MES = 3), whereas accuracy declined markedly for intermediate grades (MES = 1). This challenge likely reflects the inherent subjectivity of MES = 1, which is defined by ambiguous descriptors such as “mild erythema” and “blurred vascular pattern”. Notably, patients classified as MES = 1 have been shown to experience significantly worse clinical outcomes than those with MES = 0, influencing relapse risk and guiding therapeutic decisions[36]. These findings emphasize that future AI systems will require extensive training on large, diverse, and meticulously annotated datasets to more effectively capture and interpret these subtle, borderline endoscopic features.
Despite the strengths and clinical relevance of this study, several limitations should be considered. First, the data were derived from a single center with a relatively small sample size, potentially limiting statistical power and restricting generalizability. Multicenter studies with larger cohorts are required for external validation. Second, the analysis was confined to high-quality endoscopic images (Boston bowel preparation score > 5), introducing potential image-quality bias and possibly overestimating performance relative to real-world clinical practice where motion artifacts, residual stool, and mucus are frequently encountered. Future studies should include endoscopic images spanning a broad range of quality levels and perform stratified analyses to better reflect real-world clinical conditions. Third, all models were evaluated using a default temperature of 1.0 to assess performance under a more challenging, non-deterministic environment. Achieving accuracy comparable with that of senior clinicians under this condition (in the presence of random interference) suggests intrinsic robustness in handling complex medical classification tasks. However, this approach may introduce unnecessary stochasticity. Accordingly, future work will involve systematic hyperparameter tuning to identify optimal configurations and enhance diagnostic consistency for clinical deployment. Finally, limited model interpretability represents an important constraint. The commercial MLLMs evaluated do not currently provide visual explanatory outputs, such as attention maps or heatmaps, thereby limiting insight into the underlying rationale of model predictions and potentially hindering clinical trust and adoption.
All models evaluated in this study were general-purpose MLLMs, each offering distinct advantages in accessibility and ease of implementation. However, their performance remains constrained by several inherent limitations, including the relatively small size of available image datasets and reliance on static image inputs[37]. These limitations represent key targets for future model optimization and methodological advancement. First, targeted fine-tuning on large, high-quality, and expertly annotated endoscopic datasets is essential to enhance model capability in analyzing diagnostically challenging regions (such as the ileocecal region and rectum) and borderline inflammatory states (e.g., MES = 1). Second, because key indicators of disease activity, including mucosal friability and spontaneous hemorrhage, are often dynamic and episodic, static-image-based grading may underestimate or misclassify inflammatory severity. Incorporating full-length endoscopic video analysis would allow models to account for these transient features, thereby improving diagnostic accuracy and interobserver consistency. Third, true multimodal integration of endoscopic imagery with clinical data, including symptoms, laboratory markers, and histopathological findings, will enable a more comprehensive and clinically meaningful framework for disease activity assessment.
Optimized MLLMs hold substantial potential for clinical application. They could serve as educational tools helping physicians rapidly familiarize themselves with the MES scoring system. In clinical practice MLLMs may function as decision-support systems, offering real-time “second-opinion” guidance by highlighting suspicious mucosal regions during endoscopic examinations, thereby reducing missed lesions, minimizing interobserver variability, and improving procedural efficiency. Furthermore, in the context of large-scale clinical trials, MLLMs could act as centralized reviewers to enhance the objectivity, reproducibility, and consistency of endoscopic activity assessments.
MLLMs demonstrated promising potential for clinical translation in static-image-based MES grading. Incorporation of anatomical segment information served as an effective auxiliary strategy, particularly enhancing the accuracy of lower-performing models. Mild inflammation (MES = 1) and anatomically complex regions such as the rectum and ileocecal area remained common interpretive challenges for both human physicians and AI systems. Looking forward, targeted fine-tuning on large annotated datasets and comprehensive multimodal data integration are expected to further enhance MLLM performance, supporting their evolution into reliable tools for precision endoscopic diagnosis and physician training in UC.
| 1. | Voelker R. What Is Ulcerative Colitis? JAMA. 2024;331:716. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 133] [Cited by in RCA: 123] [Article Influence: 61.5] [Reference Citation Analysis (0)] |
| 2. | Mohammed Vashist N, Samaan M, Mosli MH, Parker CE, MacDonald JK, Nelson SA, Zou GY, Feagan BG, Khanna R, Jairath V. Endoscopic scoring indices for evaluation of disease activity in ulcerative colitis. Cochrane Database Syst Rev. 2018;1:CD011450. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 60] [Cited by in RCA: 98] [Article Influence: 12.3] [Reference Citation Analysis (0)] |
| 3. | Inflammatory Bowel Disease Group; Chinese Society of Gastroenterology; Chinese Medical Association; Inflammatory Bowel Disease Quality Control Center of China. 2023 Chinese national clinical practice guideline on diagnosis and management of ulcerative colitis. Chin Med J (Engl). 2024;137:1642-1646. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 20] [Reference Citation Analysis (1)] |
| 4. | Schroeder KW, Tremaine WJ, Ilstrup DM. Coated oral 5-aminosalicylic acid therapy for mildly to moderately active ulcerative colitis. A randomized study. N Engl J Med. 1987;317:1625-1629. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 2583] [Cited by in RCA: 2346] [Article Influence: 60.2] [Reference Citation Analysis (10)] |
| 5. | Yoon H, Jangi S, Dulai PS, Boland BS, Prokop LJ, Jairath V, Feagan BG, Sandborn WJ, Singh S. Incremental Benefit of Achieving Endoscopic and Histologic Remission in Patients With Ulcerative Colitis: A Systematic Review and Meta-Analysis. Gastroenterology. 2020;159:1262-1275.e7. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 194] [Cited by in RCA: 183] [Article Influence: 30.5] [Reference Citation Analysis (0)] |
| 6. | Shehab M, Alrashed F, Alsayegh A, Aldallal U, Ma C, Narula N, Jairath V, Singh S, Bessissow T. Comparative Efficacy of Biologics and Small Molecule in Ulcerative Colitis: A Systematic Review and Network Meta-analysis. Clin Gastroenterol Hepatol. 2025;23:250-262. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 5] [Cited by in RCA: 38] [Article Influence: 38.0] [Reference Citation Analysis (1)] |
| 7. | Fernandes SR, Pinto JSLD, Marques da Costa P, Correia L; GEDII. Disagreement Among Gastroenterologists Using the Mayo and Rutgeerts Endoscopic Scores. Inflamm Bowel Dis. 2018;24:254-260. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 14] [Cited by in RCA: 33] [Article Influence: 4.1] [Reference Citation Analysis (0)] |
| 8. | OpenAI; Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R, Babuschkin I, Balaji S, Balcom V, Baltescu P, Bao H, Bavarian M, Belgum J, Bello I, Berdine J, Bernadett-Shapiro G, Berner C, Bogdonoff L, Boiko O, Boyd M, Brakman AL, Brockman G, Brooks T, Brundage M, Button K, Cai T, Campbell R, Cann A, Carey B, Carlson C, Carmichael R, Chan B, Chang C, Chantzis F, Chen D, Chen S, Chen R, Chen J, Chen M, Chess B, Cho C, Chu C, Chung HW, Cummings D, Currier J, Dai Y, Decareaux C, Degry T, Deutsch N, Deville D, Dhar A, Dohan D, Dowling S, Dunning S, Ecoffet A, Eleti A, Eloundou T, Farhi D, Fedus L, Felix N, Fishman SP, Forte J, Fulford I, Gao L, Georges E, Gibson C, Goel V, Gogineni T, Goh G, Gontijo-Lopes R, Gordon J, Grafstein M, Gray S, Greene R, Gross J, Gu SS, Guo Y, Hallacy C, Han J, Harris J, He Y, Heaton M, Heidecke J, Hesse C, Hickey A, Hickey W, Hoeschele P, Houghton B, Hsu K, Hu S, Hu X, Huizinga J, Jain S, Jain S, Jang J, Jiang A, Jiang R, Jin H, Jin D, Jomoto S, Jonn B, Jun H, Kaftan T, Kaiser Ł, Kamali A, Kanitscheider I, Keskar NS, Khan T, Kilpatrick L, Kim JW, Kim C, Kim Y, Kirchner JH, Kiros J, Knight M, Kokotajlo D, Kondraciuk Ł, Kondrich A, Konstantinidis A, Kosic K, Krueger G, Kuo V, Lampe M, Lan I, Lee T, Leike J, Leung J, Levy D, Li CM, Lim R, Lin M, Lin S, Litwin M, Lopez T, Lowe R, Lue P, Makanju A, Malfacini K, Manning S, Markov T, Markovski Y, Martin B, Mayer K, Mayne A, McGrew B, McKinney SM, McLeavey C, McMillan P, McNeil J, Medina D, Mehta A, Menick J, Metz L, Mishchenko A, Mishkin P, Monaco V, Morikawa E, Mossing D, Mu T, Murati M, Murk O, Mély D, Nair A, Nakano R, Nayak R, Neelakantan A, Ngo R, Noh H, Ouyang L, O'Keefe C, Pachocki J, Paino A, Palermo J, Pantuliano A, Parascandolo G, Parish J, Parparita E, Passos A, Pavlov M, Peng A, Perelman A, Peres FDAB, Petrov M, Pinto HPDO, (Rai)Pokorny M, Pokrass M, Pong VH, Powell T, Power A, Power B, Proehl E, Puri R, Radford A, Rae J, Ramesh A, Raymond C, Real F, Rimbach K, Ross C, Rotsted B, Roussez H, Ryder N, Saltarelli M, Sanders T, Santurkar S, Sastry G, Schmidt H, Schnurr D, Schulman J, Selsam D, Sheppard K, Sherbakov T, Shieh J, Shoker S, Shyam P, Sidor S, Sigler E, Simens M, Sitkin J, Slama K, Sohl I, Sokolowsky B, Song Y, Staudacher N, Such FP, Summers N, Sutskever I, Tang J, Tezak N, Thompson MB, Tillet P, Tootoonchian A, Tseng E, Tuggle P, Turley N, Tworek J, Uribe JFC, Vallone A, Vijayvergiya A, Voss C, Wainwright C, Wang JJ, Wang A, Wang B, Ward J, Wei J, Weinmann C, Welihinda A, Welinder P, Weng J, Weng L, Wiethoff M, Willner D, Winter C, Wolrich S, Wong H, Workman L, Wu S, Wu J, Wu M, Xiao K, Xu T, Yoo S, Yu K, Yuan Q, Zaremba W, Zellers R, Zhang C, Zhang M, Zhao S, Zheng T, Zhuang J, Zhuk W, ZophB. GPT-4 Technical Report. 2023 Preprint. Available from: arXiv:2303.08774. |
| 9. | Gemini Team Google. Gemini: A Family of Highly Capable Multimodal Models. 2023 Preprint. Available from: arXiv:2312.11805. |
| 10. | Levartovsky A, Albshesh A, Grinman A, Shachar E, Lahat A, Eliakim R, Kopylov U. Enhancing diagnostics: ChatGPT-4 performance in ulcerative colitis endoscopic assessment. Endosc Int Open. 2025;13:a25420943. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 2] [Cited by in RCA: 2] [Article Influence: 2.0] [Reference Citation Analysis (0)] |
| 11. | Levartovsky A, Ben-Horin S, Kopylov U, Klang E, Barash Y. Towards AI-Augmented Clinical Decision-Making: An Examination of ChatGPT's Utility in Acute Ulcerative Colitis Presentations. Am J Gastroenterol. 2023;118:2283-2289. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 34] [Reference Citation Analysis (0)] |
| 12. | Sciberras M, Farrugia Y, Gordon H, Furfaro F, Allocca M, Torres J, Arebi N, Fiorino G, Iacucci M, Verstockt B, Magro F, Katsanos K, Busuttil J, De Giovanni K, Fenech VA, Chetcuti Zammit S, Ellul P. Accuracy of Information given by ChatGPT for Patients with Inflammatory Bowel Disease in Relation to ECCO Guidelines. J Crohns Colitis. 2024;18:1215-1221. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 3] [Cited by in RCA: 38] [Article Influence: 19.0] [Reference Citation Analysis (0)] |
| 13. | Shean R, Shah T, Pandiarajan A, Tang A, Bolo K, Nguyen V, Xu B. A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions. Sci Rep. 2025;15:23101. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 13] [Reference Citation Analysis (0)] |
| 14. | Wu X, Cai G, Guo B, Ma L, Shao S, Yu J, Zheng Y, Wang L, Yang F. A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios. BMC Oral Health. 2025;25:1272. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 10] [Cited by in RCA: 14] [Article Influence: 14.0] [Reference Citation Analysis (0)] |
| 15. | Sozer A, Sahin MC, Sozer B, Erol G, Tufek OY, Nernekli K, Demirtas Z, Celtikci E. Do LLMs Have 'the Eye' for MRI? Evaluating GPT-4o, Grok, and Gemini on Brain MRI Performance: First Evaluation of Grok in Medical Imaging and a Comparative Analysis. Diagnostics (Basel). 2025;15:1320. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 7] [Reference Citation Analysis (0)] |
| 16. | Zhang Y, Han J, Chen H, Hu F, Huang Y, Tian G, Zhong D, Yang J. Deep learning-based fusion of nuclear segmentation features for microsatellite instability and tumor mutational burden prediction in digestive tract cancers: a multicenter validation study. Brief Bioinform. 2025;26:bbaf580. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 6] [Reference Citation Analysis (0)] |
| 17. | Osada T, Ohkusa T, Yokoyama T, Shibuya T, Sakamoto N, Beppu K, Nagahara A, Otaka M, Ogihara T, Watanabe S. Comparison of several activity indices for the evaluation of endoscopic activity in UC: inter- and intraobserver consistency. Inflamm Bowel Dis. 2010;16:192-197. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 80] [Cited by in RCA: 73] [Article Influence: 4.6] [Reference Citation Analysis (3)] |
| 18. | Stidham RW, Liu W, Bishu S, Rice MD, Higgins PDR, Zhu J, Nallamothu BK, Waljee AK. Performance of a Deep Learning Model vs Human Reviewers in Grading Endoscopic Disease Severity of Patients With Ulcerative Colitis. JAMA Netw Open. 2019;2:e193963. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 256] [Cited by in RCA: 205] [Article Influence: 29.3] [Reference Citation Analysis (6)] |
| 19. | Shiku K, Nishimura K, Suehiro D, Tanaka K, Bise R. Ordinal Multiple-instance Learning for Ulcerative Colitis Severity Estimation with Selective Aggregated Transformer. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2025; Tucson, United States. IEEE Xplore, United States. [DOI] [Full Text] |
| 20. | Dheivya I, Kumar GS. VSegNet – A Variant SegNet for Improving Segmentation Accuracy in Medical Images with Class Imbalance and Limited Data. Medinform. 2024;2:36-48. [DOI] [Full Text] |
| 21. | Yang B, Xu S, Yin L, Liu C, Zheng W. Disparity estimation of stereo-endoscopic images using deep generative network. ICT Express. 2025;11:74-79. [DOI] [Full Text] |
| 22. | Qin Y, Chang J, Li L, Wu M. Enhancing gastroenterology with multimodal learning: the role of large language model chatbots in digestive endoscopy. Front Med (Lausanne). 2025;12:1583514. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 13] [Cited by in RCA: 7] [Article Influence: 7.0] [Reference Citation Analysis (0)] |
| 23. | Patel M, Gulati S, Iqbal F, Hayee B. Rapid development of accurate artificial intelligence scoring for colitis disease activity using applied data science techniques. Endosc Int Open. 2022;10:E539-E543. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 8] [Reference Citation Analysis (0)] |
| 24. | Testoni SGG, Albertini Petroni G, Annunziata ML, Dell'Anna G, Puricelli M, Delogu C, Annese V. Artificial Intelligence in Inflammatory Bowel Disease Endoscopy. Diagnostics (Basel). 2025;15:905. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 7] [Reference Citation Analysis (0)] |
| 25. | Nakase H, Hirano T, Wagatsuma K, Ichimiya T, Yamakawa T, Yokoyama Y, Hayashi Y, Hirayama D, Kazama T, Yoshii S, Yamano HO. Artificial intelligence-assisted endoscopy changes the definition of mucosal healing in ulcerative colitis. Dig Endosc. 2021;33:903-911. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 22] [Cited by in RCA: 20] [Article Influence: 4.0] [Reference Citation Analysis (0)] |
| 26. | Yin S, Fu C, Zhao S, Li K, Sun X, Xu T, Chen E. A survey on multimodal large language models. Natl Sci Rev. 2024;11:nwae403. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 78] [Reference Citation Analysis (0)] |
| 27. | Zhang J, Li Y, Fukuda T, Wang B. Urban safety perception assessments via integrating multimodal large language models with street view images. Cities. 2025;165:106122. [DOI] [Full Text] |
| 28. | Qiu J, Yuan W, Lam K. The application of multimodal large language models in medicine. Lancet Reg Health West Pac. 2024;45:101048. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 10] [Reference Citation Analysis (0)] |
| 29. | AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R, Sheikh J. Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. J Med Internet Res. 2024;26:e59505. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 110] [Reference Citation Analysis (0)] |
| 30. | Lai EJ, Calderwood AH, Doros G, Fix OK, Jacobson BC. The Boston bowel preparation scale: a valid and reliable instrument for colonoscopy-oriented research. Gastrointest Endosc. 2009;69:620-625. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 1033] [Cited by in RCA: 1006] [Article Influence: 59.2] [Reference Citation Analysis (6)] |
| 31. | Chang YY, Yang HP, Chen YY, Yen HH. Comparison of the performance between an AI-based vision transformer and human endoscopists in predicting the endoscopic and histologic activities of ulcerative colitis. Digit Health. 2026;12:20552076251412694. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in RCA: 3] [Reference Citation Analysis (0)] |
| 32. | Kuroki T, Maeda Y, Kudo SE, Ogata N, Iacucci M, Takishima K, Ide Y, Shibuya T, Semba S, Kawashima J, Kato S, Ogawa Y, Ichimasa K, Nakamura H, Hayashi T, Wakamura K, Miyachi H, Baba T, Nemoto T, Ohtsuka K, Misawa M. A novel artificial intelligence-assisted "vascular healing" diagnosis for prediction of future clinical relapse in patients with ulcerative colitis: a prospective cohort study (with video). Gastrointest Endosc. 2024;100:97-108. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 23] [Cited by in RCA: 23] [Article Influence: 11.5] [Reference Citation Analysis (1)] |
| 33. | Ogata N, Maeda Y, Misawa M, Takenaka K, Takabayashi K, Iacucci M, Kuroki T, Takishima K, Sasabe K, Niimura Y, Kawashima J, Ogawa Y, Ichimasa K, Nakamura H, Matsudaira S, Sasanuma S, Hayashi T, Wakamura K, Miyachi H, Baba T, Mori Y, Ohtsuka K, Ogata H, Kudo SE. Artificial Intelligence-assisted Video Colonoscopy for Disease Monitoring of Ulcerative Colitis: A Prospective Study. J Crohns Colitis. 2025;19:jjae080. [RCA] [PubMed] [DOI] [Full Text] [Full Text (PDF)] [Cited by in Crossref: 19] [Cited by in RCA: 21] [Article Influence: 21.0] [Reference Citation Analysis (0)] |
| 34. | Takabayashi K, Kobayashi T, Matsuoka K, Levesque BG, Kawamura T, Tanaka K, Kadota T, Bise R, Uchida S, Kanai T, Ogata H. Artificial intelligence quantifying endoscopic severity of ulcerative colitis in gradation scale. Dig Endosc. 2024;36:582-590. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 2] [Cited by in RCA: 19] [Article Influence: 9.5] [Reference Citation Analysis (0)] |
| 35. | Kuroki T, Maeda Y, Kudo SE, Ogata N, Takabayashi K, Takenaka K, Kawashima J, Kawabata Y, Iwasaki S, Shiina O, Morita Y, Kouyama Y, Sakurai T, Ogawa Y, Baba T, Mori Y, Iacucci M, Ogata H, Ohtsuka K, Misawa M. Combination of white-light imaging-based and narrow-band imaging-based artificial intelligence models during colonoscopy in patients with ulcerative colitis. J Crohns Colitis. 2025;19:jjaf014. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 1] [Cited by in RCA: 5] [Article Influence: 5.0] [Reference Citation Analysis (0)] |
| 36. | Jin X, You Y, Ruan G, Zhou W, Li J, Li J. Deep mucosal healing in ulcerative colitis: how deep is better? Front Med (Lausanne). 2024;11:1429427. [RCA] [PubMed] [DOI] [Full Text] [Cited by in RCA: 5] [Reference Citation Analysis (0)] |
| 37. | Moradi M, Samwald M. Explaining Black-Box Models for Biomedical Text Classification. IEEE J Biomed Health Inform. 2021;25:3112-3120. [RCA] [PubMed] [DOI] [Full Text] [Cited by in Crossref: 3] [Cited by in RCA: 7] [Article Influence: 1.4] [Reference Citation Analysis (0)] |