Copyright
©The Author(s) 2025.
World J Gastroenterol. Oct 21, 2025; 31(39): 111353
Published online Oct 21, 2025. doi: 10.3748/wjg.v31.i39.111353
Table 1 Role of artificial intelligence-based endoscopy in the evaluation of patients with inflammatory bowel diseases
Ref. | Disease/number of patients | Type of study | Endoscopic technique | Number of training samples | Number of test samples | AI/model | Main findings |
Stidham et al[22] | UC/3082 patients | Retrospective, single center | WLE | 14862 images | 1652 images | DL-CNN | Discriminating ER (MES ≤ 1) from moderate-severe disease (MES ≥ 2) (AUC = 0.966, sensitivity = 83.0%, specificity = 96.0%). AI and pathologist agreement (κ = 0.840 vs κ = 0.860) |
Maeda et al[38] | UC/187 patients | Retrospective, single center | Endocytoscopy | 12900 still images | 525 segments | CAD | Prediction of HR (GS ≥ 3.1) (sensitivity = 74.0%, specificity = 97.0%, precision = 91.0%, κ = 1.000) |
Ozawa et al[40] | UC/955 patients | Retrospective, single center | WLE | 26304 still images | 3981 still images | CAD-CNN | AI performance for mucosal healing (MES ≤ 1, AUC = 0.980) |
Takenaka et al[25] | UC/875 patients | Prospective, single center | WLE | 40789 still images | 4187 still images | DNUC | Evaluation of ER (UCEIS ≤ 2) (accuracy = 90.1%, ICC = 0.917). Evaluation of HR (GS < 3.1) (accuracy = 92.9%, κ = 0.859) |
Bossuyt et al[41] | UC/35 patients | Prospective, multicenter | Prototype endoscope | NR | NR | CAD | RD for endoscopic/histological inflammation: Correlation with MES (r = 0.76), UCEIS (r = 0.74), RHI (r = 0.74). RD score (≤ 60) predicts HR (AUC = 0.950, sensitivity = 96.0%, specificity = 80.0%) |
Yao et al[27] | UC/157 patients | Prospective, multicenter | WLE | NR | 264 videos of high resolution | DL-CNN | The still image informative classifier had excellent performance (sensitivity = 0.902, specificity = 0.870). Correct prediction of MES: 78% of videos (κ = 0.840) |
Gottlieb et al[30] | UC/249 patients | Prospective, multicenter | WLE | 629 videos | 157 videos | RNN | Endoscopic healing evaluation according to UCEIS (accuracy = 97.0%) and MES (accuracy = 95.5%). Agreement of the model with human experts for MES (QWK = 0.844) and UCEIS (QWK = 0.855) |
Bossuyt et al[42] | UC/58 patients | Prospective, single center | SWE | NR | 113 still images | CAD | AI algorithm yielded better HR accuracy (86.0%) than MES (74.0%) or UCEIS (79.0%) |
Huang et al[43] | UC/54 patients | Retrospective, single center | Endoscopy HD | 600 still images | 256 still images | DNN, SVM, k-NN | Performance of the combined model for differentiation between MES ≤ 1 and MES 2 (AUC = 0.927, accuracy = 94.5%, sensitivity = 89.2%, specificity = 96.3%) |
Takenaka et al[44] | UC/770 patients | Prospective, multicenter | WLE | NR | NR | DNUC | Prediction of HR (sensitivity = 97.9%, specificity = 94.6%). Agreement between the DNUC and experts for endoscopic assessment (ICC = 0.927) |
Patel et al[45] | UC/73 patients | Prospective, single center | Endoscopy HD | 55 video images | 18 video images | MLA | Differentiation between: remission (UCEIS: 0-1) and active inflammation (UCEIS ≥ 2) (accuracy = 90.0%, κ = 0.900); mild (UCEIS: 2-3) and moderate-to-severe inflammation (UCEIS ≥ 4) (accuracy = 98.0%, κ = 0.960) |
Kim et al[19] | UC/492 patients | Retrospective, single center | WLE | 904 still images | 80 still images | DL-CNN | Difference between MES 0 vs MES 1: Internal test. IBD experts (F1 score = 0.92, AUC = 0.970); External test. Hyper Kvasir dataset (F1 score = 0.89, AUC = 0.860) |
Polat et al[20] | UC/564 patients | Retrospective, single center | WLE | 11276 still images | 1658 still images | DL-CNN | Excellent concordance between the five CNN networks and endoscopists for MES evaluation (QWK: 0.847-0.854) and classification of remission cases (QWK: 0.834-0.852) |
Wang et al[21] | UC/308 patients | Retrospective, single center | WLE | 37515 still images | 3191 still images | CNN | Diagnosis of ER (MES ≤ 1) (AUC = 0.980, accuracy = 95.1%, sensitivity = 92.9%, specificity = 95.4%, κ = 0.884) |
Iacucci et al[46] | UC/283 patients | Prospective, multicenter | WLE, VCE | 239 video images; 245 video images | 242 video images; 244 video images | CNN | Detection of ER using VCE (PICaSSO ≤ 3) (AUC = 0.940, sensitivity = 79.0%, specificity = 95.0%, κ = 0.730) achieved better performance than WLE (UCEIS ≤ 1) (AUC = 0.850, sensitivity = 72.0%, specificity = 87.0%, κ = 0.510) |
Byrne et al[47] | UC/NR | Prospective, single center | HD endoscopy | 134 video images | NR | DL-CNN | Performance for disease severity discrimination: MES ≤ 1 vs MES ≥ 2 (AUC = 0.941, accuracy = 94.0%, sensitivity = 96.7%, specificity = 91.3%, QWK = 0.880); UCEIS ≤ 3 vs UCEIS > 3 (AUC = 0.936, accuracy = 94.0%, sensitivity = 93.9%, specificity = 93.4%, QWK = 0.870) |
Stidham et al[31] | UC/748 patients | Prospective, multicenter | WLE | NR | NR | ML | CDS had better performance for detecting endoscopic changes than MES (Hedges’ g: 0.743 vs 0.460, P < 0.001) |
Takabayashi et al[48] | UC/812 patients | Retrospective, multicenter | WLE | 14208 still images | 13826 still images | CNN | Disease severity grading-correlation between: UCEGS and MES (ρ = 0.890, P < 0.001); UCEGS and IBD experts (ρ: 0.960-0.987, P < 0.001) |
Ogata et al[49] | UC/110 patients | Prospective, single center | WLE | 74713 still images | 11452 still images | CNN | Performance for evaluating ER based on MES (sensitivity = 96.9%, specificity = 78.4%, accuracy = 93.4%). Interobserver/intraobserver agreement with AI/without AI (ICC: 0.84-0.86/0.89 vs 0.64-0.76/0.76) |
Sinonquel et al[50] | UC/36 patients | Prospective, single center | SWE | NR | NR | CAD | Histological assessment using SWE-CAD (sensitivity = 96.1%, specificity = 85.5%, accuracy = 96.4%). The accuracy of classification into mild, moderate, and severe disease was 97.7%, 62.8% and 95.0%, respectively |
Aoki et al[51] | CD/131 patients | Retrospective, single center | CE | 5360 images | 10440 images | CNN | Ulcer recognition in small bowel video frames (AUC = 0.958, sensitivity = 88.2%, specificity = 90.9%, accuracy = 90.8%) |
Klang et al[52] | CD/49 patients | Retrospective, single center | CE | 14112 images | 3528 images | DL-CNN | Increased performance in ulcer detection (AUC = 0.990, accuracy: 95.4%-96.7%) |
Klang et al[32] | CD/145 patients | Retrospective, single center | CE | 27892 images | 1449 images | DNN | Performance for: Stricture detection (AUC = 0.971, accuracy = 93.5%); Differential diagnosis between strictures and normal mucosa (AUC = 0.989); Discrimination between strictures and ulcers (AUC = 0.942) |
Barash et al[53] | CD/49 patients | Retrospective, single center | CE | 1242 images | 248 images | CNN | Ability of ulcerative lesion classification: Grade 1 vs 3 (AUC = 0.958, accuracy = 91.0%, κ = 0.910); Grade 2 vs 3 (AUC = 0.939, accuracy = 79.0%, κ = 0.790); Grade 1 vs 2 (AUC = 0.565, accuracy = 62.4%, κ = 0.670) |
Majtner et al[54] | CD/38 patients | Retrospective, single center | CE | 5421 images | 1549 images | DL | Performance in ulcer detection (sensitivity = 95.7%, specificity = 99.8%, accuracy = 98.4%). Agreement between the model and manual reading of ulcerations (κ = 0.720) |
Udristoiu et al[55] | CD/54 patients | Retrospective, single center | pCLE | 5081 images | 1124 images | CNN | Differentiation between inflammation and intact colonic mucosa (AUC = 0.980, accuracy = 95.3%, specificity = 92.8%, sensitivity = 94.6%) |
de Maissin et al[56] | CD/63 patients | Retrospective, multicenter | CE | 2449 images | 700 images | RNN | Performance for discriminating pathological vs non-pathological images (accuracy = 93.7%, sensitivity = 93.0%, specificity = 95.0%, κ = 0.790) |
Ribeiro et al[57] | CD/124 patients | Retrospective, multicenter | CE | 37319 images | 124 images | CNN | Identification of colonic ulcerations and erosions (AUC = 1.000, accuracy = 99.6%, sensitivity = 96.9%, specificity = 99.9%) |
Ferreira et al[58] | CD/NR | Retrospective, multicenter | CE | 19740 images | 4935 images | DL-CNN | Performance of the model for lesion detection (sensitivity = 90.0%, specificity = 96.0%, precision = 97.1%, accuracy = 92.4%) |
Afonso et al[59] | CD/NR | Retrospective, single center | CE | 4904 images | 1226 images | CNN | Detection of ulcers and erosions in the small intestine mucosa (accuracy = 95.6%, sensitivity = 90.8%, specificity = 97.1%) |
Martins et al[60] | CD/250 patients | Retrospective, single center | DAE | 250 DAE images | 6772 images | CNN | Identification of colonic ulcerations and erosions (AUC = 1.000, accuracy = 98.7%, sensitivity = 88.5%, specificity = 99.7%) |
Brodersen et al[34] | CD/131 patients | Prospective, multicenter | CE | NR | NR | DL | The identification capacity for CD (sensitivity: 92.0%-96.0% and specificity: 90.0%-93.0%) and IBD (sensitivity: 97.0% and specificity: 90.0%-91.0%) |
Xie et al[61] | CD/628 patients | Retrospective, single center | DBE | NR | 28155 images | DL | The accuracy for detection of ulcers (96.3%), inflammatory stenosis (95.7%), and non-inflammatory stenosis (96.7%). The grading of ulcers based on surface area, size, and depth (precision between 85.2% and 87.8%) |
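Most entries in Table 1 report sensitivity, specificity, accuracy, and sometimes precision or F1 score. As a reading aid, the following minimal sketch shows how these quantities relate to the four cells of a 2×2 confusion matrix under their standard definitions. The class name, function name, and example counts are hypothetical and are not taken from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing model calls with a reference standard."""
    tp: int  # reference positive, model positive
    fp: int  # reference negative, model positive
    tn: int  # reference negative, model negative
    fn: int  # reference positive, model negative


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the standard summary metrics for a binary classifier."""
    sensitivity = c.tp / (c.tp + c.fn)  # also called recall
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)    # positive predictive value
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
example = ConfusionCounts(tp=90, fp=20, tn=180, fn=10)
for name, value in diagnostic_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

Because accuracy pools both classes, it can stay high when one class dominates the test set, which is one reason the tables list sensitivity and specificity separately.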
Table 2 Artificial intelligence-enabled digital pathology for histological assessment in inflammatory bowel diseases
Ref. | Disease/number of patients | Type of study | Number of training samples | Number of test samples | AI/model | Main findings |
Vande Casteele et al[66] | UC/88 patients | Retrospective, single center | 20 tissue regions | 88 biopsies | DL | Performance in identifying eosinophil counts: WSI (sensitivity = 86.4%, accuracy = 91.8%, F1 score = 0.89); Strong agreement with four human experts (ICC: 0.805-0.917) |
Gui et al[69] | UC/307 patients | Prospective, multicenter | 97 biopsies | 41 biopsies | CAD-CNN | Discrimination between HR (PHRI < 1) and non-remission (PHRI ≥ 1) based on the presence of neutrophils (sensitivity = 78.0%, specificity = 91.7%, accuracy = 86.0%, ICC = 0.840) |
Ohara et al[71] | UC/114 patients | Retrospective, single center | 2300 WSIs | 114 biopsies | DL | Rate of relapse higher for GCR ≤ 12% compared to GCR > 12% (45.0% vs 6.5%, P < 0.010) |
Najdawi et al[68] | UC/577 patients | Retrospective, single center | 512 WSIs | 308 WSIs | CNN-RFC | HR prediction (NHI ≤ 1, accuracy = 97.0%) was comparable to expert pathologist assessments (κ = 0.910, Spearman’s correlation ρ = 0.890, P < 0.010) |
Iacucci et al[70] | UC/273 patients | Prospective, multicenter | 118 biopsies | 375 biopsies (1); 154 biopsies (2) | CAD-CNN | (1) Performance in distinguishing HR from active inflammation: RHI (AUC = 0.850, accuracy = 80.0%, sensitivity = 94.0%, specificity = 76.0%); NHI (AUC = 0.860, accuracy = 81.0%, sensitivity = 89.0%, specificity = 79.0%); PHRI (AUC = 0.870, accuracy = 87.0%, sensitivity = 89.0%, specificity = 85.0%); and (2) The hazard ratio for disease recurrence according to PHRI was higher for AI assessment compared with pathologist evaluation (4.64 vs 3.56, P < 0.001) |
Peyrin-Biroulet et al[73] | UC/NR | Retrospective, single center | 160 WSIs | 40 WSIs | CNN | The average ICC between histopathologists and the AI tool for histological assessment based on NHI (ICC = 0.872) |
Ohara et al[72] | UC/96 patients | Prospective, single center | 11260 patches | 135 WSIs | DL | Histological evaluation based on neutrophil quantification in WSI (accuracy = 77.0%, F1 score = 79.0%). Prediction of histological scores (PHRI, NHI) by AI showed strong correlation with pathological diagnoses (Spearman’s ρ = 0.680-0.800, P < 0.050) |
Klein et al[78] | CD/105 patients | Retrospective, single center | Biopsies NR | Biopsies NR | NNET | Differentiation of clinical phenotypes (sensitivity = 78.0%, specificity = 77.0%). Prediction of surgical intervention (sensitivity = 80.0%, specificity = 91.0%) |
Kiyokawa et al[76] | CD/68 patients | Retrospective, single center | 619464 tile images | 308705 tile images | CNN | Adipocyte shrinkage and increased mast cell infiltration in sub-serosal adipose tissue anticipate postoperative recurrence (AUC = 0.995, accuracy = 96.9%, precision = 96.4%, sensitivity = 96.5%) |
Wang et al[77] | CD/205 patients | Retrospective, multicenter | 310 WSIs | 278 WSIs | DL-CNN | Severity of myenteric plexitis (accuracy = 83.3%) and postoperative recurrence prediction (AUC = 0.980) |
Rymarczyk et al[79] | UC/887 patients; CD/302 patients | Retrospective, multicenter | 2696 biopsies | 800 biopsies | RNN, FV + RF, SA-AbMILP | SA-AbMILP for automated histological assessment of: GS: Accuracy: 65.0%-85.0% (κ = 0.440-0.680); GHAS: Colon accuracy: 80.0%-89.0% (κ = 0.540-0.650); Ileum accuracy 65.0%-82.0% (κ = 0.460-0.670) |
Furlanello et al[74] | IBD/52 patients | Prospective, single center | 4981 WSIs | 356 biopsies | DL | Automated quantification of basal plasmacytosis discriminates IBD from non-IBD (accuracy = 90.0%) |
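Agreement between AI and human readers is reported in Tables 1 and 2 as Cohen's κ or quadratic weighted κ (QWK). The sketch below shows one common way to compute both from paired ordinal grades, assuming grades are coded as integers from 0 to `n_categories - 1` (for example, MES 0-3). The function name and the sample grades are hypothetical; the cited studies may have used different software or weighting conventions.

```python
def weighted_kappa(rater_a, rater_b, n_categories, quadratic=True):
    """Cohen's kappa for two raters; quadratic weights give QWK."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of grades")
    k = n_categories
    n = len(rater_a)

    # Observed contingency table: observed[i][j] counts cases graded i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            if quadratic:
                weight = (i - j) ** 2 / (k - 1) ** 2  # larger penalty for distant grades
            else:
                weight = 0.0 if i == j else 1.0       # unweighted kappa
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all grades fall in one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical MES grades from a model and an expert reader.
model_grades = [0, 1, 2, 3, 1, 2, 0, 3]
expert_grades = [0, 1, 2, 2, 1, 3, 0, 3]
print(round(weighted_kappa(model_grades, expert_grades, n_categories=4), 3))
```

Quadratic weighting treats a one-grade disagreement as much less serious than a three-grade disagreement, which suits ordinal scores such as MES and UCEIS.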
Table 3 Artificial intelligence-assisted prediction of therapeutic response in inflammatory bowel diseases
Ref. | Disease/number of patients | Study design | Therapy | AI/model | Main findings |
Waljee et al[91] | UC/491 patients | Retrospective, multicenter | VDZ | ML-RF | Long-term steroid-free ER prediction (AUC = 0.730) using laboratory data from first 6 weeks of VDZ |
Waljee et al[88] | CD/401 patients | Retrospective, multicenter | UST | ML-RF | Week 8 CRP/ALB ratio predicts UST non-response (AUC = 0.780, sensitivity = 79.0%, specificity = 67.0%) vs baseline data (AUC = 0.590, sensitivity = 63.0%, specificity = 64.0%) |
Con et al[81] | CD/146 patients | Retrospective, single center | IFX, ADA | DL-RNN | AI model using CRP < 5 mg/L better predicts post-therapy remission than conventional model (AUC: 0.754 vs 0.659, P = 0.036) |
Li et al[92] | CD/174 patients | Retrospective, single center | IFX | ML-RF | Response to IFX predicted by clinical/serological data (AUC = 0.900, accuracy = 85.0%, sensitivity = 81.0%, specificity = 94.0%) |
He et al[93] | CD/86 patients | Retrospective, single center | UST | ML | UST response prediction based on HSD3B1, MUC4, CF1, and CCL11 expression (AUC: 0.734-0.746) |
Park et al[94] | CD/234 patients | Prospective, multicenter | anti-TNF | ML | The likelihood of a non-durable response associated with hyperexpression of DPY19L3 (β = 2.703) and GSTT1 (β = 1.735), and decreasing NUCB1 concentration (β = -2.142) |
Kellerman et al[33] | CD/101 patients | Retrospective, single center | ADA, IFX, VDZ | DL-TimeSformer | Prediction of biologic initiation in newly diagnosed patients (AUC = 0.860, accuracy: 81.0%-82.0%), outperforming human reader (AUC = 0.700) and FC (AUC = 0.740) |
Iacucci et al[36] | IBD/29 patients | Prospective, single center | anti-TNF, anti-α4β7 | CAD | pCLE-detected crypt/vessel abnormalities and fluorescein leakage predict therapy response in UC (AUC = 0.930, accuracy = 85.0%) and CD (AUC = 0.790, accuracy = 80.0%); better anti-TNF prediction in UC (AUC = 0.830) than in CD (AUC = 0.580) |
Stidham et al[31] | UC/748 patients with induction; 348 patients with maintenance | Prospective, single center | UST | CAD | CDSs were significantly lower in UST vs placebo both at week 8 (141.9 vs 184.3, P < 0.0001) and week 44 (78.2 vs 151.5, P < 0.0001). Stratification by baseline CDS showed increased UST efficacy in patients with severe disease compared with mild disease (-85.0 vs -55.4, P < 0.0001) |
Harun et al[87] | UC/1684 patients with induction; 463 patients with maintenance | Prospective, multicenter | Etrolizumab | ML-SHAP | Remission prediction post-induction (AUC = 0.740) and maintenance (AUC = 0.750) using combined demographic, clinical, physiologic, and histological data |
Qiu et al[89] | CD/746 patients | Retrospective, single center | IFX | ML-SHAP | Response discrimination using integrated predictors (HB, WBC, ESR, ALB, PLT, age at diagnosis, Montreal classification) (training set AUC = 0.910, si-test set AUC = 0.710) |
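The AUC values in Table 3 can be read as the probability that a randomly chosen responder receives a higher model score than a randomly chosen non-responder. The sketch below computes the AUC directly from that pairwise definition, which is equivalent to the Mann-Whitney U statistic divided by the number of pairs. The function name and the scores are hypothetical and do not come from the cited studies.

```python
def pairwise_auc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical predicted probabilities of response.
responders = [0.9, 0.8, 0.4]
non_responders = [0.7, 0.3, 0.2]
print(round(pairwise_auc(responders, non_responders), 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The pairwise loop is quadratic in sample size, so a rank-based implementation would be preferable for large cohorts.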
Table 4 Overview of machine learning models for diagnosis, prognosis, and treatment optimization in inflammatory bowel disease
Ref. | AI model | Field of application | Disease | Outcomes | Performance |
Najdawi et al[68] | ML-RF | Histological assessment | UC | Evaluation of HR | Strong agreement with pathologists in relation to the NHI score (κ = 0.910, Spearman coefficient of ρ = 0.890) (P < 0.001) |
Waljee et al[91] | ML-RF | Therapy | UC | Predicting corticosteroid-free ER with VDZ at week 52 | AUC = 0.730, sensitivity = 72.0%, specificity = 68.0% according to the results at week 6 |
Waljee et al[88] | ML-RF | Therapy | CD | Anticipation of UST response at week 42 | AUC = 0.780, sensitivity = 79.0%, specificity = 67.0% based on demographic and laboratory data up to week 8 |
Li et al[92] | ML-RF | Therapy | CD | Assessment of therapeutic response to IFX | AUC = 0.900, accuracy = 85.0%, sensitivity = 81.0%, specificity = 94.0% |
He et al[93] | ML-RF | Therapy | CD | Prediction of therapeutic response to UST based on expression profile of four genes | AUC: 0.734-0.746 |
Stidham et al[103] | ML-RF | Risk stratification | CD | Evaluation of surgical outcomes | AUC = 0.780 |
Maeda et al[90] | ML-SVM | Endoscopic assessment | UC | Evaluation of persistent inflammation | Sensitivity = 74.0%, specificity = 97.0%, precision = 91.0% |
Maeda et al[90] | ML-SVM | Risk stratification | UC | Assessment of relapse risk | Increased rate in patients with active form (28.4%) compared with those in clinical remission (4.9%, P < 0.001) |
Harun et al[87] | ML-XGBoost | Therapy | UC | Remission prediction post-induction and maintenance for etrolizumab | AUC: 0.740-0.750 |
Park et al[94] | ML-XGBoost | Therapy | CD | Prediction of therapeutic response to anti-TNF | Non-response associated with hyperexpression of DPY19L3 (β = 2.703) and GSTT1 (β = 1.735), and decreased NUCB1 concentration (β = -2.142) |
Qiu et al[89] | ML-XGBoost | Therapy | CD | Prediction of therapeutic response to IFX | AUC = 0.910 |
Takenaka et al[44] | DL-DNN | Endoscopic assessment | UC | Prediction of HR | Sensitivity = 97.9%, specificity = 94.6%, ICC = 0.927 |
Huang et al[43] | DL-DNN | Endoscopic assessment | UC | Evaluation of mucosal healing | AUC = 0.927, accuracy = 93.8%, sensitivity = 84.6%, specificity = 96.9% |
Klang et al[32] | DL-DNN | Endoscopic assessment | CD | Identification of strictures | AUC = 0.989, precision = 93.5% |
Klang et al[32] | DL-DNN | Endoscopic assessment | CD | Grading the severity of ulcerations | AUC = 0.992 (mild cases); AUC = 0.975 (moderate cases); AUC = 0.889 (severe cases) |
Ozawa et al[40] | DL-CNN | Endoscopic assessment | UC | Discrimination between MES ≤ 1 and MES 2; diagnosis of ER (MES ≤ 1) | AUC = 0.980 |
Wang et al[21] | DL-CNN | Endoscopic assessment | UC | Discrimination between MES ≤ 1 and MES 2; diagnosis of ER (MES ≤ 1) | AUC = 0.980, accuracy = 95.1%, sensitivity = 92.9%, specificity = 95.4%, κ = 0.884 |
Stidham et al[22] | DL-CNN | Endoscopic assessment | UC | Discriminating ER from active endoscopic disease | AUC = 0.966, sensitivity = 83.0%, specificity = 96.0%. Excellent agreement between expert reviewers (κ = 0.860) |
Gottlieb et al[30] | DL-CNN | Endoscopic assessment | UC | Evaluation of mucosal healing | Accuracy: 95.5%-97.0%. Agreement with expert readers for MES (κ = 0.844) and UCEIS (κ = 0.855) |
Takenaka et al[25] | DL-CNN | Endoscopic assessment | UC | Prediction of ER and HR | Accuracy = 90.1%, ICC = 0.917 (UCEIS ≤ 2). Accuracy = 92.9%, κ = 0.859 (GS < 3.1) |
Yao et al[27] | DL-CNN (Inception-V3) | Endoscopic assessment | UC | Assessment of disease severity | AUC = 0.939, sensitivity = 90.2%, specificity = 87.0% |
Gui et al[69] | DL-CNN | Histological assessment | UC | Prediction of HR (PHRI < 1) according to the presence or absence of neutrophils | Sensitivity = 78.0%, specificity = 91.7%, accuracy = 86.0%, ICC = 0.84 |
Iacucci et al[70] | DL-CNN | Histological assessment | UC | Prediction of HR (PHRI < 1) according to the presence or absence of neutrophils | AUC = 0.870, accuracy = 87.0%, sensitivity = 89.0%, specificity = 85.0% |
Vande Casteele et al[66] | DL-CNN | Histological assessment | UC | Quantification of eosinophils in colonic biopsies | The model had sensitivity = 0.86, specificity = 0.91, accuracy = 0.89 |
Udristoiu et al[55] | DL-CNN | Endoscopic assessment | CD | Differentiation between inflammation and intact colonic mucosa | AUC = 0.980, accuracy = 95.3%, specificity = 92.8%, sensitivity = 94.6% |
Majtner et al[54] | DL-CNN (ResNet-50) | Endoscopic assessment | CD | Ulcer detection | The diagnostic accuracy was 98.5% for the small bowel and 98.1% for the colon |
Kellerman et al[33] | DL-TimeSformer | Endoscopic assessment | CD | Prediction of biologic initiation in newly diagnosed patients | AUC = 0.860, accuracy: 81.0%-82.0% |
Rymarczyk et al[79] | DL-CNN (SA-AbMILP) | Histological assessment | CD | Automatic histological assessment for GHAS and GS | Accuracy between 65.0%-89.0% |
Furlanello et al[74] | DL-CNN (StarDist) | Histological assessment | IBD | Discriminates IBD from non-IBD mucosa | Accuracy = 90.0% |
Kiyokawa et al[76] | DL-CNN (EfficientNet-b5) | Risk stratification | CD | Prediction of postoperative recurrence | AUC = 0.995, accuracy = 96.9%, precision = 96.4% |
Con et al[81] | DL-RNN | Therapy | CD | Predicts post-therapy remission to anti-TNF | AUC = 0.754 |
- Citation: Minea H, Singeap AM, Minea M, Chiriac S, Stanciu C, Trifan A. Artificial intelligence in inflammatory bowel disease: Current applications and future directions. World J Gastroenterol 2025; 31(39): 111353
- URL: https://www.wjgnet.com/1007-9327/full/v31/i39/111353.htm
- DOI: https://dx.doi.org/10.3748/wjg.v31.i39.111353