Review
Copyright ©The Author(s) 2025.
World J Gastroenterol. Oct 21, 2025; 31(39): 111353
Published online Oct 21, 2025. doi: 10.3748/wjg.v31.i39.111353
Table 1 Role of artificial intelligence-based endoscopy in the evaluation of patients with inflammatory bowel diseases
| Ref. | Disease/number of patients | Type of study | Endoscopic technique | Number of training samples | Number of test samples | AI/model | Main findings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Stidham et al[22] | UC/3082 patients | Retrospective, single center | WLE | 14862 images | 1652 images | DL-CNN | Discriminating ER (MES ≤ 1) from moderate-severe disease (MES ≥ 2) (AUC = 0.966, sensitivity = 83.0%, specificity = 96.0%). AI and pathologist agreement (κ = 0.840 vs κ = 0.860) |
| Maeda et al[38] | UC/187 patients | Retrospective, single center | Endocytoscopy | 12900 still images | 525 segments | CAD | Prediction of persistent histological inflammation (GS ≥ 3.1) (sensitivity = 74.0%, specificity = 97.0%, precision = 91.0%, κ = 1.000) |
| Ozawa et al[40] | UC/955 patients | Retrospective, single center | WLE | 26304 still images | 3981 still images | CAD-CNN | AI performance for mucosal healing (MES ≤ 1, AUC = 0.980) |
| Takenaka et al[25] | UC/875 patients | Prospective, single center | WLE | 40789 still images | 4187 still images | DNUC | Evaluation of ER (UCEIS ≤ 2) (accuracy = 90.1%, ICC = 0.917). Evaluation of HR (GS < 3.1) (accuracy = 92.9%, κ = 0.859) |
| Bossuyt et al[41] | UC/35 patients | Prospective, multicenter | Prototype endoscope | NR | NR | CAD | RD for endoscopic/histological inflammation: Correlation with MES (r = 0.76), UCEIS (r = 0.74), RHI (r = 0.74). RD score (≤ 60) predicts HR (AUC = 0.950, sensitivity = 96.0%, specificity = 80.0%) |
| Yao et al[27] | UC/157 patients | Prospective, multicenter | WLE | NR | 264 videos of high resolution | DL-CNN | The still image informative classifier had excellent performance (sensitivity = 0.902, specificity = 0.870). Correct prediction of MES: 78% of videos (κ = 0.840) |
| Gottlieb et al[30] | UC/249 patients | Prospective, multicenter | WLE | 629 videos | 157 videos | RNN | Endoscopic healing evaluation according to UCEIS (accuracy = 97.0%) and MES (accuracy = 95.5%). Agreement of the model with human experts for MES (QWK = 0.844) and UCEIS (QWK = 0.855) |
| Bossuyt et al[42] | UC/58 patients | Prospective, single center | SWE | NR | 113 still images | CAD | AI algorithm yielded better HR accuracy (86.0%) than MES (74.0%) or UCEIS (79.0%) |
| Huang et al[43] | UC/54 patients | Retrospective, single center | Endoscopy HD | 600 still images | 256 still images | DNN, SVM, k-NN | Performance of the combined model for differentiation between MES ≤ 1 and MES 2 (AUC = 0.927, accuracy = 94.5%, sensitivity = 89.2%, specificity = 96.3%) |
| Takenaka et al[44] | UC/770 patients | Prospective, multicenter | WLE | NR | NR | DNUC | Prediction of HR (sensitivity = 97.9%, specificity = 94.6%). Agreement between the DNUC and experts for endoscopic assessment (ICC = 0.927) |
| Patel et al[45] | UC/73 patients | Prospective, single center | Endoscopy HD | 55 video images | 18 video images | MLA | Differentiation between remission (UCEIS: 0-1) and active inflammation (UCEIS ≥ 2) (accuracy = 90.0%, κ = 0.900); and between mild (UCEIS: 2-3) and moderate-to-severe inflammation (UCEIS ≥ 4) (accuracy = 98.0%, κ = 0.960) |
| Kim et al[19] | UC/492 patients | Retrospective, single center | WLE | 904 still images | 80 still images | DL-CNN | Discrimination between MES 0 and MES 1: internal test, IBD experts (F1 score = 0.92, AUC = 0.970); external test, HyperKvasir dataset (F1 score = 0.89, AUC = 0.860) |
| Polat et al[20] | UC/564 patients | Retrospective, single center | WLE | 11276 still images | 1658 still images | DL-CNN | Excellent concordance between the five CNN networks and endoscopists for MES evaluation (QWK: 0.847-0.854) and for classification of remission cases (QWK: 0.834-0.852) |
| Wang et al[21] | UC/308 patients | Retrospective, single center | WLE | 37515 still images | 3191 still images | CNN | Diagnosis of ER (MES ≤ 1) (AUC = 0.980, accuracy = 95.1%, sensitivity = 92.9%, specificity = 95.4%, κ = 0.884) |
| Iacucci et al[46] | UC/283 patients | Prospective, multicenter | WLE, VCE | 239 video images; 245 video images | 242 video images; 244 video images | CNN | Detection of ER using VCE (PICaSSO ≤ 3) (AUC = 0.940, sensitivity = 79.0%, specificity = 95.0%, κ = 0.730) achieved better performance than WLE (UCEIS ≤ 1) (AUC = 0.850, sensitivity = 72.0%, specificity = 87.0%, κ = 0.510) |
| Byrne et al[47] | UC/NR | Prospective, single center | HD endoscopy | 134 video images | NR | DL-CNN | Performance for disease severity discrimination: MES ≤ 1 vs MES ≥ 2 (AUC = 0.941, accuracy = 94.0%, sensitivity = 96.7%, specificity = 91.3%, QWK = 0.880); UCEIS ≤ 3 vs UCEIS > 3 (AUC = 0.936, accuracy = 94.0%, sensitivity = 93.9%, specificity = 93.4%, QWK = 0.870) |
| Stidham et al[31] | UC/748 patients | Prospective, multicenter | WLE | NR | NR | ML | CDS had better performance for detecting endoscopic changes than MES (Hedges’ g: 0.743 vs 0.460, P < 0.001) |
| Takabayashi et al[48] | UC/812 patients | Retrospective, multicenter | WLE | 14208 still images | 13826 still images | CNN | Disease severity grading: correlation between UCEGS and MES (ρ = 0.890, P < 0.001); correlation between UCEGS and IBD experts (ρ: 0.960-0.987, P < 0.001) |
| Ogata et al[49] | UC/110 patients | Prospective, single center | WLE | 74713 still images | 11452 still images | CNN | Performance for evaluation of ER based on MES (sensitivity = 96.9%, specificity = 78.4%, accuracy = 93.4%). Interobserver/intraobserver agreement with AI/without AI (ICC: 0.84-0.86/0.89 vs 0.64-0.76/0.76) |
| Sinonquel et al[50] | UC/36 patients | Prospective, single center | SWE | NR | NR | CAD | Histological assessment using SWE-CAD (sensitivity = 96.1%, specificity = 85.5%, accuracy = 96.4%). The accuracy of classification into mild, moderate, and severe disease was 97.7%, 62.8% and 95.0%, respectively |
| Aoki et al[51] | CD/131 patients | Retrospective, single center | CE | 5360 images | 10440 images | CNN | Ulcer recognition in small bowel video frames (AUC = 0.958, sensitivity = 88.2%, specificity = 90.9%, accuracy = 90.8%) |
| Klang et al[52] | CD/49 patients | Retrospective, single center | CE | 14112 images | 3528 images | DL-CNN | Increased performance in ulcer detection (AUC = 0.990, accuracy: 95.4%-96.7%) |
| Klang et al[32] | CD/145 patients | Retrospective, single center | CE | 27892 images | 1449 images | DNN | Performance for: Stricture detection (AUC = 0.971, accuracy = 93.5%); Differential diagnosis between strictures and normal mucosa (AUC = 0.989); Discrimination between strictures and ulcers (AUC = 0.942) |
| Barash et al[53] | CD/49 patients | Retrospective, single center | CE | 1242 images | 248 images | CNN | Ability of ulcerative lesion classification: Grade 1 vs 3 (AUC = 0.958, accuracy = 91.0%, κ = 0.910); Grade 2 vs 3 (AUC = 0.939, accuracy = 79.0%, κ = 0.790); Grade 1 vs 2 (AUC = 0.565, accuracy = 62.4%, κ = 0.670) |
| Majtner et al[54] | CD/38 patients | Retrospective, single center | CE | 5421 images | 1549 images | DL | Performance in ulcer detection (sensitivity = 95.7%, specificity = 99.8%, accuracy = 98.4%). Agreement between the model and manual reading of ulcerations (κ = 0.720) |
| Udristoiu et al[55] | CD/54 patients | Retrospective, single center | pCLE | 5081 images | 1124 images | CNN | Differentiation between inflammation and intact colonic mucosa (AUC = 0.980, accuracy = 95.3%, specificity = 92.8%, sensitivity = 94.6%) |
| de Maissin et al[56] | CD/63 patients | Retrospective, multicenter | CE | 2449 images | 700 images | RNN | Performance for discriminating pathological vs non-pathological images (accuracy = 93.7%, sensitivity = 93.0%, specificity = 95.0%, κ = 0.790) |
| Ribeiro et al[57] | CD/124 patients | Retrospective, multicenter | CE | 37319 images | 124 images | CNN | Identification of colonic ulcerations and erosions (AUC = 1.000, accuracy = 99.6%, sensitivity = 96.9%, specificity = 99.9%) |
| Ferreira et al[58] | CD/NR | Retrospective, multicenter | CE | 19740 images | 4935 images | DL-CNN | Performance of the model for lesion detection (sensitivity = 90.0%, specificity = 96.0%, precision = 97.1%, accuracy = 92.4%) |
| Afonso et al[59] | CD/NR | Retrospective, single center | CE | 4904 images | 1226 images | CNN | Detection of ulcers and erosions in the small intestine mucosa (accuracy = 95.6%, sensitivity = 90.8%, specificity = 97.1%) |
| Martins et al[60] | CD/250 patients | Retrospective, single center | DAE | 250 DAE images | 6772 images | CNN | Identification of colonic ulcerations and erosions (AUC = 1.000, accuracy = 98.7%, sensitivity = 88.5%, specificity = 99.7%) |
| Brodersen et al[34] | CD/131 patients | Prospective, multicenter | CE | NR | NR | DL | The identification capacity for CD (sensitivity: 92.0%-96.0% and specificity: 90.0%-93.0%) and IBD (sensitivity: 97.0% and specificity: 90.0%-91.0%) |
| Xie et al[61] | CD/628 patients | Retrospective, single center | DBE | NR | 28155 images | DL | The accuracy for detection of ulcers (96.3%), inflammatory stenosis (95.7%), and non-inflammatory stenosis (96.7%). The grading of ulcers based on surface area, size, and depth (precision between 85.2% and 87.8%) |
Table 2 Artificial intelligence-enabled digital pathology for histological assessment in inflammatory bowel diseases
| Ref. | Disease/number of patients | Type of study | Number of training samples | Number of test samples | AI/model | Main findings |
| --- | --- | --- | --- | --- | --- | --- |
| Vande Casteele et al[66] | UC/88 patients | Retrospective, single center | 20 tissue regions | 88 biopsies | DL | Performance in identifying eosinophil counts: WSI (sensitivity = 86.4%, accuracy = 91.8%, F1 score = 0.89); Strong agreement with four human experts (ICC: 0.805-0.917) |
| Gui et al[69] | UC/307 patients | Prospective, multicenter | 97 biopsies | 41 biopsies | CAD-CNN | Discrimination between HR (PHRI < 1) and non-remission (PHRI ≥ 1) based on the presence of neutrophils (sensitivity = 78.0%, specificity = 91.7%, accuracy = 86.0%, ICC = 0.840) |
| Ohara et al[71] | UC/114 patients | Retrospective, single center | 2300 WSIs | 114 biopsies | DL | Rate of relapse higher for GCR ≤ 12% compared to GCR > 12% (45.0% vs 6.5%, P < 0.010) |
| Najdawi et al[68] | UC/577 patients | Retrospective, single center | 512 WSIs | 308 WSIs | CNN-RFC | HR prediction (NHI ≤ 1, accuracy = 97.0%) was comparable to expert pathologist assessments (κ = 0.910, Spearman’s correlation ρ = 0.890, P < 0.010) |
| Iacucci et al[70] | UC/273 patients | Prospective, multicenter | 118 biopsies | 375 biopsies (1); 154 biopsies (2) | CAD-CNN | (1) Performance in distinguishing HR from active inflammation: RHI (AUC = 0.850, accuracy = 80.0%, sensitivity = 94.0%, specificity = 76.0%); NHI (AUC = 0.860, accuracy = 81.0%, sensitivity = 89.0%, specificity = 79.0%); PHRI (AUC = 0.870, accuracy = 87.0%, sensitivity = 89.0%, specificity = 85.0%); and (2) The hazard ratio for disease recurrence according to PHRI was higher for AI assessment compared with pathologist evaluation (4.64 vs 3.56, P < 0.001) |
| Peyrin-Biroulet et al[73] | UC/NR | Retrospective, single center | 160 WSIs | 40 WSIs | CNN | The average ICC between histopathologists and the AI tool for histological assessment based on NHI (ICC = 0.872) |
| Ohara et al[72] | UC/96 patients | Prospective, single center | 11260 patches | 135 WSIs | DL | Histological evaluation based on neutrophil quantification in WSI (accuracy = 77.0%, F1 score = 79.0%). Prediction of histological scores (PHRI, NHI) by AI showed strong correlation with pathological diagnoses (Spearman’s ρ = 0.680-0.800, P < 0.050) |
| Klein et al[78] | CD/105 patients | Retrospective, single center | Biopsies NR | Biopsies NR | NNET | Differentiation of clinical phenotypes (sensitivity = 78.0%, specificity = 77.0%). Prediction of surgical intervention (sensitivity = 80.0%, specificity = 91.0%) |
| Kiyokawa et al[76] | CD/68 patients | Retrospective, single center | 619464 tile images | 308705 tile images | CNN | Adipocyte shrinkage and increased mast cell infiltration in sub-serosal adipose tissue anticipate postoperative recurrence (AUC = 0.995, accuracy = 96.9%, precision = 96.4%, sensitivity = 96.5%) |
| Wang et al[77] | CD/205 patients | Retrospective, multicenter | 310 WSIs | 278 WSIs | DL-CNN | Severity of myenteric plexitis (accuracy = 83.3%) and postoperative recurrence prediction (AUC = 0.980) |
| Rymarczyk et al[79] | UC/887 patients; CD/302 patients | Retrospective, multicenter | 2696 biopsies | 800 biopsies | RNN, FV + RF, SA-AbMILP | SA-AbMILP for automated histological assessment of: GS: Accuracy: 65.0%-85.0% (κ = 0.440-0.680); GHAS: Colon accuracy: 80.0%-89.0% (κ = 0.540-0.650); Ileum accuracy 65.0%-82.0% (κ = 0.460-0.670) |
| Furlanello et al[74] | IBD/52 patients | Prospective, single center | 4981 WSIs | 356 biopsies | DL | Automated quantification of basal plasmacytosis discriminates IBD from non-IBD (accuracy = 90.0%) |
Table 3 Artificial intelligence-assisted prediction of therapeutic response in inflammatory bowel diseases
| Ref. | Disease/number of patients | Study design | Therapy | AI/model | Main findings |
| --- | --- | --- | --- | --- | --- |
| Waljee et al[91] | UC/491 patients | Retrospective, multicenter | VDZ | ML-RF | Long-term steroid-free ER prediction (AUC = 0.730) using laboratory data from first 6 weeks of VDZ |
| Waljee et al[88] | CD/401 patients | Retrospective, multicenter | UST | ML-RF | Week 8 CRP/ALB ratio predicts UST non-response (AUC = 0.780, sensitivity = 79.0%, specificity = 67.0%) vs baseline data (AUC = 0.590, sensitivity = 63.0%, specificity = 64.0%) |
| Con et al[81] | CD/146 patients | Retrospective, single center | IFX, ADA | DL-RNN | AI model using CRP < 5 mg/L better predicts post-therapy remission than conventional model (AUC: 0.754 vs 0.659, P = 0.036) |
| Li et al[92] | CD/174 patients | Retrospective, single center | IFX | ML-RF | Response to IFX predicted by clinical/serological data (AUC = 0.900, accuracy = 85.0%, sensitivity = 81.0%, specificity = 94.0%) |
| He et al[93] | CD/86 patients | Retrospective, single center | UST | ML | UST response prediction based on HSD3B1, MUC4, CF1, and CCL11 expression (AUC: 0.734-0.746) |
| Park et al[94] | CD/234 patients | Prospective, multicenter | anti-TNF | ML | The likelihood of a non-durable response associated with hyperexpression of DPY19L3 (β = 2.703) and GSTT1 (β = 1.735), and decreasing NUCB1 concentration (β = -2.142) |
| Kellerman et al[33] | CD/101 patients | Retrospective, single center | ADA, IFX, VDZ | DL-TimeSformer | Prediction of biologic initiation in newly diagnosed patients (AUC = 0.860, accuracy: 81.0%-82.0%), outperforming human reader (AUC = 0.700) and FC (AUC = 0.740) |
| Iacucci et al[36] | IBD/29 patients | Prospective, single center | anti-TNF, anti-α4β7 | CAD | pCLE-detected crypt/vessel abnormalities and fluorescein leakage predict therapy response in UC (AUC = 0.930, accuracy = 85.0%) and CD (AUC = 0.790, accuracy = 80.0%); better anti-TNF prediction in UC (AUC = 0.830) than in CD (AUC = 0.580) |
| Stidham et al[31] | UC/748 patients with induction; 348 patients with maintenance | Prospective, single center | UST | CAD | CDSs were significantly lower in UST vs placebo both at week 8 (141.9 vs 184.3, P < 0.0001) and week 44 (78.2 vs 151.5, P < 0.0001). Stratification by baseline CDS showed increased UST efficacy in patients with severe disease compared with mild disease (-85.0 vs -55.4, P < 0.0001) |
| Harun et al[87] | UC/1684 patients with induction; 463 patients with maintenance | Prospective, multicenter | Etrolizumab | ML-SHAP | Remission prediction post-induction (AUC = 0.740) and maintenance (AUC = 0.750) using combined demographic, clinical, physiologic, and histological data |
| Qiu et al[89] | CD/746 patients | Retrospective, single center | IFX | ML-SHAP | Response discrimination using integrated predictors (HB, WBC, ESR, ALB, PLT, age at diagnosis, Montreal classification) (training set AUC = 0.910, si-test set AUC = 0.710) |
Table 4 Overview of machine learning models for diagnosis, prognosis, and treatment optimization in inflammatory bowel disease
| Ref. | AI model | Field of application | Disease | Outcomes | Performance |
| --- | --- | --- | --- | --- | --- |
| Najdawi et al[68] | ML-RF | Histological assessment | UC | Evaluation of HR | Strong agreement with pathologists on the NHI score (κ = 0.910, Spearman’s ρ = 0.890, P < 0.001) |
| Waljee et al[91] | | Therapy | UC | Predicting corticosteroid-free ER with VDZ at week 52 | AUC = 0.730, sensitivity = 72.0%, specificity = 68.0% according to the results at week 6 |
| Waljee et al[88] | | Therapy | CD | Anticipation of UST response at week 42 | AUC = 0.780, sensitivity = 79.0%, specificity = 67.0% based on demographic and laboratory data up to week 8 |
| Li et al[92] | | | | Assessment of therapeutic response to IFX | AUC = 0.900, accuracy = 85.0%, sensitivity = 81.0%, specificity = 94.0% |
| He et al[93] | | | | Prediction of therapeutic response to UST based on expression profile of four genes | AUC: 0.734-0.746 |
| Stidham et al[103] | | Risk stratification | CD | Evaluation of surgical outcomes | AUC = 0.780 |
| Maeda et al[90] | ML-SVM | Endoscopic assessment | UC | Evaluation of persistent inflammation | Sensitivity = 74.0%, specificity = 97.0%, precision = 91.0% |
| | | Risk stratification | UC | Assessment of relapse risk | Increased rate in patients with active form (28.4%) compared with those in clinical remission (4.9%, P < 0.001) |
| Harun et al[87] | ML-XGBoost | Therapy | UC | Remission prediction post-induction and maintenance for etrolizumab | AUC: 0.740-0.750 |
| Park et al[94] | | Therapy | CD | Prediction of therapeutic response to anti-TNF | Non-response associated with hyperexpression of DPY19L3 (β = 2.703) and GSTT1 (β = 1.735), and decreased NUCB1 concentration (β = -2.142) |
| Qiu et al[89] | | Therapy | CD | Prediction of therapeutic response to IFX | AUC = 0.910 |
| Takenaka et al[44] | DL-DNN | Endoscopic assessment | UC | Prediction of HR | Sensitivity = 97.9%, specificity = 94.6%, ICC = 0.927 |
| Huang et al[43] | | Endoscopic assessment | UC | Evaluation of mucosal healing | AUC = 0.927, accuracy = 93.8%, sensitivity = 84.6%, specificity = 96.9% |
| Klang et al[32] | | Endoscopic assessment | CD | Identification of strictures | AUC = 0.989, precision = 93.5% |
| | | | | Grading the severity of ulcerations | AUC = 0.992 (mild cases); AUC = 0.975 (moderate cases); AUC = 0.889 (severe cases) |
| Ozawa et al[40] | DL-CNN | Endoscopic assessment | UC | Discrimination between MES ≤ 1 and MES 2; diagnosis of ER (MES ≤ 1) | AUC = 0.980 |
| Wang et al[21] | | Endoscopic assessment | UC | Discrimination between MES ≤ 1 and MES 2; diagnosis of ER (MES ≤ 1) | AUC = 0.980, accuracy = 95.1%, sensitivity = 92.9%, specificity = 95.4%, κ = 0.884 |
| Stidham et al[22] | | Endoscopic assessment | UC | Discriminating ER from active endoscopic disease | AUC = 0.966, sensitivity = 83.0%, specificity = 96.0%. Excellent agreement between expert reviewers (κ = 0.860) |
| Gottlieb et al[30] | | Endoscopic assessment | UC | Evaluation of mucosal healing | Accuracy: 95.5%-97.0%. Agreement with expert readers for MES (QWK = 0.844) and UCEIS (QWK = 0.855) |
| Takenaka et al[25] | | Endoscopic assessment | UC | Prediction of ER and HR | Accuracy = 90.1%, ICC = 0.917 (UCEIS ≤ 2). Accuracy = 92.9%, κ = 0.859 (GS < 3.1) |
| Yao et al[27] | DL-CNN (Inception-V3) | Endoscopic assessment | UC | Assessment of disease severity | AUC = 0.939, sensitivity = 90.2%, specificity = 87.0% |
| Gui et al[69] | DL-CNN | Histological assessment | UC | Prediction of HR (PHRI < 1) according to the presence or absence of neutrophils | Sensitivity = 78.0%, specificity = 91.7%, accuracy = 86.0%, ICC = 0.840 |
| Iacucci et al[70] | | Histological assessment | UC | Prediction of HR (PHRI < 1) according to the presence or absence of neutrophils | AUC = 0.870, accuracy = 87.0%, sensitivity = 89.0%, specificity = 85.0% |
| Vande Casteele et al[66] | | Histological assessment | UC | Quantification of eosinophils in colonic biopsies | The model had sensitivity = 0.86, specificity = 0.91, accuracy = 0.89 |
| Udristoiu et al[55] | DL-CNN | Endoscopic assessment | CD | Differentiation between inflammation and intact colonic mucosa | AUC = 0.980, accuracy = 95.3%, specificity = 92.8%, sensitivity = 94.6% |
| Majtner et al[54] | DL-CNN (ResNet-50) | Endoscopic assessment | CD | Ulcer detection | The diagnostic accuracy was 98.5% for the small bowel and 98.1% for the colon |
| Kellerman et al[33] | DL-TimeSformer | Endoscopic assessment | CD | Prediction of biologic initiation in newly diagnosed patients | AUC = 0.860, accuracy: 81.0%-82.0% |
| Rymarczyk et al[79] | DL-CNN (SA-AbMILP) | Histological assessment | CD | Automatic histological assessment for GHAS and GS | Accuracy: 65.0%-89.0% |
| Furlanello et al[74] | DL-CNN (StarDist) | Histological assessment | IBD | Discriminates IBD from non-IBD mucosa | Accuracy = 90.0% |
| Kiyokawa et al[76] | DL-CNN (EfficientNet-b5) | Risk stratification | CD | Prediction of postoperative recurrence | AUC = 0.995, accuracy = 96.9%, precision = 96.4% |
| Con et al[81] | DL-RNN | Therapy | CD | Prediction of post-therapy remission with anti-TNF | AUC = 0.754 |
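
The tables report sensitivity, specificity, accuracy, and quadratic weighted kappa (QWK) without stating how they are computed. The following is a minimal sketch of the standard definitions, not the code used by any cited study. The function names, the counts, and the grade lists are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: every count and grade below is made up,
# not taken from any study in the tables.

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among all diseased
        "specificity": tn / (tn + fp),   # true negatives among all healthy
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Agreement between two raters on an ordinal scale 0..n_categories-1.

    Disagreements are penalised by the squared distance between grades,
    so confusing grade 0 with grade 3 costs more than 0 with 1.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")

    k, n = n_categories, len(rater_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical binary task, e.g. remission vs active disease.
    print(binary_metrics(tp=83, fp=4, tn=96, fn=17))

    # Hypothetical ordinal grades on a 0-3 scale for eight videos.
    model_grades = [0, 1, 2, 3, 1, 2, 0, 3]
    expert_grades = [0, 1, 2, 3, 2, 2, 0, 2]
    print(quadratic_weighted_kappa(model_grades, expert_grades, n_categories=4))
```

Because accuracy depends on how many diseased and healthy cases are in the test set, two models with the same sensitivity and specificity can report different accuracies, which is worth keeping in mind when comparing rows across studies.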