Copyright
© The Author(s) 2025.
World J Gastroenterol. Nov 7, 2025; 31(41): 111184
Published online Nov 7, 2025. doi: 10.3748/wjg.v31.i41.111184
Table 1 Details of the self-collected dataset
| Aspect | Details |
| Source hospitals | Kiang Wu Hospital, Macao; Xiangyang Central Hospital, Xiangyang, Hubei Province, China |
| Data collection period | 2019-2024 |
| Total images | 3313 endoscopic images |
| Imaging modality | White light endoscopy (majority); narrow band imaging (subset) |
| Imaging equipment | Olympus EVIS X1 CV-1500 with GIF EZ1500 gastroscopes (Kiang Wu Hospital); Olympus CF-HQ290-I and PENTAX EG29-i10 gastroscopes (Xiangyang Central Hospital) |
| Disease classes | Normal: 1014 images; esophageal neoplasm: 256 images; esophageal varices: 228 images; GERD: 143 images; gastric neoplasm: 486 images; gastric polyp: 526 images; gastric ulcer: 366 images; gastric varices: 83 images; duodenal ulcer: 211 images |
| Data acquisition | Images were retrieved from the two hospital databases: (1) Normal images were selected from chronic superficial gastritis cases with a visually normal appearance on white light endoscopy and no significant pathological findings; and (2) Disease images were identified using international classification of diseases codes under the supervision of gastroenterologists and biomedical engineering postgraduate students |
| Annotation process | Disease labels were verified by gastroenterologists; mask annotations were created using Anylabeling by a PhD student and a post-doctoral researcher in biomedical engineering under the supervision of a gastroenterologist; annotations were exported as JSON and converted to binary PNG masks; masks were independently reviewed by an experienced gastroenterologist |
| Ethical approval identifiers | Medical Ethics Committee of Xiangyang Central Hospital (No. 2024-145); and Medical Ethics Committee of Kiang Wu Hospital, Macao (No. 2019-005) |
| Compliance | Conducted in accordance with the Declaration of Helsinki |
| Pre-processing pipeline | Black borders and text metadata are removed; images are cropped to resolutions ranging from 268 × 217 to 1545 × 1156 pixels; images are divided into a training-and-validation set (80%) and a test set (20%), with the training-and-validation set further split into training (80%) and validation (20%) subsets; multi-class segmentation masks are generated with nine channels: In normal images, all pixels are assigned to the “normal” channel, while in disease images, non-lesion/background areas are assigned to the “normal” channel and lesion areas to their respective disease channels |
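The split-and-mask steps in the pre-processing row lend themselves to a short illustration. The following is a minimal Python sketch, assuming pre-cropped images in `image_paths` and a per-pixel integer label map with 0 = normal and 1-8 = the eight disease classes; the variable names, label encoding, and random seed are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80/20 split into training-and-validation vs test, then a further
# 80/20 split of training-and-validation into training vs validation.
trainval, test = train_test_split(image_paths, test_size=0.20, random_state=0)
train, val = train_test_split(trainval, test_size=0.20, random_state=0)

NUM_CHANNELS = 9  # "normal" channel + 8 disease channels (Table 1)

def build_multiclass_mask(label_map: np.ndarray) -> np.ndarray:
    """Turn an (H, W) integer label map into a (9, H, W) one-hot mask.

    Normal images have an all-zero label map, so every pixel lands in
    the "normal" channel; in disease images, non-lesion pixels stay in
    channel 0 and lesion pixels go to their disease channel.
    """
    mask = np.zeros((NUM_CHANNELS, *label_map.shape), dtype=np.uint8)
    for c in range(NUM_CHANNELS):
        mask[c] = label_map == c
    return mask
```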
Table 2 Details of the EDD2020 dataset
| Aspect | Details |
| Source hospitals | Ambroise Paré Hospital, France; Centro Riferimento Oncologico IRCCS, Italy; Istituto Oncologico Veneto, Italy; John Radcliffe Hospital, United Kingdom |
| Total images | 386 still images with 502 segmentation masks |
| Disease classes | Barrett’s esophagus: 160 masks; suspicious precancerous lesions: 88 masks; high-grade dysplasia: 74 masks; cancer: 53 masks; polyps: 127 masks |
| Annotation process | Performed by two clinical experts and two post-doctoral researchers using the open-source VGG image annotator annotation tool |
| Pre-processing pipeline | Stratified split into training (81%), validation (9%), and test (10%) sets using scikit-learn[34]; multi-class segmentation masks are generated with six channels (five disease classes + background): Non-lesion/background areas in disease images are assigned to the background channel, while lesions are assigned to their respective disease channels |
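Because 10% of the remaining 90% equals 9% of the whole, the 81/9/10 stratified split can be reproduced with two chained `train_test_split` calls. A minimal sketch, assuming one representative class label per image (`image_ids` and `labels` are hypothetical names; since EDD2020 images can carry several masks, single-label stratification is a simplification):

```python
from sklearn.model_selection import train_test_split

# Hold out 10% for test, stratified by class label.
trainval_ids, test_ids, trainval_labels, _ = train_test_split(
    image_ids, labels, test_size=0.10, stratify=labels, random_state=0)

# 10% of the remaining 90% = 9% of the whole dataset for validation.
train_ids, val_ids = train_test_split(
    trainval_ids, test_size=0.10, stratify=trainval_labels, random_state=0)
```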
Table 3 Distribution of disease labels across dataset splits
| Disease class | Training | Validation | Test | Total |
| BE | 130 | 14 | 16 | 160 |
| HGD | 61 | 6 | 7 | 74 |
| Cancer | 43 | 5 | 5 | 53 |
| Polyp | 103 | 11 | 13 | 127 |
| Suspicious | 68 | 11 | 9 | 88 |
| Total | 405 | 47 | 50 | 502 |
Table 4 Model selection for this comparative study
| Architecture type | Model name | Encoder size | Applications |
| CNN-based | U-Net[36] | NA | General segmentation, biomedical segmentation |
| | ResNet[37] + U-Net | ResNet50¹ | General segmentation, biomedical segmentation |
| | ConvNeXt[38] + UPerNet[39] | ConvNeXt-T¹ | General segmentation |
| | M2SNet[40] | Res2Net50-v1b-26w-4s¹[41] | Medical segmentation |
| | Dilated SegNet[42] | ResNet50¹ | Polyp segmentation |
| | PraNet[27] | Res2Net50-v1b-26w-4s¹ | Polyp segmentation |
| Transformer-based | SwinV2[43] + UPerNet | SwinV2-T¹ | General segmentation |
| | SETR-MLA[44] | ViT-B-16¹[22] | General segmentation |
| | SegFormer[45] | MiT-B2¹ | General segmentation |
| | TransUNet[46] | ResNet50¹ and ViT-B-16¹ | Medical segmentation |
| | PVTV2[47] + EMCAD[48] | PVTV2-B2¹ | Medical segmentation |
| | FCBFormer[26] | PVTV2-B2¹ | Polyp segmentation |
| Mamba-based | Swin-UMamba[49] | VSSM-encoder¹ | Medical segmentation |
| | Swin-UMamba-D[49] | VSSM-encoder¹ | Medical segmentation |
| | U-Mamba-Bot[50] | NA | Biomedical segmentation |
| | U-Mamba-Enc[50] | NA | Biomedical segmentation |
| | VM-UNETV2[51] | VM-UNET-encoder¹ | Medical segmentation |
Table 5 Performance metrics of different segmentation models on the self-collected dataset and the EDD2020 dataset, mean ± SD
| Model | Encoder size | Self-collected dataset | | | EDD2020 dataset | | |
| | | Pixel accuracy (%) | IoU (%) | Dice score (%) | Pixel accuracy (%) | IoU (%) | Dice score (%) |
| CNN-based | | | | | | | |
| U-Net | NA | 89.22 ± 0.16 | 82.14 ± 0.21 | 88.47 ± 0.17 | 93.41 ± 0.07 | 67.63 ± 0.48 | 79.37 ± 0.44 |
| ResNet + U-Net | ResNet50¹ | 92.20 ± 0.14 | 87.30 ± 0.30 | 91.95 ± 0.19 | 94.52 ± 0.31 | 73.97 ± 1.29 | 83.59 ± 1.05 |
| ConvNeXt + UPerNet | ConvNeXt-T¹ | 93.05 ± 0.14 | 88.48 ± 0.09 | 92.76 ± 0.10 | 95.17 ± 0.13 | 76.90 ± 0.61 | 85.65 ± 0.48 |
| M2SNet | Res2Net50-v1b-26w-4s¹ | 92.17 ± 0.17 | 86.93 ± 0.32 | 91.72 ± 0.26 | 94.80 ± 0.22 | 74.81 ± 1.01 | 84.24 ± 0.79 |
| Dilated SegNet | ResNet50¹ | 92.43 ± 0.35 | 87.47 ± 0.51 | 92.04 ± 0.32 | 94.46 ± 0.24 | 73.64 ± 0.88 | 83.35 ± 0.78 |
| PraNet | Res2Net50-v1b-26w-4s¹ | 92.38 ± 0.29 | 86.35 ± 0.43 | 91.31 ± 0.31 | 94.48 ± 0.07 | 61.12 ± 0.41 | 74.15 ± 0.34 |
| Transformer-based | | | | | | | |
| SwinV2 + UPerNet | SwinV2-T¹ | 93.15 ± 0.11 | 88.50 ± 0.18 | 92.84 ± 0.12 | 95.18 ± 0.23 | 76.97 ± 0.89 | 85.59 ± 0.65 |
| SegFormer | MiT-B2¹ | 93.39 ± 0.23² | 88.94 ± 0.38 | 93.14 ± 0.27 | 95.25 ± 0.31 | 77.20 ± 0.98 | 85.90 ± 0.76 |
| SETR-MLA | ViT-B-16¹ | 90.09 ± 0.43 | 83.37 ± 0.24 | 89.19 ± 0.37 | 94.17 ± 0.36 | 71.48 ± 1.43 | 82.14 ± 1.23 |
| TransUNet | ResNet50¹ and ViT-B-16¹ | 90.67 ± 0.35 | 84.55 ± 0.55 | 90.02 ± 0.40 | 92.74 ± 0.18 | 65.06 ± 1.33 | 77.35 ± 1.16 |
| EMCAD | PVTV2-B2¹ | 93.33 ± 0.19 | 88.74 ± 0.22 | 93.01 ± 0.16 | 95.22 ± 0.22 | 77.07 ± 0.91 | 85.81 ± 0.67 |
| FCBFormer | PVTV2-B2¹ | 93.04 ± 0.21 | 87.96 ± 0.34 | 92.48 ± 0.28 | 95.04 ± 0.15 | 76.03 ± 0.70 | 85.18 ± 0.45 |
| Mamba-based | | | | | | | |
| Swin-UMamba | VSSM-encoder¹ | 92.57 ± 0.10 | 87.78 ± 0.10 | 92.20 ± 0.06 | 93.81 ± 0.30 | 71.12 ± 1.26 | 81.23 ± 0.86 |
| Swin-UMamba-D | VSSM-D-encoder¹ | 93.39 ± 0.19 | 89.06 ± 0.20² | 93.19 ± 0.14² | 95.37 ± 0.09² | 77.53 ± 0.32² | 86.15 ± 0.22² |
| UMamba-Bot | NA | 88.47 ± 0.20 | 81.43 ± 0.14 | 87.38 ± 0.22 | 92.04 ± 0.14 | 61.79 ± 0.86 | 75.03 ± 0.80 |
| UMamba-Enc | NA | 87.72 ± 0.11 | 80.74 ± 0.13 | 86.64 ± 0.16 | 92.12 ± 0.29 | 61.82 ± 0.56 | 75.24 ± 0.60 |
| VM-UNETV2 | VM-UNET-encoder¹ | 93.09 ± 0.29 | 88.38 ± 0.34 | 92.76 ± 0.33 | 94.89 ± 0.14 | 74.89 ± 0.47 | 84.36 ± 0.42 |
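For reference, the three metrics reported above can be computed from predicted and ground-truth label maps as follows. This is a minimal sketch that macro-averages IoU and Dice over the classes present in either map; the paper's exact averaging scheme is not stated in the table, so treat that choice as an assumption.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, target: np.ndarray, num_classes: int):
    """Pixel accuracy plus class-averaged IoU and Dice for (H, W) label maps."""
    pixel_acc = float((pred == target).mean())
    ious, dices = [], []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        if union == 0:  # class absent from both maps; skip it
            continue
        ious.append(inter / union)
        dices.append(2 * inter / (p.sum() + t.sum()))
    return pixel_acc, float(np.mean(ious)), float(np.mean(dices))
```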
Table 6 Comparison of efficiency metrics of different models across architectural paradigms, mean ± SD
| Model | Parameters (M) | FLOPs (G) | GPU usage (GB) | Mean training time (minute) | | Mean inference time (ms) | | FPS |
| | | | | Self-collected | EDD2020 | CPU | GPU | |
| CNN-based | | | | | | | | |
| U-Net | 31.46 | 36.95 | 3.30 | 82.79 ± 0.25 | 10.35 ± 0.03 | 215.35 ± 37.50 | 3.64 ± 6.32¹ | 274.73¹ |
| ResNet + U-Net | 32.52 | 8.23 | 2.00 | 57.03 ± 0.35¹ | 6.69 ± 0.12¹ | 62.67 ± 6.32 | 5.18 ± 0.35 | 193.05 |
| ConvNeXt + UPerNet | 41.37 | 16.71 | 2.50 | 77.63 ± 0.16 | 8.84 ± 0.10 | 102.97 ± 8.35 | 5.65 ± 0.40 | 176.99 |
| M2SNet | 29.89 | 13.50 | 2.50 | 89.83 ± 1.75 | 9.15 ± 0.07 | 101.97 ± 6.52 | 14.86 ± 1.21 | 67.29 |
| Dilated SegNet | 18.11¹ | 20.72 | 3.20 | 98.38 ± 0.87 | 10.71 ± 0.04 | 146.60 ± 9.76 | 9.23 ± 1.24 | 108.34 |
| PraNet | 32.56 | 5.30 | 1.80¹ | 84.06 ± 0.22 | 8.18 ± 0.10 | 65.78 ± 7.19 | 12.16 ± 0.78 | 82.24 |
| Transformer-based | | | | | | | | |
| SwinV2 + UPerNet | 41.91 | 17.19 | 2.70 | 89.07 ± 0.98 | 8.96 ± 0.09 | 126.12 ± 13.56 | 12.19 ± 0.65 | 82.03 |
| SegFormer | 24.73 | 4.23 | 2.00 | 71.93 ± 0.11 | 8.04 ± 0.19 | 61.48 ± 5.74 | 9.68 ± 0.75 | 103.31 |
| SETR-MLA | 90.77 | 18.60 | 3.00 | 71.40 ± 1.13 | 9.18 ± 0.17 | 109.13 ± 4.38 | 5.55 ± 0.52 | 180.18 |
| TransUNet | 105.00 | 29.33 | 4.50 | 107.16 ± 0.84 | 12.87 ± 0.21 | 204.71 ± 25.55 | 13.03 ± 0.71 | 76.75 |
| PVTV2 + EMCAD | 26.77 | 4.43 | 2.50 | 85.03 ± 1.18 | 9.08 ± 0.12 | 78.35 ± 9.39 | 12.56 ± 1.99 | 79.62 |
| FCBFormer | 33.09 | 29.98 | 8.10 | 163.76 ± 0.56 | 18.44 ± 0.13 | 305.01 ± 34.36 | 21.43 ± 3.49 | 46.66 |
| Mamba-based | | | | | | | | |
| Swin-UMamba | 59.89 | 31.46 | 6.00 | 162.35 ± 0.57 | 20.68 ± 0.49 | NA | 13.00 ± 0.68 | 76.92 |
| Swin-UMamba-D | 27.50 | 6.10 | 5.30 | 148.57 ± 0.68 | 17.46 ± 0.13 | NA | 12.97 ± 1.53 | 77.10 |
| UMamba-Bot | 28.77 | 18.68 | 2.90 | 91.31 ± 0.15 | 11.03 ± 0.06 | NA | 6.27 ± 0.54 | 159.49 |
| UMamba-Enc | 27.56 | 19.05 | 3.10 | 97.68 ± 0.33 | 12.07 ± 0.04 | NA | 7.28 ± 0.45 | 137.36 |
| VM-UNETV2 | 22.77 | 4.07¹ | 3.20 | 108.95 ± 0.23 | 12.12 ± 0.21 | NA | 12.90 ± 7.27 | 77.52 |
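The FPS column is consistent with 1000 divided by the mean GPU inference time (e.g., U-Net: 1000/3.64 ≈ 274.7). A minimal PyTorch timing sketch follows; the input resolution, warm-up count, and run count are illustrative assumptions, and FLOPs would come from a separate profiler (e.g., fvcore or ptflops) rather than this snippet.

```python
import time
import torch

def efficiency_stats(model: torch.nn.Module, input_size=(1, 3, 352, 352),
                     device="cuda", warmup=10, runs=100):
    """Parameter count (M), mean GPU inference time (ms), and FPS."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):  # warm-up passes, excluded from timing
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
        end = time.perf_counter()
    ms = (end - start) / runs * 1000
    return params_m, ms, 1000.0 / ms
```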
Table 7 Performance and efficiency comparison of different models across architectural paradigms, mean ± SD
| Model | Parameters (M) | FLOPs (G) | Mean inference time on GPU (ms) | IoU (%) | | Average IoU (%) | PET score (%) |
| | | | | Self-collected | EDD2020 | | |
| CNN-based | | | | | | | |
| U-Net | 31.46 | 36.95 | 3.64 ± 6.32¹ | 82.14 ± 0.21 | 67.63 ± 0.48 | 74.88 | 45.74 |
| ResNet + U-Net | 32.52 | 8.23 | 5.18 ± 0.35 | 87.30 ± 0.30 | 73.97 ± 1.29 | 80.63 | 82.58 |
| ConvNeXt + UPerNet | 41.37 | 16.71 | 5.65 ± 0.40 | 88.48 ± 0.09 | 76.90 ± 0.61 | 82.69 | 84.70 |
| M2SNet | 29.89 | 13.50 | 14.86 ± 1.21 | 86.93 ± 0.32 | 74.81 ± 1.01 | 80.87 | 72.33 |
| Dilated SegNet | 18.11¹ | 20.72 | 9.23 ± 1.24 | 87.47 ± 0.51 | 73.64 ± 0.88 | 80.55 | 74.88 |
| PraNet | 32.56 | 5.30 | 12.16 ± 0.78 | 86.35 ± 0.43 | 61.12 ± 0.41 | 73.74 | 48.81 |
| Transformer-based | | | | | | | |
| SwinV2 + UPerNet | 41.91 | 17.19 | 12.19 ± 0.65 | 88.50 ± 0.18 | 76.97 ± 0.89 | 82.74 | 78.41 |
| SegFormer | 24.73 | 4.23 | 9.68 ± 0.75 | 88.94 ± 0.38 | 77.20 ± 0.98 | 82.86 | 92.02¹ |
| SETR-MLA | 90.77 | 18.60 | 5.55 ± 0.52 | 83.37 ± 0.24 | 71.48 ± 1.43 | 77.42 | 52.45 |
| TransUNet | 105.00 | 29.33 | 13.03 ± 0.71 | 84.55 ± 0.55 | 65.06 ± 1.33 | 74.81 | 26.39 |
| PVTV2 + EMCAD | 26.77 | 4.43 | 12.56 ± 1.99 | 88.74 ± 0.22 | 77.07 ± 0.91 | 82.91 | 88.14 |
| FCBFormer | 33.09 | 29.98 | 21.43 ± 3.49 | 87.96 ± 0.34 | 76.03 ± 0.70 | 82.00 | 61.89 |
| Mamba-based | | | | | | | |
| Swin-UMamba | 59.89 | 31.46 | 13.00 ± 0.68 | 87.78 ± 0.10 | 71.12 ± 1.26 | 79.45 | 53.31 |
| Swin-UMamba-D | 27.50 | 6.10 | 12.97 ± 1.53 | 89.06 ± 0.20¹ | 77.53 ± 0.32¹ | 83.29¹ | 88.39 |
| UMamba-Bot | 28.77 | 18.68 | 6.27 ± 0.54 | 81.43 ± 0.14 | 61.79 ± 0.86 | 71.61 | 39.45 |
| UMamba-Enc | 27.56 | 19.05 | 7.28 ± 0.45 | 80.74 ± 0.13 | 61.82 ± 0.56 | 71.28 | 37.15 |
| VM-UNETV2 | 22.77 | 4.07¹ | 12.90 ± 7.27 | 88.38 ± 0.34 | 74.89 ± 0.47 | 81.63 | 83.48 |
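The Average IoU column agrees with the plain mean of the two per-dataset IoU columns, as the check below illustrates. The PET score evidently folds efficiency into the comparison as well, but its formula is not given in this table, so no attempt is made to reproduce it here.

```python
def average_iou(iou_self: float, iou_edd: float) -> float:
    """Mean of the per-dataset IoU values, as reported in Table 7."""
    return (iou_self + iou_edd) / 2

# U-Net row: (82.14 + 67.63) / 2 = 74.885, reported as 74.88.
assert abs(average_iou(82.14, 67.63) - 74.88) < 0.01
```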
Table 8 Statistical comparison of different architecture types across the two datasets
| Architecture type | Models (n) | Mean IoU (%) (95%CI) | Pairwise comparisons |
| Self-collected dataset | | | |
| CNN-based | 6 | 86.44 (84.11, 88.77) | vs T: T = -0.43, P = 0.68, d = 0.25; vs M: T = 0.50, P = 0.63, d = -0.30 |
| Transformer-based | 6 | 87.01 (84.48, 89.55) | vs C: T = 0.43, P = 0.68, d = 0.25; vs M: T = 0.78, P = 0.46, d = -0.46 |
| Mamba-based | 5 | 85.48 (80.46, 90.50) | vs C: T = -0.50, P = 0.63, d = -0.30; vs T: T = -0.78, P = 0.46, d = -0.46 |
| EDD2020 dataset | | | |
| CNN-based | 6 | 71.35 (65.16, 77.53) | vs T: T = -0.84, P = 0.42, d = 0.49; vs M: T = 0.48, P = 0.64, d = -0.29 |
| Transformer-based | 6 | 73.97 (68.85, 79.08) | vs C: T = 0.84, P = 0.42, d = 0.49; vs M: T = 1.23, P = 0.25, d = -0.73 |
| Mamba-based | 5 | 69.43 (60.34, 78.52) | vs C: T = -0.48, P = 0.64, d = -0.29; vs T: T = -1.23, P = 0.25, d = -0.73 |
C: CNN-based; T: Transformer-based; M: Mamba-based.
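The pairwise statistics above can be reproduced from the per-model IoU means in Table 5. The sketch below uses a pooled-variance two-sample t-test, which matches the reported CNN-vs-Transformer comparison on the self-collected dataset (T ≈ -0.43); whether the authors used this exact test, and the sign convention of their Cohen's d, are assumptions.

```python
import numpy as np
from scipy import stats

def compare_groups(a, b):
    """Two-sample t-test and Cohen's d for two sets of per-model IoU scores."""
    a, b = np.asarray(a), np.asarray(b)
    t, p = stats.ttest_ind(a, b)  # pooled-variance (Student's) t-test
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                         (len(b) - 1) * b.var(ddof=1)) /
                        (len(a) + len(b) - 2))
    return t, p, (a.mean() - b.mean()) / pooled_sd

# Per-model IoU (%) on the self-collected dataset, from Table 5.
cnn = [82.14, 87.30, 88.48, 86.93, 87.47, 86.35]
transformer = [88.50, 88.94, 83.37, 84.55, 88.74, 87.96]
print(compare_groups(cnn, transformer))  # t ≈ -0.43, P ≈ 0.68
```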
Table 9 Performance comparison of different models evaluated on the self-collected dataset and through cross-validation on the test split of the EDD2020 dataset, mean ± SD
| Model | Self-collected dataset | | | Cross-validation on EDD2020 dataset | | | GRR (%) |
| | Pixel accuracy (%) | IoU (%) | Dice score (%) | Pixel accuracy (%) | IoU (%) | Dice score (%) | |
| CNN-based | | | | | | | |
| U-Net | 89.22 ± 0.16 | 82.14 ± 0.21 | 88.47 ± 0.17 | 67.77 ± 1.17 | 53.72 ± 1.28 | 67.77 ± 1.17 | 65.41 |
| ResNet + U-Net | 92.20 ± 0.14 | 87.30 ± 0.30 | 91.95 ± 0.19 | 71.62 ± 1.09 | 58.52 ± 1.49 | 71.62 ± 1.09 | 67.03 |
| ConvNeXt + UPerNet | 93.05 ± 0.14 | 88.48 ± 0.09 | 92.76 ± 0.10 | 73.60 ± 0.42 | 60.87 ± 0.72 | 73.60 ± 0.42 | 68.79 |
| M2SNet | 92.17 ± 0.17 | 86.93 ± 0.32 | 91.72 ± 0.26 | 71.98 ± 0.24 | 58.80 ± 0.23 | 71.98 ± 0.24 | 67.64 |
| Dilated SegNet | 92.43 ± 0.35 | 87.47 ± 0.51 | 92.04 ± 0.32 | 72.26 ± 0.38 | 59.54 ± 0.52 | 72.26 ± 0.38 | 68.08 |
| PraNet | 92.38 ± 0.29 | 86.35 ± 0.43 | 91.31 ± 0.31 | 70.17 ± 1.45 | 56.81 ± 1.82 | 70.17 ± 1.45 | 65.79 |
| Transformer-based | | | | | | | |
| SwinV2 + UPerNet | 93.15 ± 0.11 | 88.50 ± 0.18 | 92.84 ± 0.12 | 72.96 ± 0.71 | 60.29 ± 0.62 | 72.96 ± 0.71 | 68.12 |
| SegFormer | 93.39 ± 0.23¹ | 88.94 ± 0.38 | 93.14 ± 0.27 | 74.36 ± 1.12 | 62.36 ± 1.06 | 74.36 ± 1.12 | 70.11 |
| SETR-MLA | 90.09 ± 0.43 | 83.37 ± 0.24 | 89.19 ± 0.37 | 71.78 ± 2.48 | 58.08 ± 2.76 | 71.78 ± 2.48 | 69.67 |
| TransUNet | 90.67 ± 0.35 | 84.55 ± 0.55 | 90.02 ± 0.40 | 70.25 ± 1.24 | 56.77 ± 1.43 | 70.25 ± 1.24 | 67.14 |
| PVTV2 + EMCAD | 93.33 ± 0.19 | 88.74 ± 0.22 | 93.01 ± 0.16 | 75.24 ± 1.31 | 63.35 ± 1.44¹ | 75.24 ± 1.31 | 71.38 |
| FCBFormer | 93.04 ± 0.21 | 87.96 ± 0.34 | 92.48 ± 0.28 | 75.32 ± 0.55¹ | 62.91 ± 0.63 | 75.32 ± 0.55¹ | 71.52¹ |
| Mamba-based | | | | | | | |
| Swin-UMamba | 92.57 ± 0.10 | 87.78 ± 0.10 | 92.20 ± 0.06 | 74.43 ± 1.23 | 62.26 ± 1.72 | 74.43 ± 1.23 | 70.93 |
| Swin-UMamba-D | 93.39 ± 0.19 | 89.06 ± 0.20¹ | 93.19 ± 0.14¹ | 73.83 ± 1.68 | 61.77 ± 1.48 | 73.83 ± 1.68 | 69.36 |
| UMamba-Bot | 88.47 ± 0.20 | 81.43 ± 0.14 | 87.38 ± 0.22 | 67.03 ± 2.52 | 52.76 ± 2.86 | 67.03 ± 2.52 | 64.78 |
| UMamba-Enc | 87.72 ± 0.11 | 80.74 ± 0.13 | 86.64 ± 0.16 | 66.35 ± 1.96 | 52.74 ± 2.31 | 66.35 ± 1.96 | 65.33 |
| VM-UNETV2 | 93.09 ± 0.29 | 88.38 ± 0.34 | 92.76 ± 0.33 | 73.82 ± 1.73 | 61.41 ± 2.09 | 73.82 ± 1.73 | 69.49 |
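The GRR column is consistent with cross-dataset IoU divided by in-domain IoU, expressed as a percentage (e.g., U-Net: 53.72/82.14 × 100 ≈ 65.41). This formula is inferred from the table values rather than stated in the table itself, so treat it as an assumption:

```python
def grr(iou_in_domain: float, iou_cross: float) -> float:
    """Generalization ratio: cross-dataset IoU as a share of in-domain IoU."""
    return iou_cross / iou_in_domain * 100

# U-Net row check: 53.72 / 82.14 * 100 = 65.40, reported as 65.41.
assert abs(grr(82.14, 53.72) - 65.41) < 0.05
```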
- Citation: Chan IN, Wong PK, Yan T, Hu YY, Chan CI, Qin YY, Wong CH, Chan IW, Lam IH, Wong SH, Li Z, Gao S, Yu HH, Yao L, Zhao BL, Hu Y. Assessing deep learning models for multi-class upper endoscopic disease segmentation: A comprehensive comparative study. World J Gastroenterol 2025; 31(41): 111184
- URL: https://www.wjgnet.com/1007-9327/full/v31/i41/111184.htm
- DOI: https://dx.doi.org/10.3748/wjg.v31.i41.111184
