Table 1 Details of the self-collected dataset

| Aspect | Details |
| --- | --- |
| Source hospitals | Kiang Wu Hospital, Macao; Xiangyang Central Hospital, Xiangyang, Hubei Province, China |
| Data collection period | 2019-2024 |
| Total images | 3313 endoscopic images |
| Imaging modality | White light endoscopy (majority); narrow band imaging (subset) |
| Imaging equipment | Olympus EVIS X1 CV-1500 with GIF EZ1500 gastroscopes (Kiang Wu Hospital); Olympus CF-HQ290-I and PENTAX EG29-i10 gastroscopes (Xiangyang Central Hospital) |
| Disease classes | Normal: 1014 images; esophageal neoplasm: 256 images; esophageal varices: 228 images; GERD: 143 images; gastric neoplasm: 486 images; gastric polyp: 526 images; gastric ulcer: 366 images; gastric varices: 83 images; duodenal ulcer: 211 images |
| Data acquisition | Images were retrieved from the two hospital databases: (1) Normal images were selected from chronic superficial gastritis cases with a visually normal appearance on white light endoscopy and no significant pathological findings; and (2) Disease images were identified by International Classification of Diseases codes under the supervision of gastroenterologists and biomedical engineering postgraduate students |
| Annotation process | Disease labels were verified by gastroenterologists; mask annotations were created in AnyLabeling by a PhD student and a post-doctoral researcher in biomedical engineering under the supervision of a gastroenterologist; annotations were exported as JSON and converted to binary PNG masks; masks were independently reviewed by an experienced gastroenterologist |
| Ethical approval identifiers | Medical Ethics Committee of Xiangyang Central Hospital (No. 2024-145); Medical Ethics Committee of Kiang Wu Hospital, Macao (No. 2019-005) |
| Compliance | Conducted in accordance with the Declaration of Helsinki |
| Pre-processing pipeline | Black borders and text metadata are removed; images are cropped to resolutions ranging from 268 × 217 to 1545 × 1156 pixels; images are divided into training-and-validation (80%) and test (20%) sets, with the training-and-validation set further split into training (80%) and validation (20%); multi-class segmentation masks are generated with nine channels: all pixels in normal images are assigned to the "normal" channel, while in disease images non-lesion/background areas are assigned to the "normal" channel and lesion areas to their respective disease channels (see the sketch below) |
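The nine-channel mask construction described in the pre-processing row can be summarized in a few lines. A minimal sketch, assuming binary per-image lesion masks; the function names are illustrative, not the authors' code:

```python
import numpy as np

NUM_CHANNELS = 9  # channel 0 = "normal", channels 1-8 = disease classes

def disease_image_mask(lesion_mask: np.ndarray, class_idx: int) -> np.ndarray:
    """One-hot (9, H, W) mask: background -> channel 0, lesion -> its class."""
    mask = np.zeros((NUM_CHANNELS, *lesion_mask.shape), dtype=np.uint8)
    mask[0] = 1 - lesion_mask        # non-lesion/background pixels -> "normal"
    mask[class_idx] = lesion_mask    # lesion pixels -> their disease channel
    return mask

def normal_image_mask(height: int, width: int) -> np.ndarray:
    """Normal images: every pixel is assigned to the 'normal' channel."""
    mask = np.zeros((NUM_CHANNELS, height, width), dtype=np.uint8)
    mask[0] = 1
    return mask
```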
Table 2 Details of the EDD2020 dataset

| Aspect | Details |
| --- | --- |
| Source hospitals | Ambroise Paré Hospital, France; Centro Riferimento Oncologico IRCCS, Italy; Istituto Oncologico Veneto, Italy; John Radcliffe Hospital, United Kingdom |
| Total images | 386 still images with 502 segmentation masks |
| Disease classes | Barrett's esophagus: 160 masks; suspicious precancerous lesions: 88 masks; high-grade dysplasia: 74 masks; cancer: 53 masks; polyps: 127 masks |
| Annotation process | Performed by two clinical experts and two post-doctoral researchers using the open-source VGG Image Annotator tool |
| Pre-processing pipeline | Stratified split into training (81%), validation (9%), and test (10%) sets using scikit-learn[34] (see the sketch below); multi-class segmentation masks are generated with six channels (five disease classes + background): non-lesion/background areas in disease images are assigned to the "normal" channel, while lesions are assigned to their respective disease channels |
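The 81%/9%/10% stratified split can be reproduced by applying scikit-learn's train_test_split twice (10% held out for test, then 10% of the remainder for validation). A minimal sketch with dummy stand-ins for the EDD2020 identifiers and labels; the seed, and the simplification of one label per image, are assumptions:

```python
import random
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the 386 EDD2020 image identifiers and their labels.
random.seed(0)
image_ids = [f"img_{i:03d}" for i in range(386)]
labels = [random.choice(["BE", "HGD", "cancer", "polyp", "suspicious"])
          for _ in image_ids]

# Hold out 10% as the test set, stratified by disease label.
trainval_ids, test_ids, trainval_y, test_y = train_test_split(
    image_ids, labels, test_size=0.10, stratify=labels, random_state=42)

# 10% of the remaining 90% (9% overall) becomes the validation set.
train_ids, val_ids, train_y, val_y = train_test_split(
    trainval_ids, trainval_y, test_size=0.10, stratify=trainval_y,
    random_state=42)
```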
Table 3 Distribution of disease labels across dataset splits

| Disease class | Training | Validation | Test | Total |
| --- | --- | --- | --- | --- |
| BE | 130 | 14 | 16 | 160 |
| HGD | 61 | 6 | 7 | 74 |
| Cancer | 43 | 5 | 5 | 53 |
| Polyp | 103 | 11 | 13 | 127 |
| Suspicious | 68 | 11 | 9 | 88 |
| Total | 405 | 47 | 50 | 502 |
Table 4 Model selection for this comparative study

| Architecture type | Model name | Encoder | Applications |
| --- | --- | --- | --- |
| CNN-based | U-Net[36] | NA | General segmentation, biomedical segmentation |
| CNN-based | ResNet[37] + U-Net | ResNet50^1 | General segmentation, biomedical segmentation |
| CNN-based | ConvNeXt[38] + UPerNet[39] | ConvNeXt-T^1 | General segmentation |
| CNN-based | M2SNet[40] | Res2Net50-v1b-26w-4s^1[41] | Medical segmentation |
| CNN-based | Dilated SegNet[42] | ResNet50^1 | Polyp segmentation |
| CNN-based | PraNet[27] | Res2Net50-v1b-26w-4s^1 | Polyp segmentation |
| Transformer-based | SwinV2[43] + UPerNet | SwinV2-T^1 | General segmentation |
| Transformer-based | SETR-MLA[44] | ViT-B-16^1[22] | General segmentation |
| Transformer-based | SegFormer[45] | MiT-B2^1 | General segmentation |
| Transformer-based | TransUNet[46] | ResNet50^1 and ViT-B-16^1 | Medical segmentation |
| Transformer-based | PVTV2[47] + EMCAD[48] | PVTV2-B2^1 | Medical segmentation |
| Transformer-based | FCBFormer[26] | PVTV2-B2^1 | Polyp segmentation |
| Mamba-based | Swin-UMamba[49] | VSSM-encoder^1 | Medical segmentation |
| Mamba-based | Swin-UMamba-D[49] | VSSM-encoder^1 | Medical segmentation |
| Mamba-based | U-Mamba-Bot[50] | NA | Biomedical segmentation |
| Mamba-based | U-Mamba-Enc[50] | NA | Biomedical segmentation |
| Mamba-based | VM-UNETV2[51] | VM-UNET-encoder^1 | Medical segmentation |
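For illustration, an encoder-decoder baseline such as the ResNet + U-Net entry above could be assembled with the segmentation_models_pytorch library. This sketch is not the authors' implementation; the ImageNet pre-training and the 512 × 512 input resolution are assumptions:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet50",     # ResNet50 encoder, as in Table 4
    encoder_weights="imagenet",  # assumed pre-training
    in_channels=3,               # RGB endoscopic frames
    classes=9,                   # nine mask channels (Table 1)
)

with torch.no_grad():
    logits = model(torch.randn(1, 3, 512, 512))  # -> shape (1, 9, 512, 512)
```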
Table 5 Performance metrics of different segmentation models on the self-collected dataset and the EDD2020 dataset, mean ± SD

| Model | Encoder | Pixel accuracy (%), self-collected | IoU (%), self-collected | Dice score (%), self-collected | Pixel accuracy (%), EDD2020 | IoU (%), EDD2020 | Dice score (%), EDD2020 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-based | | | | | | | |
| U-Net | NA | 89.22 ± 0.16 | 82.14 ± 0.21 | 88.47 ± 0.17 | 93.41 ± 0.07 | 67.63 ± 0.48 | 79.37 ± 0.44 |
| ResNet + U-Net | ResNet50^1 | 92.20 ± 0.14 | 87.30 ± 0.30 | 91.95 ± 0.19 | 94.52 ± 0.31 | 73.97 ± 1.29 | 83.59 ± 1.05 |
| ConvNeXt + UPerNet | ConvNeXt-T^1 | 93.05 ± 0.14 | 88.48 ± 0.09 | 92.76 ± 0.10 | 95.17 ± 0.13 | 76.90 ± 0.61 | 85.65 ± 0.48 |
| M2SNet | Res2Net50-v1b-26w-4s^1 | 92.17 ± 0.17 | 86.93 ± 0.32 | 91.72 ± 0.26 | 94.80 ± 0.22 | 74.81 ± 1.01 | 84.24 ± 0.79 |
| Dilated SegNet | ResNet50^1 | 92.43 ± 0.35 | 87.47 ± 0.51 | 92.04 ± 0.32 | 94.46 ± 0.24 | 73.64 ± 0.88 | 83.35 ± 0.78 |
| PraNet | Res2Net50-v1b-26w-4s^1 | 92.38 ± 0.29 | 86.35 ± 0.43 | 91.31 ± 0.31 | 94.48 ± 0.07 | 61.12 ± 0.41 | 74.15 ± 0.34 |
| Transformer-based | | | | | | | |
| SwinV2 + UPerNet | SwinV2-T^1 | 93.15 ± 0.11 | 88.50 ± 0.18 | 92.84 ± 0.12 | 95.18 ± 0.23 | 76.97 ± 0.89 | 85.59 ± 0.65 |
| SegFormer | MiT-B2^1 | 93.39 ± 0.23^2 | 88.94 ± 0.38 | 93.14 ± 0.27 | 95.25 ± 0.31 | 77.20 ± 0.98 | 85.90 ± 0.76 |
| SETR-MLA | ViT-B-16^1 | 90.09 ± 0.43 | 83.37 ± 0.24 | 89.19 ± 0.37 | 94.17 ± 0.36 | 71.48 ± 1.43 | 82.14 ± 1.23 |
| TransUNet | ResNet50^1 and ViT-B-16^1 | 90.67 ± 0.35 | 84.55 ± 0.55 | 90.02 ± 0.40 | 92.74 ± 0.18 | 65.06 ± 1.33 | 77.35 ± 1.16 |
| PVTV2 + EMCAD | PVTV2-B2^1 | 93.33 ± 0.19 | 88.74 ± 0.22 | 93.01 ± 0.16 | 95.22 ± 0.22 | 77.07 ± 0.91 | 85.81 ± 0.67 |
| FCBFormer | PVTV2-B2^1 | 93.04 ± 0.21 | 87.96 ± 0.34 | 92.48 ± 0.28 | 95.04 ± 0.15 | 76.03 ± 0.70 | 85.18 ± 0.45 |
| Mamba-based | | | | | | | |
| Swin-UMamba | VSSM-encoder^1 | 92.57 ± 0.10 | 87.78 ± 0.10 | 92.20 ± 0.06 | 93.81 ± 0.30 | 71.12 ± 1.26 | 81.23 ± 0.86 |
| Swin-UMamba-D | VSSM-D-encoder^1 | 93.39 ± 0.19 | 89.06 ± 0.20^2 | 93.19 ± 0.14^2 | 95.37 ± 0.09^2 | 77.53 ± 0.32^2 | 86.15 ± 0.22^2 |
| U-Mamba-Bot | NA | 88.47 ± 0.20 | 81.43 ± 0.14 | 87.38 ± 0.22 | 92.04 ± 0.14 | 61.79 ± 0.86 | 75.03 ± 0.80 |
| U-Mamba-Enc | NA | 87.72 ± 0.11 | 80.74 ± 0.13 | 86.64 ± 0.16 | 92.12 ± 0.29 | 61.82 ± 0.56 | 75.24 ± 0.60 |
| VM-UNETV2 | VM-UNET-encoder^1 | 93.09 ± 0.29 | 88.38 ± 0.34 | 92.76 ± 0.33 | 94.89 ± 0.14 | 74.89 ± 0.47 | 84.36 ± 0.42 |
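For reference, the three metrics in Table 5 can be computed from integer label maps as below. How the authors aggregate over classes and images is not reproduced in this section, so the macro-averaging in this sketch is an assumption:

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())

def mean_iou_and_dice(pred: np.ndarray, gt: np.ndarray, num_classes: int = 9):
    """Macro-average IoU and Dice over classes present in prediction or GT."""
    ious, dices = [], []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        if not p.any() and not g.any():
            continue  # class absent from both; leave it out of the average
        inter = np.logical_and(p, g).sum()
        ious.append(inter / np.logical_or(p, g).sum())
        dices.append(2 * inter / (p.sum() + g.sum()))
    return float(np.mean(ious)), float(np.mean(dices))
```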
Table 6 Comparison of efficiency metrics across models and architectural paradigms, mean ± SD

| Model | Parameters (M) | FLOPs (G) | GPU usage (GB) | Mean training time, self-collected (min) | Mean training time, EDD2020 (min) | Mean inference time, CPU (ms) | Mean inference time, GPU (ms) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-based | | | | | | | | |
| U-Net | 31.46 | 36.95 | 3.30 | 82.79 ± 0.25 | 10.35 ± 0.03 | 215.35 ± 37.50 | 3.64 ± 6.32^1 | 274.73^1 |
| ResNet + U-Net | 32.52 | 8.23 | 2.00 | 57.03 ± 0.35 | 16.69 ± 0.12 | 162.67 ± 6.32 | 5.18 ± 0.35 | 193.05 |
| ConvNeXt + UPerNet | 41.37 | 16.71 | 2.50 | 77.63 ± 0.16 | 8.84 ± 0.10 | 102.97 ± 8.35 | 5.65 ± 0.40 | 176.99 |
| M2SNet | 29.89 | 13.50 | 2.50 | 89.83 ± 1.75 | 9.15 ± 0.07 | 101.97 ± 6.52 | 14.86 ± 1.21 | 67.29 |
| Dilated SegNet | 18.11 | 120.72 | 3.20 | 98.38 ± 0.87 | 10.71 ± 0.04 | 146.60 ± 9.76 | 9.23 ± 1.24 | 108.34 |
| PraNet | 32.56 | 5.30 | 1.80 | 184.06 ± 0.22 | 8.18 ± 0.10 | 65.78 ± 7.19 | 12.16 ± 0.78 | 82.24 |
| Transformer-based | | | | | | | | |
| SwinV2 + UPerNet | 41.91 | 17.19 | 2.70 | 89.07 ± 0.98 | 8.96 ± 0.09 | 126.12 ± 13.56 | 12.19 ± 0.65 | 82.03 |
| SegFormer | 24.73 | 4.23 | 2.00 | 71.93 ± 0.11 | 8.04 ± 0.19 | 61.48 ± 5.74 | 9.68 ± 0.75 | 103.31 |
| SETR-MLA | 90.77 | 18.60 | 3.00 | 71.40 ± 1.13 | 9.18 ± 0.17 | 109.13 ± 4.38 | 5.55 ± 0.52 | 180.18 |
| TransUNet | 105.00 | 29.33 | 4.50 | 107.16 ± 0.84 | 12.87 ± 0.21 | 204.71 ± 25.55 | 13.03 ± 0.71 | 76.75 |
| PVTV2 + EMCAD | 26.77 | 4.43 | 2.50 | 85.03 ± 1.18 | 9.08 ± 0.12 | 78.35 ± 9.39 | 12.56 ± 1.99 | 79.62 |
| FCBFormer | 33.09 | 29.98 | 8.10 | 163.76 ± 0.56 | 18.44 ± 0.13 | 305.01 ± 34.36 | 21.43 ± 3.49 | 46.66 |
| Mamba-based | | | | | | | | |
| Swin-UMamba | 59.89 | 31.46 | 6.00 | 162.35 ± 0.57 | 20.68 ± 0.49 | NA | 13.00 ± 0.68 | 76.92 |
| Swin-UMamba-D | 27.50 | 6.10 | 5.30 | 148.57 ± 0.68 | 17.46 ± 0.13 | NA | 12.97 ± 1.53 | 77.10 |
| U-Mamba-Bot | 28.77 | 18.68 | 2.90 | 91.31 ± 0.15 | 11.03 ± 0.06 | NA | 6.27 ± 0.54 | 159.49 |
| U-Mamba-Enc | 27.56 | 19.05 | 3.10 | 97.68 ± 0.33 | 12.07 ± 0.04 | NA | 7.28 ± 0.45 | 137.36 |
| VM-UNETV2 | 22.77 | 4.07 | 13.20 | 108.95 ± 0.23 | 12.12 ± 0.21 | NA | 12.90 ± 7.27 | 77.52 |
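A sketch of how the parameter counts, GPU latencies, and FPS values in Table 6 can be measured in PyTorch (FPS in the table is the reciprocal of the mean GPU latency, e.g., 1000/3.64 ≈ 274.73 for U-Net). The warm-up count, run count, input resolution, and batch size here are assumptions, not the paper's protocol:

```python
import time
import torch

def parameter_count_m(model: torch.nn.Module) -> float:
    """Total parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def mean_latency_and_fps(model, device="cuda", warmup=10, runs=100):
    model.eval().to(device)
    x = torch.randn(1, 3, 512, 512, device=device)  # assumed input size
    for _ in range(warmup):
        model(x)                                    # warm-up passes
    if device == "cuda":
        torch.cuda.synchronize()                    # flush queued kernels
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    ms = (time.perf_counter() - start) / runs * 1000.0
    return ms, 1000.0 / ms                          # latency (ms) and FPS
```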
Table 7 Performance and efficiency comparison of different models across architectural paradigms, mean ± SD

| Model | Parameters (M) | FLOPs (G) | Mean inference time on GPU (ms) | IoU (%), self-collected | IoU (%), EDD2020 | Average IoU (%) | PET score (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-based | | | | | | | |
| U-Net | 31.46 | 36.95 | 3.64 ± 6.32^1 | 82.14 ± 0.21 | 67.63 ± 0.48 | 74.88 | 45.74 |
| ResNet + U-Net | 32.52 | 8.23 | 5.18 ± 0.35 | 87.30 ± 0.30 | 73.97 ± 1.29 | 80.63 | 82.58 |
| ConvNeXt + UPerNet | 41.37 | 16.71 | 5.65 ± 0.40 | 88.48 ± 0.09 | 76.90 ± 0.61 | 82.69 | 84.70 |
| M2SNet | 29.89 | 13.50 | 14.86 ± 1.21 | 86.93 ± 0.32 | 74.81 ± 1.01 | 80.87 | 72.33 |
| Dilated SegNet | 18.11 | 120.72 | 9.23 ± 1.24 | 87.47 ± 0.51 | 73.64 ± 0.88 | 80.55 | 74.88 |
| PraNet | 32.56 | 5.30 | 12.16 ± 0.78 | 86.35 ± 0.43 | 61.12 ± 0.41 | 73.74 | 48.81 |
| Transformer-based | | | | | | | |
| SwinV2 + UPerNet | 41.91 | 17.19 | 12.19 ± 0.65 | 88.50 ± 0.18 | 76.97 ± 0.89 | 82.74 | 78.41 |
| SegFormer | 24.73 | 4.23 | 9.68 ± 0.75 | 88.94 ± 0.38 | 77.20 ± 0.98 | 82.86 | 92.02^1 |
| SETR-MLA | 90.77 | 18.60 | 5.55 ± 0.52 | 83.37 ± 0.24 | 71.48 ± 1.43 | 77.42 | 52.45 |
| TransUNet | 105.00 | 29.33 | 13.03 ± 0.71 | 84.55 ± 0.55 | 65.06 ± 1.33 | 74.81 | 26.39 |
| PVTV2 + EMCAD | 26.77 | 4.43 | 12.56 ± 1.99 | 88.74 ± 0.22 | 77.07 ± 0.91 | 82.91 | 88.14 |
| FCBFormer | 33.09 | 29.98 | 21.43 ± 3.49 | 87.96 ± 0.34 | 76.03 ± 0.70 | 82.00 | 61.89 |
| Mamba-based | | | | | | | |
| Swin-UMamba | 59.89 | 31.46 | 13.00 ± 0.68 | 87.78 ± 0.10 | 71.12 ± 1.26 | 79.45 | 53.31 |
| Swin-UMamba-D | 27.50 | 6.10 | 12.97 ± 1.53 | 89.06 ± 0.20^1 | 77.53 ± 0.32^1 | 83.29^1 | 88.39 |
| U-Mamba-Bot | 28.77 | 18.68 | 6.27 ± 0.54 | 81.43 ± 0.14 | 61.79 ± 0.86 | 71.61 | 39.45 |
| U-Mamba-Enc | 27.56 | 19.05 | 7.28 ± 0.45 | 80.74 ± 0.13 | 61.82 ± 0.56 | 71.28 | 37.15 |
| VM-UNETV2 | 22.77 | 4.07^1 | 12.90 ± 7.27 | 88.38 ± 0.34 | 74.89 ± 0.47 | 81.63 | 83.48 |
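The Average IoU column is the arithmetic mean of the two per-dataset IoU columns; the PET score is defined elsewhere in the paper and is not recomputed here. A worked check for ResNet + U-Net:

```python
# Average IoU from the two per-dataset IoU means in Table 7:
avg_iou = (87.30 + 73.97) / 2   # = 80.635, reported as 80.63 in the table
```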
Table 8 Statistical comparison of different architecture types across the two datasets

| Architecture type | Models (n) | Mean IoU (%) (95%CI) | Pairwise comparisons |
| --- | --- | --- | --- |
| Self-collected dataset | | | |
| CNN-based | 6 | 86.44 (84.11, 88.77) | vs T: t = -0.43, P = 0.68, d = 0.25; vs M: t = 0.50, P = 0.63, d = -0.30 |
| Transformer-based | 6 | 87.01 (84.48, 89.55) | vs C: t = 0.43, P = 0.68, d = 0.25; vs M: t = 0.78, P = 0.46, d = -0.46 |
| Mamba-based | 5 | 85.48 (80.46, 90.50) | vs C: t = -0.50, P = 0.63, d = -0.30; vs T: t = -0.78, P = 0.46, d = -0.46 |
| EDD2020 dataset | | | |
| CNN-based | 6 | 71.35 (65.16, 77.53) | vs T: t = -0.84, P = 0.42, d = 0.49; vs M: t = 0.48, P = 0.64, d = -0.29 |
| Transformer-based | 6 | 73.97 (68.85, 79.08) | vs C: t = 0.84, P = 0.42, d = 0.49; vs M: t = 1.23, P = 0.25, d = -0.73 |
| Mamba-based | 5 | 69.43 (60.34, 78.52) | vs C: t = -0.48, P = 0.64, d = -0.29; vs T: t = -1.23, P = 0.25, d = -0.73 |

C: CNN-based; T: Transformer-based; M: Mamba-based.
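The pairwise statistics above are consistent with independent two-sample t-tests plus Cohen's d over the per-model IoU values of each architecture family. A minimal sketch, assuming a pooled-variance Student's t-test (whether the authors used Student's or Welch's variant is not stated in this section; the sign convention for d in the table also appears to differ from the plain a-minus-b convention below). The arrays are the self-collected IoU means from Table 7:

```python
import numpy as np
from scipy import stats

# Self-collected IoU means (Table 7) for the CNN and transformer rows.
cnn = np.array([82.14, 87.30, 88.48, 86.93, 87.47, 86.35])
trans = np.array([88.50, 88.94, 83.37, 84.55, 88.74, 87.96])

t_stat, p_val = stats.ttest_ind(cnn, trans)  # pooled-variance t-test

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with the pooled standard deviation."""
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                      (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

# Reproduces t = -0.43 and P = 0.68 from the self-collected C-vs-T row.
print(f"t = {t_stat:.2f}, P = {p_val:.2f}, d = {cohens_d(cnn, trans):.2f}")
```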
Table 9 Performance comparison of different models evaluated on the self-collected dataset and through cross-validation on the test split of the EDD2020 dataset, mean ± SD

| Model | Pixel accuracy (%), self-collected | IoU (%), self-collected | Dice score (%), self-collected | Pixel accuracy (%), EDD2020 cross-validation | IoU (%), EDD2020 cross-validation | Dice score (%), EDD2020 cross-validation | GRR (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-based | | | | | | | |
| U-Net | 89.22 ± 0.16 | 82.14 ± 0.21 | 88.47 ± 0.17 | 67.77 ± 1.17 | 53.72 ± 1.28 | 67.77 ± 1.17 | 65.41 |
| ResNet + U-Net | 92.20 ± 0.14 | 87.30 ± 0.30 | 91.95 ± 0.19 | 71.62 ± 1.09 | 58.52 ± 1.49 | 71.62 ± 1.09 | 67.03 |
| ConvNeXt + UPerNet | 93.05 ± 0.14 | 88.48 ± 0.09 | 92.76 ± 0.10 | 73.60 ± 0.42 | 60.87 ± 0.72 | 73.60 ± 0.42 | 68.79 |
| M2SNet | 92.17 ± 0.17 | 86.93 ± 0.32 | 91.72 ± 0.26 | 71.98 ± 0.24 | 58.80 ± 0.23 | 71.98 ± 0.24 | 67.64 |
| Dilated SegNet | 92.43 ± 0.35 | 87.47 ± 0.51 | 92.04 ± 0.32 | 72.26 ± 0.38 | 59.54 ± 0.52 | 72.26 ± 0.38 | 68.08 |
| PraNet | 92.38 ± 0.29 | 86.35 ± 0.43 | 91.31 ± 0.31 | 70.17 ± 1.45 | 56.81 ± 1.82 | 70.17 ± 1.45 | 65.79 |
| Transformer-based | | | | | | | |
| SwinV2 + UPerNet | 93.15 ± 0.11 | 88.50 ± 0.18 | 92.84 ± 0.12 | 72.96 ± 0.71 | 60.29 ± 0.62 | 72.96 ± 0.71 | 68.12 |
| SegFormer | 93.39 ± 0.23^1 | 88.94 ± 0.38 | 93.14 ± 0.27 | 74.36 ± 1.12 | 62.36 ± 1.06 | 74.36 ± 1.12 | 70.11 |
| SETR-MLA | 90.09 ± 0.43 | 83.37 ± 0.24 | 89.19 ± 0.37 | 71.78 ± 2.48 | 58.08 ± 2.76 | 71.78 ± 2.48 | 69.67 |
| TransUNet | 90.67 ± 0.35 | 84.55 ± 0.55 | 90.02 ± 0.40 | 70.25 ± 1.24 | 56.77 ± 1.43 | 70.25 ± 1.24 | 67.14 |
| PVTV2 + EMCAD | 93.33 ± 0.19 | 88.74 ± 0.22 | 93.01 ± 0.16 | 75.24 ± 1.31 | 63.35 ± 1.44^1 | 75.24 ± 1.31 | 71.38 |
| FCBFormer | 93.04 ± 0.21 | 87.96 ± 0.34 | 92.48 ± 0.28 | 75.32 ± 0.55^1 | 62.91 ± 0.63 | 75.32 ± 0.55^1 | 71.52^1 |
| Mamba-based | | | | | | | |
| Swin-UMamba | 92.57 ± 0.10 | 87.78 ± 0.10 | 92.20 ± 0.06 | 74.43 ± 1.23 | 62.26 ± 1.72 | 74.43 ± 1.23 | 70.93 |
| Swin-UMamba-D | 93.39 ± 0.19 | 89.06 ± 0.20^1 | 93.19 ± 0.14^1 | 73.83 ± 1.68 | 61.77 ± 1.48 | 73.83 ± 1.68 | 69.36 |
| U-Mamba-Bot | 88.47 ± 0.20 | 81.43 ± 0.14 | 87.38 ± 0.22 | 67.03 ± 2.52 | 52.76 ± 2.86 | 67.03 ± 2.52 | 64.78 |
| U-Mamba-Enc | 87.72 ± 0.11 | 80.74 ± 0.13 | 86.64 ± 0.16 | 66.35 ± 1.96 | 52.74 ± 2.31 | 66.35 ± 1.96 | 65.33 |
| VM-UNETV2 | 93.09 ± 0.29 | 88.38 ± 0.34 | 92.76 ± 0.33 | 73.82 ± 1.73 | 61.41 ± 2.09 | 73.82 ± 1.73 | 69.49 |
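Although GRR is defined in the main text, every row of Table 9 is consistent with GRR being the percentage of in-domain IoU retained under cross-dataset evaluation. Two worked checks, assuming that reading:

```python
# GRR = 100 * (cross-validation IoU on EDD2020) / (self-collected IoU)
grr_unet = 100 * 53.72 / 82.14       # = 65.40, reported as 65.41 (rounding)
grr_fcbformer = 100 * 62.91 / 87.96  # = 71.52, matching FCBFormer's GRR column
```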