©The Author(s) 2026.
World J Gastroenterol. Feb 28, 2026; 32(8): 115297
Published online Feb 28, 2026. doi: 10.3748/wjg.v32.i8.115297
Table 1 Comparative model summary. Dice values are the arithmetic mean of the self-built and EDD2020 test-set scores; a short sketch of this averaging follows the table
| Model name | Architecture family | Pretraining | Mean IoU, % | Mean Dice, % | PET score, % | GRR, % | Inference latency (milliseconds/frame) | Model size (parameters; approximate disk size) | Notes |
| U-Net | CNN | N/A | 74.88 | (79.37 + 67.63)/2 = 73.50% | 45.74 | 65.41 | 3.64 | 31.46 M, approximately 126 MB | Classic encoder-decoder with skip connections; excellent throughput (very high FPS) but lower PET; simple architecture may limit generalization |
| ResNet + U-Net | CNN | ResNet encoder typically ImageNet-1K | 80.63 | (83.59 + 73.97)/2 = 78.78% | 82.58 | 67.03 | 5.18 | 32.52 M, approximately 130 MB | Residual backbone improves accuracy and robustness with modest compute; good PET |
| ConvNeXt + UPerNet | CNN | Typically ImageNet-1K | 82.69 | (85.65 + 76.90)/2 = 81.27% | 84.70 | 68.79 | 5.65 | 41.37 M, approximately 165 MB | Modern CNN with ViT-inspired design; strong accuracy and good throughput |
| M2SNet | CNN | N/A | 80.87 | (84.24 + 74.81)/2 = 79.53% | 72.33 | 67.64 | 14.86 | 29.89 M, approximately 120 MB | Multi-scale subtraction units for improved feature complementarity and edge clarity; slower inference among CNNs |
| Dilated SegNet | CNN | ResNet50 backbone | 80.55 | (83.35 + 73.64)/2 = 78.49% | 74.88 | 68.08 | 9.23 | 18.111 M, approximately 72 MB | Dilated convolutions for real-time polyp segmentation; good trade-off, moderate speed |
| PraNet | CNN | N/A | 73.74 | (74.15 + 61.12)/2 = 67.63% | 48.81 | 65.79 | 12.16 | 32.56 M, approximately 130 MB | Parallel partial decoder + reverse attention for boundary refinement; decent accuracy but lower PET and generalization |
| SwinV2 + UPerNet | Transformer | SwinV2 typically ImageNet-pretrained | 82.74 | (85.59 + 76.97)/2 = 81.28% | 78.41 | 68.12 | 12.19 | 41.91 M, approximately 168 MB | SwinV2 hierarchical transformer backbone with UPerNet decoder; strong accuracy with moderate compute |
| SegFormer | Transformer | Typically ImageNet-1K | 82.86 | (93.14 + 77.20)/2 = 85.17% | 92.02 | 70.11 | 9.68 | 24.73 M, approximately 99 MB | Excellent performance-efficiency balance; low compute (4.23 GFLOPs); good generalization; recommended for real-time clinical use by the source study |
| SETR-MLA | Transformer | ViT backbone | 77.42 | (82.14 + 71.48)/2 = 76.81% | 52.45 | 69.67 | 5.55 | 90.77 M, approximately 363 MB | Segmentation transformer with multi-level aggregation; large parameter count but relatively fast inference in this setup |
| TransUNet | Hybrid (CNN + transformer) | ResNet50 + ViT | 74.81 | (77.35 + 65.06)/2 = 71.21% | 26.39 | 67.14 | 13.03 | 105.00 M, approximately 420 MB | Combines a CNN encoder with a ViT; strong representational power but heavy compute and lower PET |
| PVTV2 + EMCAD | Transformer | PVTV2 typically ImageNet-pretrained | 82.91 | (85.81 + 77.07)/2 = 81.44% | 88.14 | 71.38 | 12.56 | 26.77 M, approximately 107 MB | Pyramid vision transformer v2 with efficient multi-scale decoding; strong generalization and good PET |
| FCBFormer | Transformer | PVTV2 backbone | 82.00 | (85.18 + 76.03)/2 = 80.61% | 61.89 | 71.52 | 21.43 | 33.09 M, approximately 132 MB | Polyp-specialized transformer variant; strongest generalization (highest GRR) but the highest inference latency in the comparison, limiting real-time use at high resolution |
| Swin-UMamba | Hybrid (Mamba based) | N/A | 79.45 | (81.23 + 71.12)/2 = 76.18% | 53.31 | 70.93 | 13.00 | 59.89 M, approximately 240 MB | Mamba hybrid built on a visual state-space model; good generalization but relatively large, with higher training and inference cost |
| Swin-UMamba-D | Hybrid (Mamba based) | N/A | 83.29 | (86.15 + 77.53)/2 = 81.84% | 88.39 | 69.36 | 12.97 | 27.50 M, approximately 110 MB | Highest mean IoU in the study, with strong segmentation accuracy, but relatively high training and inference cost and only moderate generalization |
| UMamba-Bot | Hybrid (Mamba based) | N/A | 71.61 | (75.03 + 61.79)/2 = 68.41% | 39.45 | 64.78 | 6.27 | 28.77 M, approximately 115 MB | Lightweight Mamba variant; good FPS but weaker accuracy and generalization |
| UMamba-Enc | Hybrid (Mamba based) | N/A | 71.28 | (75.24 + 61.82)/2 = 68.53% | 37.15 | 65.33 | 7.28 | 27.56 M, approximately 110 MB | Encoder-focused Mamba variant; trade-offs similar to UMamba-Bot: faster but less accurate |
| VM-UNETV2 | Hybrid (Mamba-like) | N/A | 81.63 | (84.36 + 74.89)/2 = 79.63% | 83.48 | 69.49 | 12.90 | 22.77 M, approximately 91 MB | VM-UNET encoder variant; strong PET and competitive accuracy; GPU-focused design |
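The Dice column above is a plain arithmetic mean of the two per-dataset scores. As a sanity check, the minimal sketch below reproduces that averaging in Python; the two rows shown reuse the U-Net and SegFormer values from Table 1, and the dictionary layout is purely illustrative.

```python
# Reproduce the Table 1 Dice averaging: mean of the self-built and
# EDD2020 test-set scores. Values below are copied from Table 1.
dice_by_dataset = {
    "U-Net": {"self_built": 79.37, "EDD2020": 67.63},
    "SegFormer": {"self_built": 93.14, "EDD2020": 77.20},
}

for model, scores in dice_by_dataset.items():
    mean_dice = sum(scores.values()) / len(scores)
    print(f"{model}: mean Dice = {mean_dice:.2f}%")
# U-Net: mean Dice = 73.50%
# SegFormer: mean Dice = 85.17%
```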
Table 2 Implementation roadmap
| Translational challenge | Concrete solution | Responsible stakeholders | Measurable success metrics |
| Dataset bias and limited diversity | Multi-center data-sharing agreements; standardized metadata schema (demographics, device, protocol); stratified sampling and targeted collection for underrepresented cohorts; federated learning to enable cross-site models while preserving privacy | Clinical consortia, data governance teams, hospital IT, study PIs, legal/compliance | Number of centers and countries represented; device/vendor diversity index; demographic coverage (age/sex/ethnicity) proportions; change in GRR and external IoU on held-out sites |
| Annotation variability and subjectivity | Develop and enforce a standardized annotation protocol and labeling guidelines; multi-expert consensus labeling; adjudication workflows; active learning to prioritize ambiguous cases; periodic re-annotation audits | Clinical experts (gastroenterologists), annotation managers, platform vendors, data scientists | Inter-rater agreement (Cohen's kappa/mean IoU across annotators; see the agreement sketch after the table); % of masks adjudicated; annotation time per case; model performance gains after consensus labels |
| Class imbalance/rare pathology sensitivity | Oversampling/targeted collection of rare classes; class-aware loss functions (focal, class-weighted; see the focal-loss sketch after the table); synthetic data and augmentation for rare classes; curriculum learning focusing on rare classes | Data acquisition teams, ML engineers, clinical partners, biostatisticians | Per-class recall/sensitivity (especially for rare classes); AUPRC for rare classes; reduction in false-negative rate for underrepresented labels |
| Imaging variability (lighting, specular reflection, motion blur) | Advanced preprocessing (illumination normalization, reflection removal; see the normalization sketch after the table); robust augmentation (exposure, blur, simulated specular reflections); self-supervised pretraining on large unlabeled endoscopy corpora; spatio-temporal modeling for videos | ML research team, imaging engineers, clinical endoscopy unit, vendors | Performance stratified by exposure/quality buckets (IoU under overexposed vs normal conditions); reduction in failure cases linked to artifacts; frame-level temporal consistency metrics (temporal IoU) |
| Poor cross-dataset generalization/overfitting | Cross-dataset evaluation; domain adaptation techniques; federated or multi-site training; held-out external validation sets; regularization and ensembling | ML engineers, external collaborators, validation leads, statisticians | Delta IoU/Dice between internal and external test sets; GRR improvement on external cohorts; calibration metrics (Brier score; see the calibration sketch after the table) |
| Real-time performance and resource constraints (PET) | Model compression and pruning; lightweight architectures (e.g., SegFormer variants); hardware-targeted benchmarking (edge GPU/CPU); optimized inference pipelines | ML engineers, DevOps, clinical IT, hardware vendors | Inference latency (milliseconds/frame, as in Table 1; see the benchmarking sketch after the table); throughput (FPS) on target hardware; memory usage and FLOPs; PET score or a task-specific trade-off metric; clinician acceptance for live use |
| Clinical validation and impact on workflow | Prospective clinical studies; reader studies comparing model + clinician vs clinician alone; integration pilots in the endoscopy suite; user-centered UI/UX design and training | Clinical investigators, hospital operations, human factors specialists, clinical IT | Diagnostic accuracy improvement (sensitivity/specificity) in prospective trials; change in missed-lesion rate; time-to-report; clinician satisfaction and adoption rates |
| Trust, explainability and clinician acceptance | Provide visual explanations (attention maps, uncertainty overlays; see the uncertainty-map sketch after the table); case-level confidence scores; reporting of failure modes and limitations; clinician training modules | ML explainability team, clinical educators, product managers, regulatory/QA | Proportion of model outputs with uncertainty flags; clinician trust scores in surveys; reduction in dismissed correct alerts; explainability usability ratings |
| Privacy, legal & regulatory readiness | Data de-identification pipeline, DPIAs, early engagement with regulators, pre-specified validation plan, post-market surveillance plan, robust audit trails | Legal/compliance, regulatory affairs, data governance, QA, cybersecurity | Completion of DPIA and IRB approvals; regulatory submission milestones (pre-submission, submission, approvals); number of privacy incidents; time to resolve security findings |
| Multi-modal and longitudinal integration | Design multi-modal models (image + report + temporal video), link endoscopy frames with pathology/EMR metadata, adopt interoperable standards (DICOM/HL7/FHIR) | Data engineers, clinical informatics, pathology, ML researchers, standards officers | Increase in model performance when adding modalities (delta IoU/Dice); % cases with linked pathology; successful end-to-end FHIR/DICOM integrations; improvement in clinically-relevant outcome measures (e.g., appropriate biopsy rate) |
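For the inter-rater agreement metrics in the annotation-variability row (Cohen's kappa and mean IoU across annotators), the sketch below shows one way to compute both on a pair of binary masks. It assumes binary lesion-vs-background annotations; the masks, helper names, and the simulated disagreement are hypothetical.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 1.0

def cohens_kappa(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's kappa for two binary label arrays (pixel-wise)."""
    a, b = a.ravel().astype(bool), b.ravel().astype(bool)
    po = (a == b).mean()                                   # observed agreement
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())  # chance agreement
    return float((po - pe) / (1 - pe)) if pe < 1 else 1.0

# Two hypothetical annotators labeling the same 256 x 256 frame.
rng = np.random.default_rng(0)
rater1 = rng.random((256, 256)) > 0.7
rater2 = rater1.copy()
rater2[:10] = ~rater2[:10]                                 # simulate partial disagreement
print(f"pairwise IoU:  {mask_iou(rater1, rater2):.3f}")
print(f"Cohen's kappa: {cohens_kappa(rater1, rater2):.3f}")
```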
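The class-aware losses named in the class-imbalance row can be illustrated with a standard binary focal loss (Lin et al.), sketched below in PyTorch. The alpha and gamma values are the customary defaults, not values from the study, and the batch at the bottom is synthetic.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights easy pixels so rare classes dominate."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Synthetic batch: 4 single-channel 128 x 128 mask logits with a rare positive class.
logits = torch.randn(4, 1, 128, 128)
masks = (torch.rand(4, 1, 128, 128) > 0.9).float()
print(binary_focal_loss(logits, masks).item())
```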
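For the illumination normalization named in the imaging-variability row, one widely used option is CLAHE applied to the lightness channel, as sketched below with OpenCV; the frame is synthetic and the function name is illustrative.

```python
import cv2
import numpy as np

def normalize_illumination(bgr: np.ndarray) -> np.ndarray:
    """Apply CLAHE to the L channel in LAB space to even out lighting."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

# Hypothetical unevenly lit endoscopy frame.
frame = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)
print(normalize_illumination(frame).shape)  # (480, 640, 3)
```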
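The Brier score in the generalization row is simply the mean squared error between predicted probabilities and binary outcomes; lower is better. A pixel-level sketch with synthetic data:

```python
import numpy as np

def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return float(np.mean((probs - labels) ** 2))

# Hypothetical predicted lesion probabilities vs ground-truth pixels.
rng = np.random.default_rng(1)
labels = (rng.random(10_000) > 0.8).astype(float)
probs = np.clip(labels * 0.7 + rng.normal(0.1, 0.1, labels.shape), 0.0, 1.0)
print(f"Brier score: {brier_score(probs, labels):.4f}")
```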
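For the latency metric in the real-time performance row, a minimal benchmarking sketch is given below: warm-up passes followed by timed forward passes, reported in milliseconds per frame to match Table 1. The stand-in network, input shape, and iteration counts are placeholders, not the study's protocol.

```python
import time
import torch

@torch.no_grad()
def latency_ms_per_frame(model: torch.nn.Module,
                         input_shape=(1, 3, 512, 512),
                         warmup: int = 10, iters: int = 100) -> float:
    """Mean forward-pass latency in milliseconds per frame."""
    device = next(model.parameters()).device
    x = torch.randn(*input_shape, device=device)
    model.eval()
    for _ in range(warmup):                 # absorb one-time setup costs
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()            # flush queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

# Hypothetical usage with a tiny stand-in network.
net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
print(f"{latency_ms_per_frame(net):.2f} ms/frame")
```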
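Finally, the uncertainty overlays mentioned in the trust and explainability row can be derived from per-pixel predictive entropy of the softmax output, as sketched below; the four-class logits and the flagging threshold are illustrative.

```python
import torch

def entropy_map(logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel predictive entropy from class logits of shape (B, C, H, W)."""
    p = torch.softmax(logits, dim=1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)  # (B, H, W)

# Hypothetical 4-class segmentation output; high entropy marks uncertain pixels.
logits = torch.randn(1, 4, 256, 256)
u = entropy_map(logits)
flagged = (u > 0.8 * u.max()).float().mean()
print(f"fraction of pixels flagged as uncertain: {flagged:.3f}")
```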
- Citation: Yang YH. Bridging innovation and clinical reality: Interpreting the comparative study of deep learning models for multi-class upper gastrointestinal disease segmentation. World J Gastroenterol 2026; 32(8): 115297
- URL: https://www.wjgnet.com/1007-9327/full/v32/i8/115297.htm
- DOI: https://dx.doi.org/10.3748/wjg.v32.i8.115297
