Editorial
©The Author(s) 2026.
World J Gastroenterol. Feb 28, 2026; 32(8): 115297
Published online Feb 28, 2026. doi: 10.3748/wjg.v32.i8.115297
Table 1 Comparative model summary
Model name | Architecture family | Pretraining | Mean IoU, % | Mean Dice, % | PET score, % | GRR, % | Inference latency, ms/frame | Model size | Notes
U-Net | CNN | N/A | 74.88 | (Self 79.37 + EDD2020 67.63)/2 = 73.50 | 45.74 | 65.41 | 3.64 | 31.46 M params (approximately 126 MB) | Classic encoder-decoder with skip connections; excellent throughput (very high FPS) but lower PET; simple architecture may limit generalization
ResNet + U-Net | CNN | ResNet encoder, typically ImageNet-1K | 80.63 | (83.59 + 73.97)/2 = 78.78 | 82.58 | 67.03 | 5.18 | 32.52 M params (approximately 130 MB) | Residual backbone improves accuracy and robustness with modest compute; good PET
ConvNeXt + UPerNet | CNN | Typically ImageNet-1K | 82.69 | (85.65 + 76.90)/2 = 81.27 | 84.70 | 68.79 | 5.65 | 41.37 M params (approximately 165 MB) | Modern CNN with ViT-inspired design; strong accuracy and good throughput
M2SNet | CNN | N/A | 80.87 | (84.24 + 74.81)/2 = 79.53 | 72.33 | 67.64 | 14.86 | 29.89 M params (approximately 120 MB) | Multi-scale subtraction units improve feature complementarity and edge clarity; slower inference among the CNNs
Dilated SegNet | CNN | ResNet50 backbone | 80.55 | (83.35 + 73.64)/2 = 78.49 | 74.88 | 68.08 | 9.23 | 18.11 M params (approximately 72 MB) | Dilated convolutions for real-time polyp segmentation; good trade-off at moderate speed
PraNet | CNN | N/A | 73.74 | (74.15 + 61.12)/2 = 67.63 | 48.81 | 65.79 | 12.16 | 32.56 M params (approximately 130 MB) | Parallel partial decoder plus reverse attention for boundary refinement; decent accuracy but lower PET and generalization
SwinV2 + UPerNet | Transformer | SwinV2, typically ImageNet-pretrained | 82.74 | (85.59 + 76.97)/2 = 81.28 | 78.41 | 68.12 | 12.19 | 41.91 M params (approximately 168 MB) | Hierarchical SwinV2 transformer backbone with UPerNet decoder; strong accuracy with moderate compute
SegFormer | Transformer | Typically ImageNet-1K | 82.86 | (93.14 + 77.20)/2 = 85.17 | 92.02 | 70.11 | 9.68 | 24.73 M params (approximately 99 MB) | Excellent performance-efficiency balance; low FLOPs (4.23 GFLOPs) and good generalization; recommended for real-time clinical use in the source study
SETR-MLA | Transformer | ViT backbone | 77.42 | (82.14 + 71.48)/2 = 76.81 | 52.45 | 69.67 | 5.55 | 90.77 M params (approximately 363 MB) | Segmentation transformer with multi-level aggregation; large parameter count but relatively fast inference in this setup
TransUNet | Hybrid (CNN + transformer) | ResNet50 + ViT | 74.81 | (77.35 + 65.06)/2 = 71.21 | 26.39 | 67.14 | 13.03 | 105.00 M params (approximately 420 MB) | Combines a CNN encoder with a ViT; strong representational power but heavy compute and lower PET
PVTV2 + EMCAD | Transformer | PVTV2, usually ImageNet-pretrained | 82.91 | (85.81 + 77.07)/2 = 81.44 | 88.14 | 71.38 | 12.56 | 26.77 M params (approximately 107 MB) | Pyramid vision transformer v2 with efficient multi-scale decoding; strong generalization and good PET
FCBFormer | Transformer | PVTV2 backbone | 82.00 | (85.18 + 76.03)/2 = 80.61 | 61.89 | 71.52 | 21.43 | 33.09 M params (approximately 132 MB) | Polyp-specialized transformer variant; strongest generalization (highest GRR) but the highest inference time, limiting real-time use at high resolution
Swin-UMamba | Hybrid (Mamba-based) | N/A | 79.45 | (81.23 + 71.12)/2 = 76.18 | 53.31 | 70.93 | 13.00 | 59.89 M params (approximately 240 MB) | Mamba hybrid leveraging a visual state-space model; good generalization but relatively large, with slower training and inference
Swin-UMamba-D | Hybrid (Mamba-based) | N/A | 83.29 | (86.15 + 77.53)/2 = 81.84 | 88.39 | 69.36 | 12.97 | 27.50 M params (approximately 110 MB) | Best segmentation performance (mean IoU) in the study but relatively high training and inference cost; strong accuracy with moderate generalization
UMamba-Bot | Hybrid (Mamba-based) | N/A | 71.61 | (75.03 + 61.79)/2 = 68.41 | 39.45 | 64.78 | 6.27 | 28.77 M params (approximately 115 MB) | Lightweight Mamba variant; good FPS but weaker accuracy and generalization
UMamba-Enc | Hybrid (Mamba-based) | N/A | 71.28 | (75.24 + 61.82)/2 = 68.53 | 37.15 | 65.33 | 7.28 | 27.56 M params (approximately 110 MB) | Encoder-focused Mamba variant; trade-offs similar to UMamba-Bot: faster but less accurate
VM-UNETV2 | Hybrid (Mamba-like) | N/A | 81.63 | (84.36 + 74.89)/2 = 79.63 | 83.48 | 69.49 | 12.90 | 22.77 M params (approximately 91 MB) | VM-UNET encoder variant; strong PET and competitive accuracy; GPU-focused design
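For readers reproducing the table's accuracy columns, per-image IoU and Dice for binary masks, and the averaging used in the Mean Dice column, can be sketched as follows. This is a minimal NumPy example; the study's exact evaluation pipeline (resolution, thresholding, per-dataset averaging order) may differ:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # two empty masks count as a match

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient (pixel-level F1) for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2 * inter / denom if denom else 1.0

# The Mean Dice column averages the self-collected and EDD2020 test-set
# scores, e.g. for the U-Net row (values in %):
mean_dice_unet = (79.37 + 67.63) / 2  # 73.50
```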
Table 2 Implementation roadmap
Translational challenge | Concrete solution | Responsible stakeholders | Measurable success metrics
Dataset bias and limited diversity | Multi-center data-sharing agreements; standardized metadata schema (demographics, device, protocol); stratified sampling and targeted collection for underrepresented cohorts; federated learning to enable cross-site models while preserving privacy | Clinical consortiums, data governance teams, hospital IT, study PIs, legal/compliance | Number of centers and countries represented; device/vendor diversity index; demographic coverage (age/sex/ethnicity) proportions; change in GRR and external IoU on held-out sites
Annotation variability and subjectivity | Develop and enforce a standardized annotation protocol and labeling guidelines; multi-expert consensus labeling; adjudication workflows; active learning to prioritize ambiguous cases; periodic re-annotation audits | Clinical experts (gastroenterologists), annotation managers, platform vendors, data scientists | Inter-rater agreement (Cohen's kappa/mean IoU across annotators); percentage of masks adjudicated; annotation time per case; model performance gains after consensus labeling
Class imbalance/rare-pathology sensitivity | Oversampling and targeted collection of rare classes; class-aware loss functions (focal, class-weighted); synthetic data and augmentation for rare classes; curriculum learning focused on rare classes | Data acquisition teams, ML engineers, clinical partners, biostatisticians | Per-class recall/sensitivity (especially for rare classes); AUPRC for rare classes; reduction in false-negative rate for underrepresented labels
Imaging variability (lighting, specular reflection, motion blur) | Advanced preprocessing (illumination normalization, reflection removal); robust augmentation (exposure, blur, specular simulation); self-supervised pretraining on large unlabeled endoscopy corpora; spatio-temporal modeling for videos | ML research team, imaging engineers, clinical endoscopy unit, vendors | Performance stratified by exposure/quality buckets (IoU under overexposed vs normal conditions); reduction in failure cases linked to artifacts; frame-level temporal consistency metrics (temporal IoU)
Poor cross-dataset generalization/overfitting | Cross-dataset evaluation; domain adaptation techniques; federated or multi-site training; held-out external validation sets; regularization and ensembling | ML engineers, external collaborators, validation leads, statisticians | Delta IoU/Dice between internal and external test sets; GRR improvement on external cohorts; calibration metrics (Brier score)
Real-time performance and resource constraints (PET) | Model compression and pruning; lightweight architectures (e.g., SegFormer variants); hardware-targeted benchmarking (edge GPU/CPU); optimized inference pipelines | ML engineers, DevOps, clinical IT, hardware vendors | Inference latency (milliseconds/frame) and throughput (FPS) on target hardware; memory usage; FLOPs; PET score or task-specific trade-off metric; clinician acceptance for live use
Clinical validation and impact on workflow | Prospective clinical studies; reader studies comparing model plus clinician vs clinician alone; integration pilots in the endoscopy suite; user-centered UI/UX design and training | Clinical investigators, hospital operations, human factors specialists, clinical IT | Diagnostic accuracy improvement (sensitivity/specificity) in prospective trials; change in missed-lesion rate; time to report; clinician satisfaction and adoption rates
Trust, explainability, and clinician acceptance | Visual explanations (attention maps, uncertainty overlays); case-level confidence scores; reporting of failure modes and limitations; clinician training modules | ML explainability team, clinical educators, product managers, regulatory/QA | Proportion of model outputs with uncertainty flags; clinician trust scores in surveys; reduction in dismissed correct alerts; explainability usability ratings
Privacy, legal, and regulatory readiness | Data de-identification pipeline; DPIAs; early engagement with regulators; pre-specified validation plan; post-market surveillance plan; robust audit trails | Legal/compliance, regulatory affairs, data governance, QA, cybersecurity | Completion of DPIA and IRB approvals; regulatory submission milestones (pre-submission, submission, approval); number of privacy incidents; time to resolve security findings
Multi-modal and longitudinal integration | Multi-modal models (image + report + temporal video); linking endoscopy frames with pathology/EMR metadata; interoperable standards (DICOM/HL7/FHIR) | Data engineers, clinical informatics, pathology, ML researchers, standards officers | Performance gain from added modalities (delta IoU/Dice); percentage of cases with linked pathology; successful end-to-end FHIR/DICOM integrations; improvement in clinically relevant outcome measures (e.g., appropriate biopsy rate)
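Among the concrete solutions in Table 2, the class-aware loss functions cited for class imbalance can be illustrated with a binary focal loss. This is a minimal NumPy sketch rather than the study's implementation; the gamma and alpha defaults below are the conventional values from the focal-loss literature, not parameters reported in the source:

```python
import numpy as np

def focal_loss(p: np.ndarray, y: np.ndarray,
               gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: down-weights easy pixels so that rare-lesion
    pixels contribute more to the gradient. p holds predicted foreground
    probabilities; y holds binary ground-truth labels."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)          # numerical stability
    pt = np.where(y == 1, p, 1.0 - p)         # probability of the true class
    w = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-w * (1.0 - pt) ** gamma * np.log(pt)))
```

With gamma = 0 and alpha = 0.5 this reduces to half the standard binary cross-entropy; raising gamma shrinks the loss on confidently classified pixels, which is the mechanism the roadmap relies on for rare-pathology sensitivity.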