Revised: November 22, 2025
Accepted: January 4, 2026
Published online: February 28, 2026
Processing time: 121 Days and 1.5 Hours
In this editorial we comment on the article by Chan et al. The study presents the most comprehensive comparative evaluation to date of deep learning models for multi-class segmentation of upper gastrointestinal diseases, leveraging a novel 3313-image, nine-class clinical dataset alongside the public EDD2020 benchmark. Their results demonstrate that hierarchical, pre-trained encoders (notably Swin-UMamba-D) deliver the highest segmentation accuracy, while SegFormer balances accuracy with computational efficiency, making it the more practical choice for real-time clinical use.
Core Tip: This study evaluates 17 advanced deep learning models, including convolutional neural network-, transformer-, and mamba-based architectures, for multi-class upper gastrointestinal disease segmentation. Swin-UMamba-D achieves the highest segmentation accuracy, while SegFormer balances efficiency and performance. Automated segmentation demonstrates clear clinical potential, but dataset diversity, annotation standardization, and prospective clinical validation remain prerequisites for routine adoption.
- Citation: Yang YH. Bridging innovation and clinical reality: Interpreting the comparative study of deep learning models for multi-class upper gastrointestinal disease segmentation. World J Gastroenterol 2026; 32(8): 115297
- URL: https://www.wjgnet.com/1007-9327/full/v32/i8/115297.htm
- DOI: https://dx.doi.org/10.3748/wjg.v32.i8.115297
Endoscopy remains the clinical gold standard for diagnosing many upper gastrointestinal (UGI) conditions, yet missed lesions and observer variability are longstanding, clinically significant problems[1,2]. With the development of artificial intelligence (AI), and more specifically deep learning (DL), its application to UGI endoscopy has moved from algorithmic novelty to an active front in the clinical armamentarium[3].
Over the last five years, convolutional neural networks (CNNs) and endoscopy-specific DL systems have evolved from proof-of-concept demonstrations into tools achieving near-human, and in some tasks better-than-endoscopist, performance for lesion detection, characterization, and outcome prediction in the esophagus and stomach[4-6]. Automated segmentation has become one of the most important applications of these systems, integrating approaches that span CNNs, vision transformers, and hybrid or mamba-based architectures. It has been shown to reliably delineate multiple disease classes under varied imaging conditions, materially reducing missed diagnoses, improving triage during procedures, and standardizing documentation for downstream care[7,8]. These advances enable not only lesion detection but also automated delineation of lesion boundaries useful for biopsy guidance and surgical planning.
Multi-class segmentation that simultaneously identifies ulcers, polyps, varices, neoplasms, and other pathologies is of strong clinical interest because it reflects real-world decision tasks. However, moving from impressive benchmark metrics to routine clinical utility remains nontrivial. Comparative segmentation studies have illuminated both pragmatic opportunities and persistent obstacles along this path.
Chan et al[9] have provided a practical decision-making toolkit for a realistic, multi-label clinical scenario, considering AI augmentation of endoscopic workflows rather than benchmark models alone. This editorial aims to interpret this comparative study of DL models for multi-class upper GI disease segmentation in the context of the broader literature and to argue for further steps toward bridging laboratory performance and routine clinical practice.
In recent years, the technical performance of endoscopy-specific DL models has been impressive but task-specific. Multiple studies focusing on UGI conditions have reported high sensitivities and accuracies; for example, DL-based diagnostic systems have shown detection sensitivities in the range of 80%-90% for esophageal and gastric neoplasia[10,11]. These models can delineate lesion margins, predict invasion depth, and even detect Helicobacter pylori with promising accuracy[12], reflecting the maturity of CNN architectures and AI training strategies as well as the availability of annotated datasets enabling multi-class segmentation and classification. Additionally, DL systems have displayed operational advantages in randomized trials, reducing lesion miss rates during endoscopy[13-15].
However, some critical caveats temper the promise. First, regarding training-data realism and distributional shift, most high-performing models are trained on high-quality annotated images from tertiary centers, and their applicability in community practice and low-prevalence screening populations is limited by selection bias and disease-enriched datasets[5,16]. Small, unrepresentative datasets also compound class imbalance and the risk of overfitting[4]. Second, multi-class segmentation is annotation-intensive: manual frame-by-frame delineation is subject to inter-rater variability and often lacks standardized criteria across centers, and current label-propagation tools and consensus annotation protocols remain immature[17]. Third, static image segmentation metrics are insufficient for real clinical tasks. Video-based endoscopic practice is strongly affected by navigation, insufflation, and mucosal preparation, and few AI systems have robust prospective validation in live procedures, so individualized frame-level analysis remains impractical even within randomized controlled trials[18].
At its core, the study by Chan et al[9] demonstrates that contemporary DL architectures can achieve competitive segmentation performance across multiple lesion classes when trained on curated endoscopic image sets (Table 1; a minimal illustrative sketch of the underlying pixel-level metrics follows the table). The work also convincingly shows that hierarchical encoders with ImageNet pretraining (Swin-UMamba-D, SegFormer) dominate in accuracy and generalization. Beyond raw progress in CNNs, design choices proved decisive: encoding multi-scale anatomical priors and leveraging transfer learning mattered more than chasing incremental architecture permutations, which argues for prioritizing clinically meaningful inductive biases such as multi-scale hierarchies and boundary awareness in future model innovation. Moreover, efficiency, as captured by the PET score, is not optional. As a pragmatic choice, SegFormer's PET profile suggests that a slightly lower intersection over union (IoU) can be acceptable when it buys real-time clinical utility in terms of inference time, memory footprint, and integration complexity. In constrained clinical environments, AI-assisted endoscopy tools need to be optimized for edge inference, regulatory documentation, and user acceptance, not merely IoU.
| Model name | Architecture family | Pretraining | Mean IoU, % | Mean Dice, % | PET score, % | GRR, % | Inference latency (milliseconds/frame) | Model size | Notes |
| U-Net | CNN | N/A | 74.88 | (Self 79.37 + EDD2020 67.63)/2 = 73.50% | 45.74 | 65.41 | 3.64 | 31.46 M, approximately 126 MB | Classic encoder-decoder with skip connections; excellent throughput with very high FPS but lower PET; simple architecture may limit generalization |
| ResNet + U-Net | CNN | ResNet encoder typically ImageNet-1K | 80.63 | (83.59 + 73.97)/2 = 78.78% | 82.58 | 67.03 | 5.18 | 32.52 M, approximately 130 MB | Residual backbone improves accuracy and robustness with modest compute; good PET |
| ConvNeXt + UPerNet | CNN | Typically ImageNet-1K | 82.69 | (85.65 + 76.90)/2 = 81.27% | 84.70 | 68.79 | 5.65 | 41.37 M, approximately 165 MB | Modern CNN with ViT-inspired design; strong accuracy and good throughput |
| M2SNet | CNN | N/A | 80.87 | (84.24 + 74.81)/2 = 79.53% | 72.33 | 67.64 | 14.86 | 29.89 M, approximately 120 MB | Multi-scale subtraction units for improved feature complementarity and edge clarity; slower inference among CNNs |
| Dilated SegNet | CNN | ResNet50 backbone | 80.55 | (83.35 + 73.64)/2 = 78.49% | 74.88 | 68.08 | 9.23 | 18.111 M, approximately 72 MB | Dilated convolutions for real-time polyp segmentation; good trade-off, moderate speed |
| PraNet | CNN | N/A | 73.74 | (74.15 + 61.12)/2 = 67.63% | 48.81 | 65.79 | 12.16 | 32.56 M, approximately 130 MB | Parallel partial decoder + reverse attention for boundary refinement; decent accuracy but lower PET and generalization |
| SwinV2 + UPerNet | Transformer | SwinV2 typically ImageNet-pretrained | 82.74 | (85.59 + 76.97)/2 = 81.28% | 78.41 | 68.12 | 12.19 | 41.91 M, approximately 168 MB | SwinV2 hierarchical transformer backbone with UPerNet decoder; strong accuracy with moderate compute |
| SegFormer | Transformer | Typically ImageNet-1K | 82.86 | (93.14 + 77.20)/2 = 85.17% | 92.02 | 70.11 | 9.68 | 24.73 M, approximately 99 MB | Excellent performance-efficiency balance; low FLOPs (4.23 GFLOPs), good generalization: Recommended for real-time clinical use per paper |
| SETR-MLA | Transformer | ViT backbone | 77.42 | (82.14 + 71.48)/2 = 76.81% | 52.45 | 69.67 | 5.55 | 90.77 M, approximately 363 MB | Segmentation transformer with multi-level aggregation; large parameter count but relatively fast inference in this setup |
| TransUNet | Hybrid (CNN + transformer) | ResNet50 + ViT | 74.81 | (77.35 + 65.06)/2 = 71.21% | 26.39 | 67.14 | 13.03 | 105.00 M, approximately 420 MB | Combines CNN encoder and ViT: Strong representational power but heavy compute and lower PET |
| PVTV2 + EMCAD | Transformer | PVTV2 usually ImageNet-pretrained | 82.91 | (85.81 + 77.07)/2 = 81.44% | 88.14 | 71.38 | 12.56 | 26.77 M, approximately 107 MB | Pyramid vision transformer v2 + efficient multi-scale decoding; strong generalization and good PET |
| FCBFormer | Transformer | PVTV2 backbone | 82.00 | (85.18 + 76.03)/2 = 80.61% | 61.89 | 71.52 | 21.43 | 33.09 M, approximately 132 MB | Polyp-specialized transformer variant; strong generalization (highest GRR) but highest inference time among many models, limiting real-time use at high resolution |
| Swin-UMamba | Hybrid (Mamba based) | N/A | 79.45 | (81.23 + 71.12)/2 = 76.18% | 53.31 | 70.93 | 13.00 | 59.89 M, approximately 240 MB | Mamba hybrid leveraging visual-state-space-model; good generalization but relatively large and slower training/inference cost |
| Swin-UMamba-D | Hybrid (Mamba based) | N/A | 83.29 | (86.15 + 77.53)/2 = 81.84% | 88.39 | 69.36 | 12.97 | 27.50 M, approximately 110 MB | Best segmentation performance (average IoU) among study but relatively high training and inference cost; strong segmentation accuracy but moderate generalization |
| UMamba-Bot | Hybrid (Mamba based) | N/A | 71.61 | (75.03 + 61.79)/2 = 68.41% | 39.45 | 64.78 | 6.27 | 28.77 M, approximately 115 MB | Lightweight mamba variant; good FPS but weaker accuracy and generalization |
| UMamba-Enc | Hybrid (Mamba based) | N/A | 71.28 | (75.24 + 61.82)/2 = 68.53% | 37.15 | 65.33 | 7.28 | 27.56 M, approximately 110 MB | Encoder-focused mamba variant; similar trade-offs to UMamba-Bot: Faster but lower accuracy |
| VM-UNETV2 | Hybrid (Mamba like) | N/A | 81.63 | (84.36 + 74.89)/2 = 79.63% | 83.48 | 69.49 | 12.90 | 22.77 M, approximately 91 MB | VM-UNET encoder variant; strong PET and competitive accuracy, GPU-focused design |
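For readers less familiar with the pixel-level metrics summarized in Table 1, the following is a minimal, illustrative sketch, not the authors' evaluation code, of how per-class IoU and Dice are typically computed from predicted and ground-truth label maps before being averaged into the reported means; the class indices and toy masks are hypothetical.

```python
# Illustrative per-class IoU and Dice from integer label maps (hypothetical data).
import numpy as np

def iou_dice(pred: np.ndarray, target: np.ndarray, num_classes: int):
    """Return per-class (IoU, Dice); classes absent from both maps yield NaN."""
    ious, dices = [], []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        denom = p.sum() + t.sum()
        ious.append(inter / union if union else np.nan)
        dices.append(2 * inter / denom if denom else np.nan)
    return np.array(ious), np.array(dices)

# Toy example: a 4x4 frame with background (0) and two lesion classes (1, 2).
pred   = np.array([[0, 1, 1, 0], [0, 1, 2, 2], [0, 0, 2, 2], [0, 0, 0, 0]])
target = np.array([[0, 1, 1, 0], [0, 1, 1, 2], [0, 0, 2, 2], [0, 0, 0, 0]])
iou, dice = iou_dice(pred, target, num_classes=3)
print("mean IoU:", np.nanmean(iou), "mean Dice:", np.nanmean(dice))
```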
Dataset diversity and annotation rigor have remained the bottlenecks: the self-collected dataset substantially improved class coverage and size relative to EDD2020, yet single-region sourcing and class imbalance remained persistent limitations.
Chan et al's comparative methodology[9] addresses a longstanding need for reproducibility and fair benchmarking in gastrointestinal DL-based computer vision research, using shared metrics and standardized ground truth. Several contextual realities make this work timely and consequential (Table 2). First, missed and late diagnoses remain a serious global health problem in upper GI disease, especially for early cancer[19]. Second, DL-based research has mainly focused on architectural novelty and leaderboard performance without sufficient consideration of clinical constraints such as interpretability and domain shift[20]. Third, multi-class segmentation is more clinically practical than single-target tasks because clinicians frequently distinguish among multiple concurrent pathologies, such as ulcers, polyps, and varices, but it is intrinsically harder, and most prior systems could not be relied upon robustly[21]. Chan et al[9] aimed to bridge these gaps by expanding dataset diversity and disease coverage and by contrasting architectural families under the same evaluation protocol.
| Translational challenge | Concrete solution | Responsible stakeholders | Measurable success metrics |
| Dataset bias and limited diversity | Multi-center data sharing agreements; standardized metadata schema (demographics, device, protocol); stratified sampling and targeted collection for underrepresented cohorts; federated learning to enable cross-site models while preserving privacy | Clinical consortiums, data governance teams, hospital IT, study PIs, legal/compliance | Number of centers and countries represented; device/vendor diversity index; demographic coverage (age/sex/ethnicity) proportions; change in GRR and external IoU on held-out sites |
| Annotation variability and subjectivity | Develop and enforce standardized annotation protocol and labeling guidelines; multi-expert consensus labeling; adjudication workflows; active learning to prioritize ambiguous cases; periodic re-annotation audits | Clinical experts (gastroenterologists), annotation managers, platform vendors, data scientists | Inter-rater agreement (Cohen’s kappa/mean IoU across annotators); % masks adjudicated; annotation time per case; model performance gains after consensus labels |
| Class imbalance/rare pathology sensitivity | Oversampling/targeted collection of rare classes; class-aware loss functions (focal, class-weighted; a minimal loss sketch follows this table); synthetic data and augmentation for rare classes; curriculum learning focusing on rare classes | Data acquisition teams, ML engineers, clinical partners, biostatisticians | Per-class recall/sensitivity (especially for rare classes); AUPRC for rare classes; reduction in false-negative rate for underrepresented labels |
| Imaging variability (lighting, specular reflection, motion blur) | Advanced preprocessing (illumination normalization, reflection removal), robust augmentation (exposure, blur, specular sim), self-supervised pretraining on large unlabeled endoscopy corpora; spatio-temporal modeling for videos | ML research team, imaging engineers, clinical endoscopy unit, vendors | Performance stratified by exposure/quality buckets (IoU under overexposed vs normal); reduction in failure cases linked to artifacts; frame-level temporal consistency metrics (temporal IoU) |
| Poor cross-dataset generalization/overfitting | Cross-dataset evaluation, domain adaptation techniques, federated or multi-site training, hold-out external validation sets, regularization and ensembling | ML engineers, external collaborators, validation leads, statisticians | Delta IoU/Dice between internal test and external test sets; GRR improvement on external cohorts; calibration metrics (Brier score) |
| Real-time performance and resource constraints (PET) | Model compression and pruning; lightweight architectures (e.g., SegFormer variants); hardware benchmark targeting (edge GPU/CPU); optimized inference pipelines | ML engineers, DevOps, clinical IT, hardware vendors | Inference latency (milliseconds/frame), throughput (fps) on target hardware; memory usage, FLOPs; PET score or task-specific tradeoff metric; clinician acceptance for live use |
| Clinical validation and impact on workflow | Prospective clinical studies, reader studies comparing model + clinician vs clinician alone; integration pilots in endoscopy suite; user-centred UI/UX design and training | Clinical investigators, hospital operations, human factors specialists, clinical IT | Diagnostic accuracy improvement (sensitivity/specificity) in prospective trials; change in missed-lesion rate; time-to-report; clinician satisfaction and adoption rates |
| Trust, explainability and clinician acceptance | Provide visual explanations (attention maps, uncertainty overlays); case-level confidence scores; reporting of failure modes and limitations; clinician training modules | ML explainability team, clinical educators, product managers, regulatory/QA | Proportion of model outputs with uncertainty flags; clinician trust scores in surveys; reduction in dismissed correct alerts; explainability usability ratings |
| Privacy, legal & regulatory readiness | Data de-identification pipeline, DPIAs, early engagement with regulators, pre-specified validation plan, post-market surveillance plan, robust audit trails | Legal/compliance, regulatory affairs, data governance, QA, cybersecurity | Completion of DPIA and IRB approvals; regulatory submission milestones (pre-submission, submission, approvals); number of privacy incidents; time to resolve security findings |
| Multi-modal and longitudinal integration | Design multi-modal models (image + report + temporal video), link endoscopy frames with pathology/EMR metadata, adopt interoperable standards (DICOM/HL7/FHIR) | Data engineers, clinical informatics, pathology, ML researchers, standards officers | Increase in model performance when adding modalities (delta IoU/Dice); % cases with linked pathology; successful end-to-end FHIR/DICOM integrations; improvement in clinically-relevant outcome measures (e.g., appropriate biopsy rate) |
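As a concrete illustration of the class-aware loss functions recommended in the class-imbalance row of Table 2, below is a minimal PyTorch sketch of a class-weighted focal loss for multi-class segmentation. The number of classes, weights, and gamma value are illustrative assumptions, not the training recipe used by Chan et al.

```python
# Illustrative class-weighted focal loss for multi-class segmentation (PyTorch).
import torch
import torch.nn.functional as F

def focal_loss(logits, target, class_weights, gamma=2.0):
    """logits: (N, C, H, W); target: (N, H, W) integer class labels."""
    log_p = F.log_softmax(logits, dim=1)
    # Per-pixel weighted cross-entropy, kept unreduced so the focal term can modulate it.
    ce = F.nll_loss(log_p, target, weight=class_weights, reduction="none")
    # Probability assigned to the true class at each pixel.
    p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_t) ** gamma * ce).mean()

# Toy usage with nine disease classes; rare classes receive larger (hypothetical) weights.
logits = torch.randn(2, 9, 64, 64, requires_grad=True)
target = torch.randint(0, 9, (2, 64, 64))
weights = torch.ones(9)
weights[7:] = 5.0
loss = focal_loss(logits, target, weights)
loss.backward()
print(float(loss))
```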
This kind of comparative study involving multiple DL segmentation models shows clear strengths in boundary delineation and computational efficiency. However, biases emerge when models are ranked merely on task metrics such as IoU, Dice, sensitivity, or specificity, with the implication that higher numbers will lead to better clinical decisions. The training and validation design of comparative studies can confound external validity: selection bias arises when initial models are trained on curated datasets collected at tertiary centers with higher disease prevalence and expert operators, which limits generalizability to other settings. Such biases are difficult to eliminate through data augmentation or synthetic images. Within a comparative design, a superior metric may only indicate the model's fit to the specific dataset rather than genuine clinical utility.
Without careful regularization and external validation, DL models often overfit idiosyncrasies of image acquisition rather than pathology, meaning that a model with marginally better Dice scores may not remain reliable when exposed to different scopes or endoscopists. Several studies have reported unsatisfactory explainability in DL-based models: clinicians with limited AI knowledge require intelligible reasoning, such as heatmaps and bounding boxes tied to interpretable features, to accept an advanced model's output[22,23]. Comparative superiority in raw metrics says little about how easily a model can be interrogated in clinical practice. In addition, human-AI interaction matters because a DL model's value depends partly on how it alters endoscopist behavior; it remains unclear whether an automated segmentation model that excels at delineating gastric intestinal metaplasia boundaries would actually increase biopsy yield at meaningful sites or improve clinical decision making[24]. Given the occurrence of false positives, comparative evaluations must examine not only per-frame metrics but also their clinical consequences, to avoid unnecessary interventions. Moreover, given the requirements of regulatory approval, robust post-market surveillance, and cybersecurity, models with superior performance in a comparative analysis may still fail to scale. Clinical equity is likewise hard to assure, because training datasets often have limited demographic, ethnic, and equipment diversity.
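One low-cost step toward the interpretability discussed above is to surface the model's own per-pixel confidence alongside its masks. The sketch below is illustrative only, not a validated explainability method; the entropy threshold and the stand-in model output are assumptions. It derives a normalized uncertainty map from softmax outputs that could be overlaid on the endoscopic frame to flag low-confidence regions for the endoscopist.

```python
# Illustrative per-pixel uncertainty map from segmentation logits (PyTorch).
import torch

def uncertainty_map(logits: torch.Tensor) -> torch.Tensor:
    """logits: (C, H, W) -> normalized entropy in [0, 1]; higher means less certain."""
    probs = torch.softmax(logits, dim=0)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=0)
    return entropy / torch.log(torch.tensor(float(logits.shape[0])))

# Stand-in for a segmentation network's output on one frame (9 classes, 256x256).
logits = torch.randn(9, 256, 256)
unc = uncertainty_map(logits)
flagged = (unc > 0.8).float().mean()  # fraction of pixels above an illustrative threshold
print(f"{100 * flagged:.1f}% of pixels flagged as uncertain")
```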
Comparative studies of multi-class segmentation models need a multi-dimensional evaluation framework to move beyond superficial comparisons and achieve clinical translation. Multi-site external validation should compare models on demographically diverse test sets to simulate community practice and non-expert operators. Segmentation metrics should be tightly integrated with clinically relevant outcomes, such as biopsy yield and treatment decisions. It is also warranted to evaluate sensitivity to nuisance factors, partial occlusion, and prevalence shifts. Observational human-AI interaction trials should have endoscopists use model outputs in a simulated or live environment to measure behavioral changes. Cost-effectiveness analyses should examine computational costs and annotation burden to project the impact on health-economic metrics. Finally, model performance should be reported stratified by patient subgroup and scope equipment, as sketched below.
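As a minimal sketch of the subgroup-stratified reporting argued for above (the per-case IoU values, scope vendors, and age bands are hypothetical), per-case performance can simply be grouped by acquisition and patient strata so that gaps are visible rather than averaged away.

```python
# Illustrative subgroup-stratified reporting of per-case IoU (hypothetical data).
import pandas as pd

cases = pd.DataFrame({
    "iou":       [0.86, 0.81, 0.62, 0.78, 0.59, 0.84],
    "scope":     ["vendor_A", "vendor_A", "vendor_B", "vendor_A", "vendor_B", "vendor_A"],
    "age_group": ["<60", ">=60", ">=60", "<60", ">=60", "<60"],
})

report = (cases.groupby(["scope", "age_group"])["iou"]
               .agg(["mean", "count"])
               .rename(columns={"mean": "mean_iou", "count": "n_cases"}))
print(report)
```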
Several structural factors converge to explain why translation has lagged despite rapid architectural progress. First, misaligned incentives: academic and industry incentives reward novelty and benchmark gains, including new architectures and leaderboard rank, more directly than labor-intensive clinical validation, so teams optimize for leaderboard performance rather than deployment readiness.
To move from exhortation to action, we propose a concrete four-pillar evaluation blueprint to operationalize clinical translation for multi-class endoscopic segmentation systems. Each pillar defines measurable requirements, minimum datasets/experiments, and pragmatic pass/fail criteria.
What to measure: Procedure-level and patient-relevant endpoints based on changes in biopsy yield at clinically meaningful sites, reduction in missed lesions per procedure, procedure time saved, and improvement in novice endoscopist diagnostic accuracy.
Minimum evaluation: Report both pixel-level metrics (IoU/Dice) and at least two procedure-level outcomes in retrospective or simulated workflows; include lesion-level sensitivity/specificity and time-to-decision if applicable.
Pass/fail criterion: Demonstrate a pre-specified clinically meaningful improvement in at least one procedure-level outcome in simulated or pilot clinical use, for example, increase in targeted biopsy yield or reduction in missed lesions.
What to measure: Cross-site performance, device/manufacturer stratification, lighting and prep variability, and performance across demographic subgroups such as age, sex, and ethnicity.
Minimum evaluation: Multi-site external validation across at least three geographically and device-diverse centers; report GRR and per-subgroup performance with confidence intervals.
Pass/fail criterion: A minimum GRR threshold aligned with a pre-specified regulatory standard (for instance, GRR ≥ 0.8) and no clinically significant performance degradation in any defined demographic or device subgroup.
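As one way to operationalize this generalizability pillar, the sketch below computes a GRR-like ratio, here assumed to be external-site mean IoU divided by internal-site mean IoU (the paper's exact GRR formula is not reproduced in this editorial), together with a bootstrap confidence interval; the per-case IoU values are simulated placeholders.

```python
# Illustrative cross-site generalization check with a bootstrap CI (simulated data).
import numpy as np

rng = np.random.default_rng(0)
internal_iou = rng.normal(0.83, 0.05, size=200)   # hypothetical per-case IoU, internal test set
external_iou = rng.normal(0.74, 0.08, size=150)   # hypothetical per-case IoU, external site

def grr(internal, external):
    """Assumed GRR-like ratio: external mean performance relative to internal."""
    return external.mean() / internal.mean()

boot = [grr(rng.choice(internal_iou, internal_iou.size),
            rng.choice(external_iou, external_iou.size)) for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"GRR-like ratio: {grr(internal_iou, external_iou):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```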
What to measure: Effects on clinician behavior and safety, including automation bias, alert fatigue, and changes to biopsy selection; interpretability/usability; and failure-mode analysis.
Minimum evaluation: Observational human-AI interaction studies in simulated or live settings measuring decision change, task time, false positive-driven unnecessary interventions, and clinician trust/usability in accordance with standardized instruments.
Pass/fail criterion: Absence of net harm with no increase in unnecessary interventions and no clinically significant deterioration in decision quality; acceptable usability scores; pre-specified mitigation strategies for common failure modes.
What to measure: Inference latency/memory for target deployment environment, annotation cost and reproducibility, cybersecurity/data governance posture, and cost-effectiveness.
Minimum evaluation: Profile edge performance for real-time feasibility, conduct an annotation variability analysis, and build a basic health-economic model, such as projected cost per additional lesion detected.
Pass/fail criterion: Real-time inference on intended clinical hardware, standardized annotation protocol with inter-rater agreement above a threshold, and a positive or justified cost-effectiveness estimate for the intended use case.
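To make the real-time requirement concrete, the following minimal sketch (a placeholder network and input size, not a formal benchmarking protocol) times repeated forward passes on the intended hardware and reports milliseconds per frame and frames per second, the quantities used in Table 1 and in this pillar's pass/fail criterion.

```python
# Illustrative latency/throughput profiling with a placeholder network (PyTorch).
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 9, 1)).eval()   # stand-in for the segmentation network
frame = torch.randn(1, 3, 512, 512)                 # one endoscopic frame (assumed resolution)

with torch.no_grad():
    for _ in range(10):                             # warm-up iterations
        model(frame)
    start = time.perf_counter()
    n = 100
    for _ in range(n):
        model(frame)
    elapsed = time.perf_counter() - start

print(f"{1000 * elapsed / n:.2f} ms/frame, {n / elapsed:.1f} FPS")
```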
Integration of these pillars moves evaluation from isolated technical metrics to a reproducible and clinically meaningful validation pathway, which can be operationalized in pre-clinical trials and regulatory submissions.
Chan et al[9] presented compelling evidence that DL-based multi-class segmentation is nearing clinical viability for UGI endoscopy. Their comparative study of DL models for multi-class upper GI segmentation is a necessary and informative exercise that underscores the importance of architectural design, pretraining, and multi-scale feature modeling. It also incisively addresses the translational bottlenecks, including dataset diversity, annotation uncertainty, and evaluation standards. The study calls for the next phase of endoscopic AI to unite technical ingenuity with clinical rigor, multi-institutional collaboration, and ethically grounded validation. Only by following such a path can automated segmentation mature into a practice-changing tool for gastrointestinal care.
1. Bhat P, Kaffes AJ, Lassen K, Aabakken L. Upper gastrointestinal endoscopy in the surgically altered patient. Dig Endosc. 2024;36:1077-1093.
2. Veitch AM, Uedo N, Yao K, East JE. Optimizing early upper gastrointestinal cancer detection at endoscopy. Nat Rev Gastroenterol Hepatol. 2015;12:660-667.
3. He Q, Bano S, Ahmad OF, Yang B, Chen X, Valdastri P, Lovat LB, Stoyanov D, Zuo S. Deep learning-based anatomical site classification for upper gastrointestinal endoscopy. Int J Comput Assist Radiol Surg. 2020;15:1085-1094.
4. Yan T, Wong PK, Qin YY. Deep learning for diagnosis of precancerous lesions in upper gastrointestinal endoscopy: A review. World J Gastroenterol. 2021;27:2531-2544.
5. Sharma P, Hassan C. Artificial Intelligence and Deep Learning for Upper Gastrointestinal Neoplasia. Gastroenterology. 2022;162:1056-1066.
6. Tokat M, van Tilburg L, Koch AD, Spaander MCW. Artificial Intelligence in Upper Gastrointestinal Endoscopy. Dig Dis. 2022;40:395-408.
7. Ren X, Zhou W, Yuan N, Li F, Ruan Y, Zhou H. Prompt-based polyp segmentation during endoscopy. Med Image Anal. 2025;102:103510.
8. Wang S, Cong Y, Zhu H, Chen X, Qu L, Fan H, Zhang Q, Liu M. Multi-Scale Context-Guided Deep Network for Automated Lesion Segmentation With Endoscopy Images of Gastrointestinal Tract. IEEE J Biomed Health Inform. 2021;25:514-525.
9. Chan IN, Wong PK, Yan T, Hu YY, Chan CI, Qin YY, Wong CH, Chan IW, Lam IH, Wong SH, Li Z, Gao S, Yu HH, Yao L, Zhao BL, Hu Y. Assessing deep learning models for multi-class upper endoscopic disease segmentation: A comprehensive comparative study. World J Gastroenterol. 2025;31:111184.
10. Nakao E, Yoshio T, Kato Y, Namikawa K, Tokai Y, Yoshimizu S, Horiuchi Y, Ishiyama A, Hirasawa T, Kurihara N, Ishizuka N, Ishihara R, Tada T, Fujisaki J. Randomized controlled trial of an artificial intelligence diagnostic system for the detection of esophageal squamous cell carcinoma in clinical practice. Endoscopy. 2025;57:210-217.
11. Li SW, Zhang LH, Cai Y, Zhou XB, Fu XY, Song YQ, Xu SW, Tang SP, Luo RQ, Huang Q, Yan LL, He SQ, Zhang Y, Wang J, Ge SQ, Gu BB, Peng JB, Wang Y, Fang LN, Wu WD, Ye WG, Zhu M, Luo DH, Jin XX, Yang HD, Zhou JJ, Wang ZZ, Wu JF, Qin QQ, Lu YD, Wang F, Chen YH, Chen X, Xu SJ, Tung TH, Luo CW, Ye LP, Yu HG, Mao XL. Deep learning assists detection of esophageal cancer and precursor lesions in a prospective, randomized controlled study. Sci Transl Med. 2024;16:eadk5395.
12. Ebigbo A, Messmann H, Lee SH. Artificial Intelligence Applications in Image-Based Diagnosis of Early Esophageal and Gastric Neoplasms. Gastroenterology. 2025;169:396-415.e2.
13. Glissen Brown JR, Mansour NM, Wang P, Chuchuca MA, Minchenberg SB, Chandnani M, Liu L, Gross SA, Sengupta N, Berzin TM. Deep Learning Computer-aided Polyp Detection Reduces Adenoma Miss Rate: A United States Multi-center Randomized Tandem Colonoscopy Study (CADeT-CS Trial). Clin Gastroenterol Hepatol. 2022;20:1499-1507.e4.
14. Wu L, Shang R, Sharma P, Zhou W, Liu J, Yao L, Dong Z, Yuan J, Zeng Z, Yu Y, He C, Xiong Q, Li Y, Deng Y, Cao Z, Huang C, Zhou R, Li H, Hu G, Chen Y, Wang Y, He X, Zhu Y, Yu H. Effect of a deep learning-based system on the miss rate of gastric neoplasms during upper gastrointestinal endoscopy: a single-centre, tandem, randomised controlled trial. Lancet Gastroenterol Hepatol. 2021;6:700-708.
15. Wallace MB, Sharma P, Bhandari P, East J, Antonelli G, Lorenzetti R, Vieth M, Speranza I, Spadaccini M, Desai M, Lukens FJ, Babameto G, Batista D, Singh D, Palmer W, Ramirez F, Palmer R, Lunsford T, Ruff K, Bird-Liebermann E, Ciofoaia V, Arndtz S, Cangemi D, Puddick K, Derfus G, Johal AS, Barawi M, Longo L, Moro L, Repici A, Hassan C. Impact of Artificial Intelligence on Miss Rate of Colorectal Neoplasia. Gastroenterology. 2022;163:295-304.e5.
16. Jong MR, Jaspers TJM, Kusters CHJ, Jukema JB, van Eijck van Heslinga RAH, Fockens KN, Boers TGW, Visser LS, van der Putten JA, van der Sommen F, de With PH, de Groof AJ, Bergman JJ; BONS-AI consortium. Challenges in Implementing Endoscopic Artificial Intelligence: The Impact of Real-World Imaging Conditions on Barrett's Neoplasia Detection. United European Gastroenterol J. 2025;13:929-937.
17. Nathani P, Sharma P. Role of Artificial Intelligence in the Detection and Management of Premalignant and Malignant Lesions of the Esophagus and Stomach. Gastrointest Endosc Clin N Am. 2025;35:319-353.
18. Luo H, Xu G, Li C, He L, Luo L, Wang Z, Jing B, Deng Y, Jin Y, Li Y, Li B, Tan W, He C, Seeruttun SR, Wu Q, Huang J, Huang DW, Chen B, Lin SB, Chen QM, Yuan CM, Chen HX, Pu HY, Zhou F, He Y, Xu RH. Real-time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case-control, diagnostic study. Lancet Oncol. 2019;20:1645-1654.
19. Danpanichkul P, Auttapracha T, Kongarin S, Ponvilawan B, Simadibrata DM, Duangsonk K, Jaruvattanadilok S, Saowapa S, Suparan K, Lui RN, Liangpunsakul S, Wallace MB, Wijarnpreecha K. Global epidemiology of early-onset upper gastrointestinal cancer: trend from the Global Burden of Disease Study 2019. J Gastroenterol Hepatol. 2024;39:1856-1868.
20. Neri A, Penza V, Baldini C, Mattos LS. Surgical augmented reality registration methods: A review from traditional to deep learning approaches. Comput Med Imaging Graph. 2025;124:102616.
21. Weisman AJ, Huff DT, Govindan RM, Chen S, Perk TG. Multi-organ segmentation of CT via convolutional neural network: impact of training setting and scanner manufacturer. Biomed Phys Eng Express. 2023;9.
22. Habe TT, Haataja K, Toivanen P. Review of Deep Learning Performance in Wireless Capsule Endoscopy Images for GI Disease Classification. F1000Res. 2024;13:201.
23. Krenzer A, Heil S, Fitting D, Matti S, Zoller WG, Hann A, Puppe F. Automated classification of polyps using deep learning architectures and few-shot learning. BMC Med Imaging. 2023;23:59.
24. Campion JR, O'Connor DB, Lahiff C. Human-artificial intelligence interaction in gastrointestinal endoscopy. World J Gastrointest Endosc. 2024;16:126-135.
Open Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/Licenses/by-nc/4.0/
