This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Received: October 9, 2025 Revised: November 10, 2025 Accepted: November 25, 2025 Published online: February 15, 2026 Processing time: 122 Days and 21.1 Hours
Abstract
A core challenge in the diagnosis and treatment of esophageal cancer (EC) lies in accurately identifying patients who will benefit from neoadjuvant therapy (NAT). Yang et al reported a predictive model for NAT response in EC, constructed using radiomics from T2-weighted magnetic resonance imaging (MRI) and machine learning. The model achieved an area under the curve of 0.932 in the training cohort and 0.900 in the validation cohort. While these results are encouraging, we urge caution because of several limitations. First, the study’s single-center, retrospective design with an insufficient sample size limits the model’s generalizability and significantly increases the risk of overfitting. Second, the study extracted features only from the T2-weighted MRI sequence and did not integrate data from functional MRI sequences such as diffusion-weighted imaging and dynamic contrast-enhanced MRI. Third, the model suffers from a "black box" issue regarding its extracted features: its low interpretability hinders clinicians’ trust in and acceptance of the model. This editorial reviews the study by Yang et al, identifies its limitations, and puts forward suggestions to further optimize the model.
Core Tip: Accurate prediction of neoadjuvant therapy response in esophageal cancer (EC) is critical to avoid ineffective treatment. Yang et al developed a non-invasive radiomics model based on T2-weighted magnetic resonance imaging (T2WI); however, it has several key limitations. These include a single-center, small-sample retrospective design, exclusive reliance on the T2WI sequence, inadequate stability of manual segmentation, and insufficient clinical interpretability. Corresponding optimization suggestions are proposed in this editorial. Only through multicenter validation, multidisciplinary collaboration, and multimodal integration can radiomics-based machine learning models be truly translated into clinical practice, thereby supporting personalized treatment decision-making for patients with EC.
Citation: Zhao ZX. Radiomics-based model for predicting neoadjuvant therapy response in esophageal cancer: Limitations and suggestions. World J Gastrointest Oncol 2026; 18(2): 114981
Globally, esophageal cancer (EC) is a digestive tract tumor characterized by high malignancy and poor prognosis[1,2]. The application of neoadjuvant therapy (NAT) has brought hope for tumor downstaging and an improved surgical R0 resection rate in patients with locally advanced EC. However, patients' responses to the treatment are markedly heterogeneous[3]. Some patients not only fail to achieve clinical benefit because of treatment resistance, but also suffer treatment-related toxicity and side effects. In certain cases, surgery may even be delayed beyond its optimal timing[4]. All these factors make accurate prediction of the response to NAT a crucial link in the management of EC. Recently, a retrospective study conducted by Yang et al[5] and published in the World Journal of Gastrointestinal Oncology combined T2-weighted magnetic resonance imaging (MRI) with multiple machine learning algorithms. The model achieved an area under the curve of 0.932 in the training cohort and 0.900 in the validation cohort. This study offers a novel non-invasive approach to predicting the response of EC to NAT and brings new impetus to the development of precision diagnosis and treatment in this field. It focuses closely on a core clinical challenge in the management of NAT for EC, has a rigorous methodological design, and combines innovation with clinical value. However, when we shift our perspective from theoretical research to clinical translation, it becomes evident that the model still has limitations that need to be addressed. This editorial therefore systematically analyzes these limitations and offers constructive recommendations to guide future work, thereby advancing the model toward clinical utility.
LIMITATIONS
Single-center retrospective design
A fundamental limitation of Yang et al’s study[5] is its single-center, small-sample (n = 132) retrospective design. This design significantly undermines the generalizability of the radiomics model and raises concerns about its applicability in varied clinical settings. First, the homogeneity of a single-center patient population fails to reflect the broader clinical diversity of EC patients. The study cohort likely shares similar baseline characteristics, such as regional epidemiological features (e.g., high prevalence of squamous cell carcinoma and frequent occurrence of tumors in the middle third of the esophagus), relatively limited local treatment protocols (e.g., uniform NAT regimens), and even overlapping socioeconomic factors[6]. In real-world clinical practice, however, patient populations and diagnosis-treatment scenarios differ significantly across regions and medical centers. In terms of tumor subtypes, adenocarcinoma is more prevalent in Western countries, while squamous cell carcinoma dominates in Asia[7]. In terms of treatment strategies, medical centers show clear preferences in regimen selection; for instance, the adoption rate of immunotherapy combined with NAT and the specific combination of chemotherapeutic drugs vary considerably[8]. In terms of patients' individual conditions, comorbidity profiles also differ greatly, with significant variation in the incidence of diseases such as diabetes and cardiovascular disease across cohorts. A model trained on a homogeneous single-center cohort may struggle to adapt to these variations.
Second, retrospective data collection introduces selection bias and hinders data standardization[9,10]. The study relies on historical medical records and archived MRI scans, which inherently vary in documentation completeness and imaging acquisition standardization[11,12]. In the study from Yang et al[5], all MRI examinations were conducted using a single 3.0T Magnetom Skyra system (Siemens Healthineers, Germany) with standardized parameters, which ensures internal data consistency but also creates a "technology-specific" issue. Most clinical centers worldwide use MRI equipment from different manufacturers (e.g., GE Healthcare and Philips), and scanning protocols [such as slice thickness and motion artifact reduction techniques for T2-weighted MRI (T2WI) sequences] differ[13,14]. When the model is applied to images acquired via other equipment or protocols, the extracted radiomic features may not match the training data, leading to reduced model performance.
Additionally, Yang et al’s study[5] is limited by a relatively small sample size. In machine learning, the risk of overfitting generally rises as the sample size falls[15,16]. When the training set is too small, the model may mistake incidental features of the training data (such as artifacts and image noise) for tumor-specific features, and so "learn inaccurately". An insufficiently sized validation set may also fail to detect this problem in a timely manner[17]. Moreover, because the training and validation sets are drawn from a single center, the imaging features of the patients are highly similar, which can make the model appear highly accurate. Once applied to external centers, however, its accuracy may drop sharply.
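The risk of mistaking incidental features for tumor-specific ones can be illustrated with a small simulation. The sketch below is not part of Yang et al's pipeline. It assumes a hypothetical cohort of 40 patients and 200 candidate features that are pure noise, keeps the feature with the highest training AUC, and then re-evaluates the same feature in an independent noise cohort standing in for an external center. The cohort size, feature count, and function names are illustrative assumptions.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a responder outscores a non-responder (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def noise_cohort(rng, n_patients, n_features):
    """Hypothetical cohort whose features are random noise, unrelated to response."""
    labels = [i % 2 for i in range(n_patients)]  # alternating responders / non-responders
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_patients)]
                for _ in range(n_features)]
    return features, labels


rng = random.Random(0)              # fixed seed so the run is repeatable
N_PATIENTS, N_FEATURES = 40, 200    # illustrative values, not taken from the study

train_x, train_y = noise_cohort(rng, N_PATIENTS, N_FEATURES)
external_x, external_y = noise_cohort(rng, N_PATIENTS, N_FEATURES)

# "Feature selection" on the training cohort: keep the feature with the best AUC.
train_aucs = [auc(f, train_y) for f in train_x]
best = max(range(N_FEATURES), key=lambda j: train_aucs[j])

apparent_auc = train_aucs[best]                    # optimistic: selected on the same data
external_auc = auc(external_x[best], external_y)   # same feature index, unseen cohort

print(f"apparent AUC of the selected noise feature: {apparent_auc:.3f}")
print(f"AUC of the same feature in the unseen cohort: {external_auc:.3f}")
```

Because no feature carries any signal here, an apparent AUC well above 0.5 is produced by selection alone, and the value in the unseen cohort is expected to scatter around 0.5.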
T2WI single-sequence dependence
This study extracted features from the T2WI sequence alone and did not integrate information from functional MRI sequences [e.g., diffusion-weighted imaging (DWI) and contrast-enhanced MRI] or other imaging examinations [e.g., computed tomography (CT) and positron emission tomography (PET)-CT]. Existing studies have confirmed that the apparent diffusion coefficient of DWI can reflect changes in tumor cell density, while dynamic contrast-enhanced MRI can evaluate tumor angiogenesis. CT can accurately assess tumor anatomy and invasion range, and PET-CT can evaluate tumor metabolic activity and treatment sensitivity[18-20]. Multiple high-quality studies have shown that multimodal models are better able to capture the characteristics of the tumor microenvironment comprehensively and to avoid the "biased learning" that single-sequence models incur from their limited feature dimensions. Table 1 presents a summary of relevant studies[21-28].
Table 1 Multiple data types included in high-quality multimodal radiomics research.
In the study by Yang et al[5], the tumor region of interest (ROI) was manually delineated by two physicians. Although the delineations were reviewed by a senior physician, the inherent inter-observer variability of manual segmentation could not be completely avoided. According to the consensus in the radiomics field, the consistency of manually segmented ROIs must reach an intraclass correlation coefficient (ICC) > 0.90 to ensure the stability and reliability of subsequently extracted radiomic features[29]. However, the ICC > 0.75 threshold adopted in this study is more lenient and can only barely be considered acceptable. Moreover, the study neither reported the inter-observer ICC value for the ROI segmentation stage nor disclosed the detailed operating procedures for manual segmentation or the key details of consistency testing. This may make it difficult for other research teams to reproduce the feature extraction process, thereby reducing the reproducibility of the study results.
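To make the ICC thresholds concrete, the following sketch computes one common form of the coefficient, ICC(2,1) (two-way random effects, absolute agreement, single reader), for a single radiomic feature extracted from the ROIs of two readers. Which ICC form the original study used is not stated in the text discussed here, so the choice of ICC(2,1), the function name, and the feature values are assumptions made for illustration only.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single reader.

    ratings[i][j] is the value of one radiomic feature for lesion i,
    extracted from the ROI drawn by reader j.
    """
    n = len(ratings)        # lesions (subjects)
    k = len(ratings[0])     # readers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: lesions, readers, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical values of one feature for five lesions, segmented by two readers.
feature_by_reader = [
    [0.82, 0.85],
    [1.40, 1.31],
    [0.55, 0.60],
    [2.10, 2.02],
    [1.05, 1.20],
]
icc = icc_2_1(feature_by_reader)
print(f"ICC(2,1) = {icc:.3f}; passes 0.75: {icc > 0.75}; passes 0.90: {icc > 0.90}")
```

In a radiomics workflow this calculation would be repeated for every extracted feature, and reporting the resulting values alongside the chosen threshold is what allows other teams to judge feature stability.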
Clinical interpretability
The insufficient clinical interpretability of the radiomics model ("black box" issue) is a key barrier restricting its acceptance by clinicians[30-32]. The study identified 10 key radiomic features through feature selection (wavelet-transformed features, gray-level run-length matrix features, etc.), but failed to analyze the association between these features and the core clinical problem (pathological response of EC to NAT). For instance, which pathological changes of the tumor do the "wavelet-transformed features" correspond to? What clinical significance do the "gray-level run-length matrix features" represent? Furthermore, can these statistically significant feature parameters truly reflect the clinical reality of EC NAT response, or are they merely "numerical games" caused by various biases (such as selection bias in single-center samples, or spurious correlations due to unified imaging conditions)? The absence of answers to these key questions makes it difficult for clinicians to understand the model's decision-making logic, thereby reducing their trust in the model's predictive results and affecting the willingness to apply the model in practical clinical scenarios.
SUGGESTIONS AND OUTLOOKS
Further validation with multicenter cohort study
Further collection of an adequate number of cases (either retrospective or prospective) from multiple centers across different regions and of various levels is of great significance for improving the reliability and generalizability of the research results. Importantly, unified patient inclusion/exclusion criteria should be developed, and a centralized data management platform should be established to ensure the completeness of clinical information and imaging data. In terms of equipment, MRI scanners from different brands (Siemens, GE, and Philips) and with different field strengths should be included. Meanwhile, uniform and rigorous image preprocessing should be performed, and the consistency of features across different devices should be reported to ensure the stability of the model in real-world clinical settings.
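One simple way to inspect and reduce scanner-related shifts in a feature is to standardize it within each scanner group. The sketch below assumes hypothetical feature values from two vendors and uses within-scanner z-scoring as a deliberately simple stand-in; it is not the preprocessing used by Yang et al, and dedicated harmonization methods (for example ComBat) are often used in the radiomics literature instead.

```python
from statistics import mean, pstdev


def group_by_scanner(values, scanners):
    """Collect the values of one feature under each scanner label."""
    groups = {}
    for value, scanner in zip(values, scanners):
        groups.setdefault(scanner, []).append(value)
    return groups


def zscore_within_scanner(values, scanners):
    """Rescale one feature to mean 0 and SD 1 separately for each scanner."""
    stats = {s: (mean(g), pstdev(g))
             for s, g in group_by_scanner(values, scanners).items()}
    rescaled = []
    for value, scanner in zip(values, scanners):
        m, sd = stats[scanner]
        rescaled.append((value - m) / sd if sd > 0 else 0.0)
    return rescaled


# Hypothetical values of one texture feature from two vendors' scanners.
values = [10.2, 11.1, 9.8, 10.6, 14.9, 15.8, 14.2, 15.3]
scanners = ["vendor_A"] * 4 + ["vendor_B"] * 4

before = {s: mean(g) for s, g in group_by_scanner(values, scanners).items()}
harmonized = zscore_within_scanner(values, scanners)
after = {s: mean(g) for s, g in group_by_scanner(harmonized, scanners).items()}

print("per-scanner means before:", before)
print("per-scanner means after: ", after)
```

A caveat of this design is that within-scanner standardization also removes genuine differences in case mix between centers, so the per-scanner statistics before and after any harmonization step are worth reporting.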
Integration of multimodal imaging features
Multimodal imaging features refer to the integration of quantitative or qualitative features from two or more imaging modalities (e.g., MRI, CT, and PET-CT), forming a multi-dimensional and multi-scale disease description system. Compared with a single modality, the core advantage of multimodal imaging features lies in the combination of multiple imaging features, which provides information complementarity and enables a relatively comprehensive description of disease characteristics. It is also a current research focus in radiomics[33]. The value of multimodal imaging features has been verified in multiple fields, including oncology, neurological diseases, and cardiovascular diseases[34,35]. With the advancement of artificial intelligence computing power and algorithms, multimodal imaging features can be further integrated with molecular biomarkers, transcriptomics, proteomics, and other omics data. This integration enables multi-dimensional and multi-faceted prediction of disease progression and prognosis[34]. The study by Yang et al[5] could further integrate imaging features from DWI, contrast-enhanced MRI, and CT to enrich the dimensionality of tumor heterogeneity characterization. Additionally, predictive performance could be further enhanced via feature concatenation or ensemble learning.
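The two fusion strategies mentioned above differ in where the modalities are combined. In the minimal sketch below, feature concatenation (early fusion) joins the per-patient feature vectors before a single model is trained, whereas a soft-voting ensemble (late fusion) averages the probabilities produced by separately trained per-modality models. All feature values, probabilities, and function names are hypothetical.

```python
def concatenate_features(*modalities):
    """Early fusion: join each patient's feature vectors from all modalities."""
    n_patients = len(modalities[0])
    if any(len(m) != n_patients for m in modalities):
        raise ValueError("every modality must cover the same patients in the same order")
    return [sum((list(m[i]) for m in modalities), []) for i in range(n_patients)]


def average_probabilities(per_model_probs, weights=None):
    """Late fusion (soft voting): weighted mean of per-model response probabilities."""
    n_models = len(per_model_probs)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_patients = len(per_model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, per_model_probs)) / total
        for i in range(n_patients)
    ]


# Hypothetical data for three patients.
t2wi = [[0.12, 1.4], [0.30, 0.9], [0.05, 2.1]]   # two T2WI features per patient
dwi = [[1.1], [0.8], [1.6]]                       # one DWI-derived feature per patient
fused = concatenate_features(t2wi, dwi)           # three features per patient

t2wi_probs = [0.80, 0.35, 0.60]                   # outputs of a T2WI-only model
dwi_probs = [0.70, 0.45, 0.90]                    # outputs of a DWI-only model
ensemble = average_probabilities([t2wi_probs, dwi_probs])

print(fused)
print(ensemble)
```

Concatenation lets one model learn interactions between modalities but enlarges the feature space, which matters in small cohorts; late fusion keeps each model small and tolerates a missing modality more easily.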
Standardization of manual segmentation and automatic segmentation models
In manual segmentation of the ROI, the criteria for defining tumor boundaries and the segmentation sequence should be further clarified, and segmentation examples of typical cases should be provided. Meanwhile, data on segmentation consistency (ICC values and how discrepancies were resolved) should be made publicly available to ensure reproducibility. To address the low efficiency and high variability of manual segmentation, deep learning-based automatic segmentation models for EC tumors (e.g., transformer architectures) can be further developed[36,37]. These models should then be trained on multicenter data to adapt to EC images from different devices and with different pathological subtypes. When applied to imaging data from new centers, such models could automatically identify device differences and adjust feature extraction parameters, thereby avoiding the performance degradation associated with manual segmentation.
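Agreement between a manual ROI and an automatic one is often summarized with the Dice similarity coefficient. The text above does not name this metric, so its use here is an assumption, as are the toy masks and the function name. The sketch below computes Dice for two binary masks of one image slice.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks of the same size."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    if len(a) != len(b):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(x and y for x, y in zip(a, b))
    size = sum(a) + sum(b)
    return 2.0 * overlap / size if size else 1.0  # two empty masks agree trivially


# Hypothetical 4 x 4 slice: 1 marks tumor pixels.
manual = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
automatic = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(f"Dice = {dice(manual, automatic):.3f}")
```

Dice measures the overlap of the masks themselves, whereas the ICC measures the stability of the features extracted from them, so the two are complementary when segmentation consistency is reported.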
Strengthening clinical interpretability
The clinical interpretability of radiomics has long been a research focus[32]. However, the inherent complexity of models, the diversity of data, discrepancies in annotations, and insufficient clinical validation pose significant hurdles for clinicians and researchers. Although complex models such as deep learning can improve predictive accuracy in radiomics, their architectures are typically highly complex, making their internal operational mechanisms difficult to understand and interpret. For instance, a model may predict that a patient with EC is sensitive to NAT, yet fail to clarify the specific reasons. Furthermore, deep learning models are usually built on correlations rather than causal relationships, which makes it challenging to identify the root causes of model predictions and to explain how the model extracts critical features from input data. This difficulty prevents clinicians from truly understanding the basis of the model’s decisions and the underlying biological significance.
In addition, medical imaging data exhibits high modal diversity (e.g., CT, MRI, and ultrasound). Different modalities have distinct data characteristics and noise distributions, which increases the difficulty of model interpretation. Meanwhile, there are significant variations in annotation standards, and annotation results may differ among annotators, which further undermines the interpretability of the model. Most radiomics studies are retrospective analyses, lacking prospective, multicenter validation. This leads to potential discrepancies between the model’s expected performance and its actual performance in clinical settings; its interpretability also cannot be effectively validated in real-world clinical environments, thereby limiting the widespread application of radiomics models in clinical practice.
Nevertheless, research on the clinical interpretability of radiomics has been increasing in recent years. For example, researchers have systematically elaborated on the breakthrough applications of radiomics in brain tumor diagnosis, prognosis, and treatment planning, with a specific focus on interpretability methods to address the "black box" problem[38]. Visualization techniques such as heatmaps, gradient visualization, and attention mechanisms have also been employed to help understand how neural networks extract features from input data[39]. Additionally, some studies have integrated biological mechanisms with explainable artificial intelligence techniques to divide tumors into subregions with clear biological significance[40,41]. Notable progress has also been made in integrating radiomics with multi-omics data. For instance, the methylation status of the O6-methylguanine-DNA methyltransferase promoter can be predicted via 18F-DOPA PET radiomics, which is critical for selecting temozolomide chemotherapy regimens[42]. Additionally, MRI-based radiomic risk scores are significantly correlated with the degree of CD8+ T cell infiltration, offering new insights for predicting the efficacy of immune checkpoint inhibitors[43].
CONCLUSION
Yang et al’s study[5] offers a valuable technical exploration for predicting NAT response in EC. Its non-invasive T2WI-based radiomics framework aligns with the trend toward precision medicine. Beyond acknowledging the study’s notable merits, we must also recognize its limitations, which in turn provide important implications for future research in this field. Specifically, this study highlights the urgency of prioritizing multidisciplinary collaboration, multicenter large-sample validation, and multimodal data integration. By focusing on these key directions, future studies can continuously optimize machine learning-based multimodal radiomics models, promote their successful translation into clinical practice, and ultimately safeguard the realization of personalized treatment for EC patients.
Morgan E, Soerjomataram I, Rumgay H, Coleman HG, Thrift AP, Vignat J, Laversanne M, Ferlay J, Arnold M. The Global Landscape of Esophageal Squamous Cell Carcinoma and Esophageal Adenocarcinoma Incidence and Mortality in 2020 and Projections to 2040: New Estimates From GLOBOCAN 2020. Gastroenterology. 2022;163:649-658.
Wang H, Jiang Z, Wang Q, Wu T, Guo F, Xu Z, Yang W, Yang S, Feng S, Wang X, Chen S, Cheng C, Chen W. Pathological response and prognostic factors of neoadjuvant PD-1 blockade combined with chemotherapy in resectable oesophageal squamous cell carcinoma. Eur J Cancer. 2023;186:196-210.
Goebl P, Wingrove J, Abdelmannan O, Brito Vega B, Stutters J, Ramos SDG, Kenway O, Rossor T, Wassmer E, Arnold DL, Collins DL, Hemingway C, Narayanan S, Chataway J, Chard D, Iglesias JE, Barkhof F, Parker GJM, Oxtoby NP, Hacohen Y, Thompson A, Alexander DC, Ciccarelli O, Eshaghi A. Enabling new insights from old scans by repurposing clinical MRI archives for multiple sclerosis research. Nat Commun. 2025;16:3149.
Deng Y, Zhang X, Song Y, Lan X. Comparison of [18F]FDG PET/CT and CT in the response assessment and clinical outcome prediction for immunotherapy in patients with advanced melanoma: a systematic review and meta-analysis. Eur Radiol. 2025.
Dunnwald LK, Doot RK, Specht JM, Gralow JR, Ellis GK, Livingston RB, Linden HM, Gadi VK, Kurland BF, Schubert EK, Muzi M, Mankoff DA. PET tumor metabolism in locally advanced breast cancer patients undergoing neoadjuvant chemotherapy: value of static versus kinetic measures of fluorodeoxyglucose uptake. Clin Cancer Res. 2011;17:2400-2409.
Lemore A, Vogt N, Oster J, Germain E, Fauvel M, Gillet R, Sirveaux F, Marie B, Sans N, Faruch M, Lapègue F, Lafourcade F, Badr S, Cotten A, Mihoubi Bouvier F, Yang S, Drapé JL, Pastor M, Thouvenin Y, Baron MP, Cyteval C, Fadli D, Fournier C, Hauger O, Ben Haj Amor M, Stacoffe N, Daubie S, Noel JB, Pialat JB, Cherix S, Zanchi F, Omoumi P, Blum A, Hossu G, Gondim Teixeira PA. Enhanced CT and MRI Focal Bone Tumor Classification with Machine Learning-based Stratification: A Multicenter Retrospective Study. Radiology. 2025;315:e232834.
Pan K, Yao F, Hong W, Xiao J, Bian S, Zhu D, Yuan Y, Zhang Y, Zhuang Y, Yang Y. Multimodal radiomics based on 18F-Prostate-specific membrane antigen-1007 PET/CT and multiparametric MRI for prostate cancer extracapsular extension prediction. Br J Radiol. 2024;97:408-414.
Bian Y, Liu C, Li Q, Meng Y, Liu F, Zhang H, Fang X, Li J, Yu J, Feng X, Ma C, Zhao Z, Wang L, Xu J, Shao C, Lu J. Preoperative Radiomics Approach to Evaluating Tumor-Infiltrating CD8+ T Cells in Patients With Pancreatic Ductal Adenocarcinoma Using Noncontrast Magnetic Resonance Imaging. J Magn Reson Imaging. 2022;55:803-814.
Footnotes
Provenance and peer review: Invited article; Externally peer reviewed.
Peer-review model: Single blind
Specialty type: Oncology
Country of origin: China
Peer-review report’s classification
Scientific Quality: Grade A, Grade D
Novelty: Grade B, Grade D
Creativity or Innovation: Grade B, Grade D
Scientific Significance: Grade C, Grade D
P-Reviewer: Li CM, MD, Professor, China; Zhou JH, MD, Associate Chief Physician, China S-Editor: Lin C L-Editor: Wang TQ P-Editor: Zhao S