Liu SC, Zhang H. Early cancer diagnosis via interpretable two-layer machine learning of plasma extracellular vesicle long RNA. World J Gastrointest Oncol 2025; 17(11): 111670 [DOI: 10.4251/wjgo.v17.i11.111670]
Corresponding Author of This Article
Shi-Cai Liu, PhD, School of Medical Information, Wannan Medical College, No. 22 Wenchang West Road, Wuhu 241002, Anhui Province, China. liushicainj@163.com
Research Domain of This Article
Gastroenterology & Hepatology
Article-Type of This Article
Basic Study
Open-Access Policy of This Article
This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Journal Information of This Article
Publication Name
World Journal of Gastrointestinal Oncology
ISSN
1948-5204
Publisher of This Article
Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA
Co-corresponding authors: Shi-Cai Liu and Han Zhang.
Author contributions: Liu SC and Zhang H collected and analyzed the data, wrote the manuscript, and made equal contributions as co-corresponding authors; Liu SC supervised the project. Both authors have read and approved the final version to be published.
Supported by Talent Scientific Research Start-up Foundation of Wannan Medical College, No. WYRCQD2023045.
Institutional review board statement: This study did not involve human participants or animal subjects; therefore, neither Institutional Review Board nor Institutional Animal Care and Use Committee approval was required.
Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.
Data sharing statement: The data that support the findings of this study are available from the authors upon reasonable request.
Received: July 7, 2025 Revised: August 7, 2025 Accepted: October 9, 2025 Published online: November 15, 2025 Processing time: 130 Days and 22.7 Hours
Abstract
BACKGROUND
The early diagnosis rate of pancreatic ductal adenocarcinoma (PDAC) is low and the prognosis is poor. It is important to develop an interpretable noninvasive early diagnostic model in clinical practice.
AIM
To develop an interpretable noninvasive early diagnostic model for PDAC using plasma extracellular vesicle long RNA (EvlRNA).
METHODS
The diagnostic model was constructed from plasma EvlRNA data. During model construction, the EvlRNA-index was introduced, and four algorithms were used to calculate it. The performance of the constructed model was then evaluated. A series of bioinformatics methods were used to explore the potential mechanism of the EvlRNA-index as the input feature of the model, and the relationship between key features and PDAC was explored at the single-cell level.
RESULTS
A novel interpretable machine learning framework based on plasma EvlRNA was developed, in which a two-layer classifier was established. A new concept, the EvlRNA-index, was proposed, and a cancer diagnostic model built on the EvlRNA-index achieved good diagnostic performance. In the first layer of the model, the accuracy of PDACandCPvsHealth-Probabilistic PCA Index-SVM (PDAC and chronic pancreatitis vs health-probabilistic principal component analysis index-support vector machine) (1-18) was 91.51%, with a Matthews correlation coefficient of 0.7760 and an area under the curve of 0.9560. In the second layer of the model, the accuracy of PDACvsCP-Probabilistic PCA Index-RF (PDAC vs chronic pancreatitis-probabilistic principal component analysis index-random forest) (2-17) was 93.83%, with a Matthews correlation coefficient of 0.8422 and an area under the curve of 0.9698. Forty-nine PDAC-related genes were identified, 16 of which were already known; from this we infer that the remaining genes may also be PDAC-related.
CONCLUSION
An interpretable two-layer machine learning framework was proposed for early diagnosis and prediction of PDAC based on plasma EvlRNA, providing new insights into the clinical value of EvlRNA.
Core Tip: The early diagnosis rate of pancreatic ductal adenocarcinoma is low and the prognosis is poor. It is important to develop an interpretable noninvasive early diagnostic model in clinical practice. In this study, an interpretable two-layer machine learning framework was proposed for the early diagnosis and prediction of pancreatic ductal adenocarcinoma based on plasma extracellular vesicle long RNA. This study provides new insights into the clinical value of extracellular vesicle long RNA for promoting the development of precision medicine.
Pancreatic cancer is a malignant neoplasm that arises primarily from the pancreatic duct epithelium and acinar cells. It is highly aggressive, and its insidious onset makes early diagnosis challenging. The disease progresses rapidly, and patients typically have a short survival period[1]. Pancreatic cancer has one of the poorest prognoses of all malignant tumors and is often referred to as “the king of cancers”. Pancreatic ductal adenocarcinoma (PDAC), which arises in the ductal epithelium of the pancreas, is the main type of pancreatic cancer, accounting for > 90% of cases[2]. PDAC has an insidious onset and is highly aggressive, and most patients are diagnosed at an advanced stage[3]. Although early-stage cancers can be effectively treated with surgery and radiation, late-stage cancers often cannot be controlled. The emergence of cancer is a multifactorial, multistage, complex and progressive process. During disease progression, early screening, early diagnosis, early treatment, and the management of cancer as a chronic disease are the most effective ways to improve the cure rate, reduce the pain of treatment, improve prognosis and reduce the economic burden. Therefore, early diagnosis is crucial for the successful treatment of cancer.
Although some existing techniques have been applied to the early diagnosis of PDAC, the overall effect is still far from expectations. For example, although carbohydrate antigen 19-9 (CA19-9) level is helpful for the prediction and efficacy judgment of pancreatic cancer, its sensitivity and specificity are low [sensitivity of 75.4% and a specificity of 77.6% for differentiation between malignant and non-malignant forms of cancer; the specificity of distinction between PDAC and chronic pancreatitis (CP) often does not exceed 60%][4,5]. CA19-9 levels may also be elevated in cases of biliary tract infection, cholangitis, bile duct obstruction, or jaundice[6-8]. For Lewis antigen negative patients, CA19-9 levels usually do not increase[9]. Therefore, there is an urgent need to find new diagnostic biomarkers for PDAC, especially those liquid biopsy biomarkers suitable for early detection and diagnosis of PDAC.
Plasma extracellular vesicles (EVs) are one of the important materials for liquid biopsy. EVs, which include exosomes and microvesicles, are a special class of membrane-bound, nanosized vesicles secreted by most cell types. EVs contain a variety of molecular components (including RNA, proteins, lipids, and metabolites) that reflect the type of cell from which they are derived[10]. Initially, EVs were considered to be cellular waste. However, EVs are now widely acknowledged as crucial mediators of intercellular communication and as biomarkers for disease. EVs are associated with most pathological conditions, including cancer[11], and cardiovascular[12], neurological[13], and infectious[14] diseases. EVs contain and stabilize various types of RNA[15]. In EVs, microRNA (miRNA) has been well characterized and investigated[16]. Nevertheless, the small number of miRNAs in EVs and the lack of specificity in their production limit their wide application. Growing evidence shows that plasma EV long RNA (EvlRNA), including mRNA, long non-coding RNA (lncRNA) and circular RNA, has functional and clinical significance[17,18]. For example, androgen receptor splice variant 7 can be detected in the blood EVs of patients with castration-resistant prostate cancer and can be used as a predictive biomarker of hormone therapy resistance[19]. In patients with melanoma and non-small cell lung cancer, CD274 mRNA in plasma-derived EVs is associated with anti-programmed death-1 antibody response[20]. Nabet et al[21] found that unshielded RN7SL1 can be transferred into breast cancer cells via EVs and activate the pattern recognition receptor retinoic acid-inducible gene I, promoting cancer invasion. These findings reveal that plasma EVs are rich in valuable and functional EvlRNAs. Therefore, it is feasible to identify tumor-specific genes in the plasma EvlRNA library for early cancer diagnosis. This is a non-invasive strategy for early diagnosis, detection, and treatment evaluation of human cancer.
In recent years, the increasing availability of clinical data to support diagnosis and prognosis, together with advances in science and technology, has made it possible to study cancer using high-throughput biomedical data. However, because of the complexity of cancer, clinical bioinformatic analysis and genetic interpretation remain challenging, and these data have yet to be fully explored. At present, most machine learning algorithms, especially complex algorithms that rely on neural networks, classify well but behave as “black box” models: it is difficult to extract the knowledge learned during training[22,23]. For biological problems, it is necessary not only to build a high-performance diagnostic model that meets the prediction needs, but also to identify the rules the model learned during training, visualize the corresponding important feature weights, and explore how the important features relate to actual biological processes, so that biomedical researchers can understand the model. Designing an interpretable model and visualizing its important feature weights has therefore become an active and difficult topic in machine learning in recent years.
In view of these challenges, we propose an interpretable machine learning framework for early diagnostic prediction of PDAC based on plasma EvlRNA. In this framework, we build on our previous research ideas[24] to establish a two-layer classifier. The first layer distinguishes normal from non-normal samples, and the second layer identifies whether a non-normal sample belongs to PDAC or CP. We propose a new concept, the EvlRNA-index, and establish a cancer diagnosis model based on it that achieves good diagnostic performance. We also examine the interpretability of the entire machine learning framework and explore the correlation between important features and actual biological processes, in order to help biomedical researchers understand the model.
MATERIALS AND METHODS
Dataset
The dataset S used in this study was collected from the research of Yu et al[25], which was derived from multiple centers. The dataset can be formulated as $S = S_{\text{non-normal}} \cup S_{\text{normal}}$, where $S_{\text{non-normal}}$ is the non-normal dataset consisting of PDAC and CP samples, $S_{\text{normal}}$ is the normal dataset with normal samples only, and $\cup$ is the symbol for union in set theory. The non-normal dataset can be further divided into two categories: $S_{\text{non-normal}} = S_{\text{non-normal},1} \cup S_{\text{non-normal},2}$, where the subscripts 1 and 2 represent PDAC and CP, respectively. The dataset S was obtained by next-generation sequencing of plasma EvlRNA. It contained 501 records in total, comprising 284 patients with PDAC, 100 patients with CP and 117 healthy subjects (Table 1). For each record, we obtained 54148 explanatory variables and one response variable. The plasma EvlRNA expression data had been normalized to transcripts per kilobase million (TPM).
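For readers unfamiliar with the normalization mentioned above, the following is a minimal sketch of how TPM values are commonly computed for one sample. The data used in this study were already normalized upstream, so this is illustrative only; the function name `tpm` and the toy counts and lengths are assumptions, not values from the dataset.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts of one sample to TPM (illustrative helper)."""
    # Step 1: reads per kilobase, i.e. count divided by transcript length in kb.
    rpk = [count / length for count, length in zip(counts, lengths_kb)]
    total = sum(rpk)
    if total == 0:
        return [0.0 for _ in rpk]
    # Step 2: rescale so that the values of the sample sum to one million.
    return [value / total * 1_000_000 for value in rpk]


# Toy example: three transcripts with identical length-normalized abundance.
example = tpm([100, 200, 300], [1.0, 2.0, 3.0])
```

Because every sample sums to the same total after this step, TPM values are comparable across samples in terms of relative abundance.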
Feature selection is important for building an effective model. We combined mean decrease in accuracy (MDA) with the DESeq2 and edgeR methods to screen important features. Figure 1 shows the feature selection workflow. MDA is the average decrease in classification accuracy on the “out of bag” samples when the values of a particular feature are randomly permuted. MDA was calculated using the randomForest package[26] in R (http://cran.r-project.org//). DESeq2 and edgeR were run with the DESeq2 package[27] and the edgeR package[28], respectively.
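The permutation idea behind MDA can be sketched as follows. This is a simplified illustration rather than the randomForest implementation: it permutes each feature on a fixed evaluation set instead of per-tree out-of-bag samples, and `toy_predict` is a hypothetical stand-in for a fitted model.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(predict, X, y, n_repeats=20, seed=0):
    """Average accuracy drop per feature when that feature is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: feature 0 determines the label, feature 1 is unrelated noise.
X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [row[0] for row in X]
toy_predict = lambda row: row[0]  # hypothetical stand-in for a trained classifier
mda = mean_decrease_accuracy(toy_predict, X, y)
```

A feature the model does not use has an importance of zero, while shuffling an informative feature lowers accuracy and yields a positive value.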
To calculate the EvlRNA-index, we applied four algorithms to each sample (Figure 2). For a given algorithm, each sample is scored by 10 EvlRNA-indexes, so each sample yields a 1 × 10 matrix (EvlRNA-index1, EvlRNA-index2, EvlRNA-index3, …, EvlRNA-index8, EvlRNA-index9, EvlRNA-index10). The four algorithms were singular value decomposition (SVD) principal component analysis (PCA), nonlinear iterative partial least squares (Nipals) PCA, probabilistic PCA and FastHCS (high-dimensional congruent subsets).
Figure 2 Extracellular vesicle long RNA-index calculation of plasma extracellular vesicle long RNA samples.
EvlRNA: Extracellular vesicle long RNA; SVD: Singular value decomposition; PCA: Principal component analysis; FastHCS: High-dimensional congruent subsets; Nipals: Nonlinear iterative partial least squares.
SVD PCA is a conventional PCA algorithm[29]. Nipals[30] is an algorithm at the root of partial least squares regression that can perform PCA with missing values by simply leaving them out of the appropriate inner products. It tolerates small amounts (generally not more than 5%) of missing data. Probabilistic PCA[31] combines an expectation maximization approach for PCA with a probabilistic model. FastHCS[32] is a robust PCA algorithm suitable for high-dimensional applications, including cases where the number of variables exceeds the number of observations. SVD PCA, Nipals PCA and probabilistic PCA were implemented through the pcaMethods package. FastHCS was implemented through the FastHCS package.
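To make the idea of a per-sample index concrete, the sketch below computes component scores with a plain power-iteration PCA and deflation. It is an assumption-laden illustration: it is not the pcaMethods or FastHCS code, it handles no missing values, and the function names and toy matrix are hypothetical. In the study each sample would receive 10 scores; the toy example uses 2.

```python
import math


def center(X):
    """Subtract the column mean from every column."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
            for a in range(p)]


def matvec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def leading_eigenvector(C, n_iter=500):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    v = [1.0 + 0.1 * j for j in range(len(C))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_scores(X, n_components):
    """Return (scores, loadings); scores[i][k] is index k of sample i."""
    Xc = center(X)
    C = covariance(Xc)
    loadings = []
    for _ in range(n_components):
        v = leading_eigenvector(C)
        eigenvalue = sum(a * b for a, b in zip(v, matvec(C, v)))
        loadings.append(v)
        # Deflation: remove the component just found before searching for the next.
        C = [[C[a][b] - eigenvalue * v[a] * v[b] for b in range(len(v))]
             for a in range(len(v))]
    scores = [[sum(x * w for x, w in zip(row, v)) for v in loadings]
              for row in Xc]
    return scores, loadings


# Toy matrix: 5 samples x 2 features lying close to the line y = 2x.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
scores, loadings = pca_scores(X, n_components=2)
```

Each row of `scores` plays the role of the 1 × 10 index vector described above, and each entry of `loadings` gives the weight of a feature in the corresponding index.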
Two-layer classifier
To enable the first-layer classifier to distinguish samples as either normal or non-normal, we selected four machine learning approaches: Support vector machine (SVM), random forest (RF), deep learning (DL), and extreme gradient boosting (XGBoost). In R, SVM was implemented with the e1071 package (http://cran.r-project.org//), RF with the randomForest package, DL with the h2o package, and XGBoost with the XGBoost package. Each of the four algorithms was evaluated as the base classifier of the first layer, and the final choice was made according to performance. The second-layer classifier identified whether a sample belonged to PDAC or CP. The same four methods (SVM, RF, DL and XGBoost) were evaluated as the base classifier of the second layer, again choosing according to performance. Figure 3 shows the diagnostic model construction flowchart.
Figure 3 Diagnostic model construction flowchart.
NGS: Next-generation sequencing; EvlRNA: Extracellular vesicle long RNA; PDAC: Pancreatic ductal adenocarcinoma; CP: Chronic pancreatitis.
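The two-layer decision logic described in this section can be sketched as a simple cascade. The two callables below are hypothetical threshold rules standing in for the trained first-layer and second-layer models; the field names `index1` and `index2` are likewise assumptions.

```python
def two_layer_predict(sample, layer1, layer2):
    """Cascade: layer1 flags non-normal samples, layer2 separates PDAC from CP."""
    if not layer1(sample):
        return "normal"
    return "PDAC" if layer2(sample) else "CP"


# Hypothetical stand-ins for the fitted classifiers.
layer1 = lambda s: s["index1"] > 0.0   # True means non-normal
layer2 = lambda s: s["index2"] > 0.5   # True means PDAC

samples = [
    {"index1": -1.2, "index2": 0.9},
    {"index1": 0.8, "index2": 0.1},
    {"index1": 1.5, "index2": 0.7},
]
labels = [two_layer_predict(s, layer1, layer2) for s in samples]
```

Only samples flagged as non-normal by the first layer reach the second layer, so an error in the first layer cannot be corrected downstream.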
Evaluating performance
After the models were built, classifier performance was evaluated using sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC). The formulas for these four metrics are detailed in the Supplementary material. To compare the overall performance of the models, the area under the receiver operating characteristic (ROC) curve (AUC), which ranges from 0 to 1, was calculated. The ROC curve depicts the relationship between the true positive rate and the false positive rate. The AUC represents the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample; a higher AUC indicates better predictive performance. Five-fold cross-validation was used to assess the performance of the model. To further evaluate the models, we established an independent dataset consisting of 35 normal samples randomly selected from the 117 normal samples and 115 non-normal samples randomly selected from the 384 non-normal samples. These samples were not used in model training, feature selection, or parameter optimization.
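The metrics named above can be sketched with their standard definitions, which are assumed here to match the formulas in the Supplementary material. The confusion-matrix counts are hypothetical: they were chosen so that the resulting rates are close to those reported for model 1-18 on the training data, but the article does not report a confusion matrix. The AUC function uses the pairwise-ranking interpretation given in the text.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts, not taken from the article.
metrics = confusion_metrics(tp=75, fn=6, tn=22, fp=3)
toy_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which makes it more informative when the classes are imbalanced, as they are here.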
Data exploration and functional analysis
The t-distributed stochastic neighbor embedding (t-SNE) analysis was performed with the Rtsne package in R, and PCA with the FactoMineR package in R. The biological significance of the genes was determined through functional enrichment analysis with DAVID[33]. P values were adjusted using the Benjamini-Hochberg method, and an adjusted P < 0.05 was taken as the significance threshold. Protein-protein interaction analysis was performed with the STRING database (https://string-db.org/)[34]. Pathway analysis was performed with the Reactome Knowledgebase (https://reactome.org)[35]. Single-cell RNA expression analysis was performed using TISCH2 (Tumor Immune Single-cell Hub 2, http://tisch.comp-genomics.org/), a single-cell RNA expression database that focuses on the tumor microenvironment[36].
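The Benjamini-Hochberg correction applied to the enrichment P values can be sketched as below. This is the textbook step-up procedure and is not taken from the DAVID implementation; the input P values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P values for four enrichment terms.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in adjusted]
```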
Survival outcome assessments, including overall survival (OS) and disease-free survival (DFS), were conducted through Kaplan-Meier survival curves complemented by log-rank test comparisons. Cohort stratification according to gene expression levels (high vs low) was determined using median expression values as the cutoff criterion. Statistical calculations encompassed hazard ratio quantification with corresponding 95% confidence intervals. All analytical procedures were executed via the GEPIA bioinformatics platform (http://gepia.cancer-pku.cn/)[37].
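As a rough illustration of the survival analysis described above, the sketch below splits a cohort at the median expression value and computes a Kaplan-Meier curve. It is not the GEPIA implementation; how GEPIA treats values exactly equal to the median is an assumption here, and the log-rank test and hazard ratio are omitted.

```python
import statistics


def median_split(expression):
    """Label each sample 'high' or 'low' relative to the median (ties go to 'low')."""
    cutoff = statistics.median(expression)
    return ["high" if value > cutoff else "low" for value in expression]


def kaplan_meier(times, events):
    """Return [(time, survival)] at each event time; events: 1 = event, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy cohort: three subjects, the second one censored.
groups = median_split([1.0, 2.0, 3.0, 4.0])
curve = kaplan_meier([1, 2, 3], [1, 0, 1])
```

In practice one curve would be computed per expression group and the two curves compared with a log-rank test.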
RESULTS
Identification of potential biomarkers
Figure 1 shows the identification process for potential biomarkers. The EvlRNA differences between PDAC and CP, PDAC and normal, and CP and normal were analyzed by DESeq2 (false discovery rate < 0.05, log2 fold change > 0.5), yielding the sets PDACvsCP_By_DEseq2, PDACvsNormal_By_DEseq2 and CPvsNormal_By_DEseq2. The same three comparisons were also analyzed by edgeR (false discovery rate < 0.05, log2 fold change > 0.5), yielding the sets PDACvsCP_By_edger, PDACvsNormal_By_edger and CPvsNormal_By_edger. For each comparison, the intersection of the DESeq2 and edgeR results was then taken.
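The thresholding and intersection step can be sketched as follows. The gene names and statistics are invented, and applying the fold-change cutoff to the absolute value is an assumption made because both up-regulated and down-regulated genes are reported later; the source states only "log2fold change > 0.5".

```python
def significant_genes(results, fdr_cutoff=0.05, lfc_cutoff=0.5):
    """Genes passing the FDR and (absolute) log2 fold change thresholds."""
    return {gene for gene, (fdr, lfc) in results.items()
            if fdr < fdr_cutoff and abs(lfc) > lfc_cutoff}


# Invented results for one comparison: gene -> (FDR, log2 fold change).
deseq2_results = {"geneA": (0.01, 1.2), "geneB": (0.20, 2.0),
                  "geneC": (0.03, -0.8), "geneD": (0.04, 0.3)}
edger_results = {"geneA": (0.02, 1.1), "geneB": (0.01, 1.9),
                 "geneC": (0.04, -0.7), "geneD": (0.01, 0.9)}

# Keep only genes called significant by both methods.
consensus = significant_genes(deseq2_results) & significant_genes(edger_results)
```

Requiring agreement between two methods trades some sensitivity for a lower risk of method-specific false positives.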
To obtain cancer-associated EvlRNA biomarkers, these EvlRNAs were combined with RNA profiles from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) (178 PDAC tissue samples and 171 normal pancreatic tissue samples). The PDAC vs CP comparison yielded 1623 differential genes (set 1), PDAC vs normal yielded 1376 differential genes (set 2), and CP vs normal yielded 326 differential genes (set 3) (Figure 1). Set 1 and set 2 contained seven of the genes finally screened by Yu et al[25], supporting the reliability of our analysis. The three sets were further intersected, and 96 genes were obtained. Finally, MDA was used for feature selection, and the 20 genes selected by MDA were used for further analysis (Figure 4A). These genes include well-known cancer-related genes, such as S100A9[38], P2RX1[39], and PTPRJ[40]. Heat map visualization showed significant differences in EvlRNA expression among PDAC, CP and normal samples (Figure 4B). However, unsupervised learning did not separate the three types of samples well. Figure 4C shows the results of PCA and t-SNE analysis of the 20 identified genes based on EvlRNA expression. PCA and t-SNE analysis were also performed on the eight genes screened in the study of Yu et al[25], and the results were similar to those of this study (Figure 4D). Figure 4E shows the expression differences of these 20 genes in pancreatic cancer: some were significantly up-regulated and others significantly down-regulated, whereas the eight genes screened in Yu et al’s study[25] were all significantly up-regulated.
Gene Ontology (GO) analysis of these genes showed that apoptotic process (GO: 0006915), regulation of apoptotic signaling pathway (GO: 2001233), regulation of apoptotic process (GO: 0042981), positive regulation of apoptotic signaling pathway (GO: 2001235), and negative regulation of apoptotic process (GO: 0043066) were significantly enriched, which means that these genes play a key role in regulating apoptotic signaling pathway (Figure 4F). Other enriched GO entries were associated with programmed cell death (GO: 0012501), regulation of growth (GO: 0040008), cell death (GO: 0008219), positive regulation of tumor necrosis factor production (GO: 0032760), regulation of programmed cell death (GO: 0043067), and inflammatory response (GO: 0006954) (Figure 4F), which play an important role in the development of cancer.
Figure 4 Analysis of potential biomarkers and visualization.
A: Feature (gene set) selection with mean decrease in accuracy; B: Heatmap analysis of the selected biomarkers (20 genes); C: Principal component analysis and t-distributed stochastic neighbor embedding analysis based on the extracellular vesicle long RNA expression of the selected biomarkers (20 genes); D: Principal component analysis and t-distributed stochastic neighbor embedding analysis based on the extracellular vesicle long RNA expression of the eight genes screened by Yu et al[25]; E: Gene expression analysis of potential biomarkers was performed using RNA-seq data from The Cancer Genome Atlas and Genotype-Tissue Expression databases, which included 178 pancreatic ductal adenocarcinoma tissue samples and 171 normal pancreatic tissue samples; F: Gene Ontology analysis of 20 genes (potential biomarkers). PDAC: Pancreatic ductal adenocarcinoma; CP: Chronic pancreatitis; PCA: Principal component analysis; t-SNE: T-distributed stochastic neighbor embedding; FPKM: Fragments per kilobase of exon per million mapped fragments; GO: Gene Ontology; RAGE: Receptor for advanced glycation end products; BP: Biological process; CC: Cellular component; MF: Molecular function.
Analysis and visualization of EvlRNA-index
After calculating the EvlRNA-index (Figure 2), we studied the relationships among the indexes with a correlation analysis using the corrplot package. Figure 5A-D shows the correlation analysis of the EvlRNA-index calculated with the SVD PCA, Nipals PCA, probabilistic PCA, and FastHCS algorithms, respectively. The indexes were not correlated with each other, indicating that they were not redundant as input features of the model. We visualized the top three indexes for each sample; visual inspection alone could not accurately distinguish PDAC/CP from normal samples (Figure 5E-H).
Figure 5 Correlation analysis and visualization of extracellular vesicle long RNA-index calculated by different algorithms.
A: Correlation analysis of extracellular vesicle long RNA (EvlRNA)-index calculated based on singular value decomposition principal component analysis (PCA) algorithm; B: Correlation analysis of EvlRNA-index calculated based on nonlinear iterative partial least squares PCA algorithm; C: Correlation analysis of EvlRNA-index calculated based on Probabilistic PCA algorithm; D: Correlation analysis of EvlRNA-index calculated based on FastHCS algorithm; E: Visualization of EvlRNA-index calculated by singular value decomposition PCA algorithm; F: Visualization of EvlRNA-index calculated by nonlinear iterative partial least squares PCA algorithm; G: Visualization of EvlRNA-index calculated by Probabilistic PCA algorithm; H: Visualization of EvlRNA-index calculated by FastHCS algorithm. SVD: Singular value decomposition; PCA: Principal component analysis; EvlRNA: Extracellular vesicle long RNA; PDAC: Pancreatic ductal adenocarcinoma; CP: Chronic pancreatitis; FastHCS: High-dimensional congruent subsets; Nipals: Nonlinear iterative partial least squares.
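The redundancy check behind Figure 5A-D amounts to a pairwise Pearson correlation matrix over the index columns. The sketch below is a plain-Python illustration with made-up columns, not the corrplot workflow used in the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations between index columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Made-up index columns: the first two are redundant, the third is uncorrelated.
index_columns = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, -1, 1]]
corr = correlation_matrix(index_columns)
```

Off-diagonal values near zero indicate that each index carries information the others do not, which is the property reported for the EvlRNA-index.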
First classifier-identifying normal or non-normal
The first layer of the classifier identifies whether the sample is normal or non-normal. Using RF, SVM, DL, and XGBoost, first-layer models were built with each algorithm on five feature sets: The biomarkers selected by MDA, and the EvlRNA-index calculated with conventional_SVD_PCA, FastHCS, Nipals_PCA, and Probabilistic_PCA, respectively. Table 2 shows the performance of the first-layer models. The accuracy of PDACandCPvsHealth-MDA-SVM (1-2) was 90.57%, with sensitivity 93.83%, specificity 80.00%, MCC 0.7383, and AUC 0.9146. The accuracy of PDACandCPvsHealth-SVD PCA Index-RF (1-5) was 96.23%, with sensitivity 100.00%, specificity 84.00%, MCC 0.8947, and AUC 0.9901. The accuracy of PDACandCPvsHealth-SVD PCA Index-XGB (1-8) was 91.51%, with sensitivity 98.77%, specificity 68.00%, MCC 0.7549, and AUC 0.9294. The accuracy of PDACandCPvsHealth-Nipals PCA Index-DL (1-15) was 90.57%, with sensitivity 95.06%, specificity 76.00%, MCC 0.7319, and AUC 0.9679. The accuracy of PDACandCPvsHealth-Nipals PCA Index-XGB (1-16) was 91.51%, with sensitivity 98.77%, specificity 68.00%, MCC 0.7549, and AUC 0.9294. The accuracy of PDACandCPvsHealth-Probabilistic PCA Index-RF (1-17) was 93.40%, with sensitivity 96.30%, specificity 84.00%, MCC 0.8145, and AUC 0.9763. The accuracy of PDACandCPvsHealth-Probabilistic PCA Index-SVM (1-18) was 91.51%, with sensitivity 92.59%, specificity 88.00%, MCC 0.7760, and AUC 0.9560. In the first layer of the classifier, the models based on the EvlRNA-index performed better than the model based on the genes screened by MDA. Among all models, PDACandCPvsHealth-SVD PCA Index-RF (1-5) showed the best performance on the training dataset, with accuracy 96.23%, MCC 0.8947, and AUC 0.9901 (Table 2). However, its predictive ability was poor in the independent dataset test and in five-fold cross-validation (Tables 3 and 4).
Among all models, the PDACandCPvsHealth-Probabilistic PCA Index-SVM (1-18) performed best in terms of internal stability and external predictability with accuracy 91.51%, MCC 0.7760, and AUC 0.9560 (on training dataset) (Table 2, Figure 6A), and accuracy 93.33%, MCC 0.8137, and AUC 0.9717 (on the independent dataset) (Table 3, Figure 6B). In the five-fold cross-validation, accuracy was 91.19%, MCC 0.7430, and AUC 0.9389 (Table 4).
Figure 6 Receiver operating characteristic curve of pancreatic ductal adenocarcinoma and chronic pancreatitis vs health-probabilistic principal component analysis index-support vector machine (1-18).
A: Based on training dataset; B: Based on the independent dataset. AUC: Area under the receiver operating characteristic curve.
Table 2 Performance of the first-layer models on training dataset.
To validate our approach, we evaluated the models on an independent dataset. This dataset contained 150 samples in total, 115 non-normal (85 PDAC and 30 CP) and 35 normal, and was used to assess the predictive power of the models. The models all performed well, indicating their reliability (Table 3). Five-fold cross-validation was also performed (Table 4). Considering performance on both the training and independent datasets, we chose PDACandCPvsHealth-Probabilistic PCA Index-SVM (1-18) as the first-layer classifier of our method.
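The hold-out and cross-validation design can be sketched as below, using the sample counts stated in the text (35 of 117 normal and 115 of 384 non-normal samples held out). The sample identifiers, the random seed, and the choice to draw the five folds from the remaining samples without stratification are assumptions; the article does not specify these details.

```python
import random


def holdout_split(ids, n_holdout, rng):
    """Randomly set aside n_holdout items; return (remaining, held_out)."""
    shuffled = ids[:]
    rng.shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def k_folds(ids, k, rng):
    """Shuffle and deal the items into k folds of nearly equal size."""
    shuffled = ids[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


rng = random.Random(42)                           # arbitrary seed
normal = [f"N{i}" for i in range(117)]            # hypothetical sample IDs
non_normal = [f"D{i}" for i in range(384)]

train_normal, test_normal = holdout_split(normal, 35, rng)
train_non_normal, test_non_normal = holdout_split(non_normal, 115, rng)
folds = k_folds(train_normal + train_non_normal, 5, rng)
```

In each cross-validation round, one fold serves as the validation set and the other four are used for fitting, while the held-out samples are never touched until the final evaluation.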
Second classifier-identifying PDAC or CP
The second layer of the classifier distinguishes PDAC from CP. Four machine learning algorithms, RF, SVM, DL, and XGBoost, were used to implement the second classifier. The statistics for the second-layer models are summarized in Table 5. In the second layer, the models based on the EvlRNA-index performed better than the models based on the genes screened by MDA, indicating that the EvlRNA-index is a suitable input feature. The accuracy of PDACvsCP-Probabilistic PCA Index-RF (2-17) was 93.83%, with sensitivity 95.00%, specificity 90.48%, MCC 0.8422, and AUC 0.9698 (Figure 7). The accuracy of PDACvsCP-Probabilistic PCA Index-SVM (2-18) was 96.30%, with sensitivity 100.00%, specificity 85.71%, MCC 0.9035, and AUC 0.9833. In the independent dataset tests, PDACvsCP-Probabilistic PCA Index-RF (2-17) performed better than PDACvsCP-Probabilistic PCA Index-SVM (2-18) (Table 6). The combination of the probabilistic PCA-based EvlRNA-index and RF is therefore a good choice, with high internal stability and strong external predictive ability. In five-fold cross-validation, PDACvsCP-Probabilistic PCA Index-RF (2-17) had the best performance among all the models, with an accuracy of 98.13% (Table 7).
Figure 7 Receiver operating characteristic curve of pancreatic ductal adenocarcinoma vs chronic pancreatitis-probabilistic principal component analysis index-random forest (2-17).
A: Based on training dataset; B: Based on the independent dataset. AUC: Area under the receiver operating characteristic curve.
Table 5 Performance of the second-layer models on training dataset.
Among all models, PDACandCPvsHealth-Probabilistic PCA Index-SVM (1-18) in the first layer and PDACvsCP-Probabilistic PCA Index-RF (2-17) in the second layer showed the best diagnostic performance. To explore the potential mechanism of the EvlRNA-index as the input feature of the diagnostic model and to improve the interpretability of the model, the five genes with the largest weights in the calculation of each probabilistic PCA-based EvlRNA-index were extracted, giving 10 × 5 genes (Figure 8A). After removing duplicate genes and converting gene names, 49 genes were obtained. The gene types comprised protein coding (51.02%), pseudogene (26.53%), lncRNA (18.37%) and others (4.08%) (Figure 8B). One gene in this set of 49, MAL2, is also present in the gene set of Yu et al[25]. Through database and literature searches, we found that this gene set contained 16 known PDAC-related genes (Figure 8C), including ANXA4[41], PF4[42], MUC3A[43], MAL2[44,45], EIF4G2[46], NEAT1[47-49], MALAT1[50-52], NRGN[53], SCARNA10[54], GAPDH[55], TUBA1C[56], CALM1[57], DUOX2[58,59], FXYD3[60-62], LGALS4[63,64], and LENG8[53]. We therefore infer that the remaining genes may also be PDAC-related.
Figure 8 Extraction and classification of important genes.
A: The top five genes in terms of weight in each probabilistic principal component analysis-based extracellular vesicle long RNA index calculation process; B: Classification of gene types; C: The 49 pancreatic ductal adenocarcinoma-related genes identified in this study compared with those reported in the literature. PDAC: Pancreatic ductal adenocarcinoma; lncRNA: Long non-coding RNA.
To verify the relationship between these 49 genes and PDAC, RNA-seq data (178 PDAC tissue samples and 171 normal pancreatic tissue samples) were downloaded from the TCGA and GTEx databases for analysis. Most of the genes were significantly up-regulated or down-regulated in cancer samples (Figure 9A). Because expression at the protein level more closely reflects the manifestations of the disease, we further examined protein expression in PDAC using the Clinical Proteomic Tumor Analysis Consortium (CPTAC) dataset. Protein expression of most of the protein-coding genes was significantly up-regulated or down-regulated in PDAC tissues (Figure 9B), which supports the reliability of these results.
Figure 9 Differential expression of selected genes in normal and pancreatic ductal adenocarcinoma tissues.
A: Expression of selected genes based on RNA-seq data from The Cancer Genome Atlas and Genotype-Tissue Expression databases; B: Protein expression of protein-coding genes in normal and primary pancreatic ductal adenocarcinoma tissues based on the Clinical Proteomic Tumor Analysis Consortium dataset. FPKM: Fragments per kilobase of exon per million mapped fragments; NA: Not available. aP < 0.05, bP < 0.01, and cP < 0.001.
GO analysis showed that the terms enriched for these genes included molecular function regulator activity (GO: 0098772), plasma membrane (GO: 0005886), vesicle (GO: 0031982), nucleus (GO: 0005634), miRNA inhibitor activity via base-pairing (GO: 0140869), lncRNA-mediated post-transcriptional gene silencing (GO: 0000512), regulation of miRNA catabolic process (GO: 2000625), and gene expression (GO: 0010467) (Figure 10A), all of which play an important role in the development of cancer. To better understand the interrelationships among the protein-coding genes, protein-protein interaction analysis was used to map the interactions between their proteins (Figure 10B). Pathway analysis showed that the enriched pathways included cAMP responsive element binding protein 1 (CREB1) phosphorylation through the activation of adenylate cyclase, protein kinase A activation, and protein kinase A-mediated phosphorylation of CREB, among others (Figure 10C); these pathways are closely related to cancer.
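This section does not state which statistical test underlies the GO enrichment. A common choice for this kind of over-representation analysis is the one-sided hypergeometric test, sketched below as an assumption rather than as the procedure actually used. The function name and all counts in the example call are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """One-sided hypergeometric P value, P(X >= n_overlap).

    n_background: genes in the background set
    n_annotated:  background genes annotated with the GO term
    n_selected:   genes in the list being tested (e.g., a 49-gene list)
    n_overlap:    selected genes annotated with the GO term
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, chosen only to show the call.
p = enrichment_p_value(n_background=20000, n_annotated=500, n_selected=49, n_overlap=6)
print(f"P = {p:.3g}")
```

In practice the P values for all tested terms would also be corrected for multiple testing before terms are reported as enriched.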
We conducted survival analysis on these 49 genes and found that high expression of some of them was linked to the survival prognosis of PDAC. Specifically, high expression of CCDC13-AS1 (OS and DFS), LENG8 (OS) and NRGN (OS and DFS) was correlated with a more favorable outcome for PDAC patients. In contrast, high expression of PTP4A2 (OS), OST4 (OS), MAL2 (OS and DFS), GAPDH (DFS), TUBA1C (OS and DFS) and DUOX2 (OS) was associated with a poor outcome. The Kaplan-Meier plots of these genes are shown in Figure 11. These results suggest that these biomarkers may be relevant to both the diagnosis and the prognosis of patients with PDAC.
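For readers unfamiliar with the Kaplan-Meier curves referred to above, the estimator can be sketched as follows. This is a generic illustration of the product-limit method, not the tool used to produce Figure 11, and the follow-up times and event flags are made-up toy values. In the study setting, one such curve would be computed for the high-expression group and one for the low-expression group of each gene.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up durations
    events: 1 if the event (e.g., death or recurrence) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events observed at time t
        leaving = 0   # subjects leaving the risk set at time t (events + censored)
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy follow-up data (hypothetical): five patients, two of them censored.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

The curve drops only at times where an event was observed; censored patients reduce the number at risk without lowering the estimate.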
Figure 11 Analysis of overall survival and disease-free survival.
TPM: Transcripts per million; HR: Hazard ratio.
Single-cell RNA expression analysis in PDAC tumor microenvironment
To further investigate the potential mechanism of the EvlRNA-index as an input feature of the model and the possible pathways by which it affects the tumor microenvironment, we explored the expression of the 49 identified genes (the EvlRNA signature) at the single-cell level. Two single-cell RNA-seq datasets (CRA001160[65] and GSE154778[66]) were used to determine the expression level of the EvlRNA signature in immune cells. The EvlRNA signature was mainly expressed in malignant cells in the PDAC tumor microenvironment (Figure 12), which further supports the reliability of the previous experimental results and suggests a possible basis for the accurate classification achieved by the model.
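One simple way to summarize a gene signature at the single-cell level is to average the expression of the signature genes within each cell and then average those scores within each annotated cell type. The sketch below illustrates this idea only; the scoring rule is an assumption (the method used for Figure 12 is not described in this section), and the cells, gene names, and expression values are hypothetical.

```python
from collections import defaultdict


def signature_score_by_cell_type(cells, signature):
    """Mean signature score per cell type.

    cells:     iterable of (cell_type, {gene: expression}) pairs
    signature: list of gene names; genes missing from a cell count as 0
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell_type, expression in cells:
        score = sum(expression.get(g, 0.0) for g in signature) / len(signature)
        sums[cell_type] += score
        counts[cell_type] += 1
    return {cell_type: sums[cell_type] / counts[cell_type] for cell_type in sums}


# Toy cells (placeholders, not data from CRA001160 or GSE154778).
toy_cells = [
    ("Malignant", {"geneA": 4.0, "geneB": 2.0}),
    ("Malignant", {"geneA": 2.0}),
    ("CD8T", {"geneB": 1.0}),
]
scores = signature_score_by_cell_type(toy_cells, signature=["geneA", "geneB"])
print(scores)
```

A higher score for malignant cells than for the other cell types would correspond to the pattern described in the text.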
Figure 12 Analysis of single-cell RNA expression in pancreatic ductal adenocarcinoma tumor microenvironment.
A: Based on CRA001160, number of cells: 57443; B: Based on GSE154778, number of cells: 14953. EvlRNA: Extracellular vesicle long RNA; CD8T: CD8+ T cell; DC: Dendritic cell; B: B cell.
DISCUSSION
In this study, we proposed an interpretable machine learning framework called ECD-itMLF (early cancer diagnosis: an interpretable two-layer machine learning framework with plasma EvlRNA) for early diagnosis and prediction of PDAC based on plasma EvlRNA. This framework builds on our previous research ideas[24] to establish a two-layer classifier. The first layer separates normal from non-normal samples, and the second layer determines whether a sample belongs to PDAC or CP. We also proposed a new concept, the EvlRNA-index, and the cancer diagnosis model built on it achieved good diagnostic performance. In the first layer of the model, the accuracy of PDACandCPvsHealth-Probabilistic PCA Index-SVM (1-18) was 91.51%, with MCC 0.7760 and AUC 0.9560. In the second layer of the model, the accuracy of PDACvsCP-Probabilistic PCA Index-RF (2-17) was 93.83%, with MCC 0.8422 and AUC 0.9698. ECD-itMLF thus differs from traditional black-box models in that its input features can be traced back to weighted genes, while it retains strong diagnostic performance.
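The decision flow of the two-layer classifier can be sketched as below. This is a schematic only: in the framework the two layers are trained models (an SVM in the first layer and an RF in the second), whereas the threshold rules here are hypothetical stand-ins used to keep the example self-contained, and the function and variable names are not taken from the article.

```python
from typing import Callable, Sequence

Features = Sequence[float]  # e.g., a vector of EvlRNA-index values


def two_layer_diagnosis(
    features: Features,
    is_non_normal: Callable[[Features], bool],
    is_pdac: Callable[[Features], bool],
) -> str:
    """Apply layer 1 (normal vs non-normal), then layer 2 (PDAC vs CP)."""
    if not is_non_normal(features):
        return "Normal"
    return "PDAC" if is_pdac(features) else "CP"


# Hypothetical stand-ins for the trained first- and second-layer models.
def layer1(f: Features) -> bool:
    return f[0] > 0.5


def layer2(f: Features) -> bool:
    return f[1] > 0.0


results = [
    two_layer_diagnosis(sample, layer1, layer2)
    for sample in ([0.2, 1.0], [0.9, 1.0], [0.9, -1.0])
]
print(results)
```

Only samples flagged as non-normal by the first layer reach the second layer, so an error in the first layer cannot be corrected downstream.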
In the feature selection process, we ultimately screened 20 genes, none of which overlapped with the eight genes screened by Yu et al[25]. The main reason is that our study considered the differences between CP and normal samples during the screening process. In addition, the feature selection process was more rigorous (Figure 1). A diagnostic model was established using the 20 genes obtained from the final screening. The second layer of the model classified PDAC and CP. The accuracy of PDACvsCP-MDA-SVM (2-2) was 93.83%, with sensitivity 96.67%, specificity 85.71%, MCC 0.8372, and AUC 0.9563. In the study of Yu et al[25], SVM based on the eight selected genes classified PDAC and CP with accuracy 92.70%, sensitivity 93.39%, specificity 85.00%, and AUC 0.9280. Compared with their model, the performance indicators of PDACvsCP-MDA-SVM (2-2) were improved. Several of these 20 genes are well-known cancer-related genes. For example, S100A9 can promote the occurrence, growth and metastasis of tumors by interfering with tumor metabolism and the microenvironment[38]. The loss of P2RX1 modulates immunosuppressive activity in specific neutrophil subpopulations, thereby facilitating hepatic metastatic progression in PDAC[39]. The results of GO analysis showed that these 20 genes play a key role in regulating the apoptotic signaling pathway.
Unlike existing complex machine learning models, the method in this study addresses the problems of model transparency and interpretability to some extent. Traditional black-box models are often difficult to interpret, especially when complex analyses of high-dimensional data are required. The EvlRNA-index proposed in this study not only simplifies the feature space, but also provides an intuitive and easily understood interpretative framework, making the prediction process transparent to non-technical users. More importantly, using the EvlRNA-index as the input feature improved the predictive performance and generalization ability of the model. By carefully designing and calculating the EvlRNA-index, we captured key patterns and trends in the data, thereby improving the accuracy and robustness of the model. Among the EvlRNA-index-based models, in the first layer, the accuracy of PDACandCPvsHealth-Probabilistic PCA Index-SVM (1-18) was 91.51%, with MCC 0.7760 and AUC 0.9560. In the second layer, the accuracy of PDACvsCP-Probabilistic PCA Index-RF (2-17) was 93.83%, with MCC 0.8422 and AUC 0.9698. Compared with the study by Yu et al[25], the performance indicators of PDACvsCP-Probabilistic PCA Index-RF (2-17) were improved.
In this study, the potential mechanism of the EvlRNA-index as an input feature of the diagnostic prediction model was investigated. The EvlRNA-index extracted by probabilistic PCA was used as the input feature, which not only improved the model performance, but also enhanced interpretability through gene weight ranking. The five genes with the highest weights in the calculation of each EvlRNA-index were extracted. The screening of 49 key genes demonstrates the ability to mine core markers from high-dimensional data. The study verified the expression differences of the candidate genes through both RNA-seq (TCGA/GTEx) and proteomics (Clinical Proteomic Tumor Analysis Consortium) data. The results indicated that RNA and protein expression were consistent. For example, the consistent changes of the protein-coding genes ANXA4 and NRGN at the transcriptional and translational levels suggest that they are directly involved in the pathological process of PDAC. Pseudogenes and lncRNAs accounted for 26.53% and 18.37% of the biomarker list, respectively, suggesting that the regulatory role of ncRNAs (such as CCDC13-AS1) in PDAC deserves in-depth exploration and may involve epigenetic or post-transcriptional regulatory mechanisms. GO analysis revealed that the candidate genes were enriched in multiple biological processes closely related to cancer development. The survival analysis revealed that some of these genes had prognostic value for PDAC. The analysis of single-cell RNA expression showed that the EvlRNA signature was mainly expressed in malignant cells in the PDAC tumor microenvironment, which further supports the reliability of the experimental results and suggests a possible basis for the accurate classification achieved by the model.
Our method has good scalability and adaptability. Although the focus of this study was on PDAC, the core analytical strategy possesses transferable utility across diverse oncological contexts and various pathophysiological conditions. Our method extracts EvlRNA-index information from some biological states and can be applied to better understand other biological states. However, fully developing and validating the model to address different clinical applications will require additional samples in these specific populations. In contrast to conventional approaches that rely on disease-specific biomarker identification, our genome-wide screening methodology facilitates impartial detection of biological signatures independent of pathological specificity. This technical advantage permits potential adaptation for evaluating non-pathological physiological variations. Furthermore, the method demonstrated potential for identifying distinctive genomic fingerprints associated with various disease entities, enabling machine learning algorithms to distinguish cancer subtypes through the EvlRNA-index. Efforts are under way to assess these assumptions.
There were limitations to our study. Firstly, although the research was based on a multi-center sample for model construction and validation, the generalization ability, diagnostic efficacy and stability of the model need to be verified in a larger, prospective independent cohort to evaluate its practical application in a broader population. Secondly, the second layer of the model focuses on differentiating PDAC from CP. However, in clinical practice, the differential diagnosis of PDAC also needs to consider other benign pancreatic diseases and adjacent organ tumors that are easily confused with PDAC. Including a more comprehensive spectrum of control diseases will therefore allow a fuller evaluation of the differential diagnostic ability of the model. Thirdly, although the research revealed the association between key biomarkers and PDAC by visualizing feature weights, thus enhancing the interpretability of the model, the specific biological functions of these important EvlRNA markers and their molecular mechanisms in the occurrence and development of PDAC have not been fully explored. Subsequent experimental studies (such as functional verification) are needed to clarify them.
CONCLUSION
We proposed an interpretable machine learning framework for early diagnostic prediction of PDAC based on plasma EvlRNA, called ECD-itMLF. In this framework, a two-layer classifier was established: the first layer identified normal and non-normal samples, and the second layer identified whether the samples belonged to PDAC or CP. A new concept, the EvlRNA-index, was proposed, and the cancer diagnostic model built on it achieved good diagnostic performance. We also examined the interpretability of the entire machine learning framework and the close correlation between important features and actual biological processes, which should help biomedical researchers understand the model. Finally, this study provides new insights into the clinical value of EvlRNA.
Footnotes
Provenance and peer review: Invited article; Externally peer reviewed.
Peer-review model: Single blind
Specialty type: Oncology
Country of origin: China
Peer-review report’s classification
Scientific Quality: Grade B, Grade B
Novelty: Grade B, Grade B
Creativity or Innovation: Grade B, Grade B
Scientific Significance: Grade B, Grade B
P-Reviewer: Yan SY, PhD, Associate Professor, China S-Editor: Wu S L-Editor: A P-Editor: Xu J
Zhang Y, Yang J, Li H, Wu Y, Zhang H, Chen W. Tumor markers CA19-9, CA242 and CEA in the diagnosis of pancreatic cancer: a meta-analysis. Int J Clin Exp Med. 2015;8:11683-11691.
Del Re M, Biasco E, Crucitta S, Derosa L, Rofi E, Orlandini C, Miccoli M, Galli L, Falcone A, Jenster GW, van Schaik RH, Danesi R. The Detection of Androgen Receptor Splice Variant 7 in Plasma-derived Exosomal RNA Strongly Predicts Resistance to Hormonal Therapy in Metastatic Prostate Cancer Patients. Eur Urol. 2017;71:680-687.
Bahrambeigi V, Lee JJ, Branchi V, Rajapakshe KI, Xu Z, Kui N, Henry JT, Kun W, Stephens BM, Dhebat S, Hurd MW, Sun R, Yang P, Ruppin E, Wang W, Kopetz S, Maitra A, Guerrero PA. Transcriptomic Profiling of Plasma Extracellular Vesicles Enables Reliable Annotation of the Cancer-Specific Transcriptome and Molecular Subtype. Cancer Res. 2024;84:1719-1732.
Yu S, Li Y, Liao Z, Wang Z, Wang Z, Li Y, Qian L, Zhao J, Zong H, Kang B, Zou WB, Chen K, He X, Meng Z, Chen Z, Huang S, Wang P. Plasma extracellular vesicle long RNA profiling identifies a diagnostic signature for the detection of pancreatic ductal adenocarcinoma. Gut. 2020;69:540-550.
Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, Griss J, Sevilla C, Matthews L, Gong C, Deng C, Varusai T, Ragueneau E, Haider Y, May B, Shamovsky V, Weiser J, Brunson T, Sanati N, Beckman L, Shao X, Fabregat A, Sidiropoulos K, Murillo J, Viteri G, Cook J, Shorser S, Bader G, Demir E, Sander C, Haw R, Wu G, Stein L, Hermjakob H, D'Eustachio P. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50:D687-D692.
Gao Y, Zandieh K, Zhao K, Khizanishvili N, Fazio PD, Yu X, Schulte L, Aillaud M, Chung HR, Ball Z, Meixner M, Bauer UM, Bartsch DK, Buchholz M, Lauth M, Nimsky C, Cook L, Bartsch JW. The long non-coding RNA NEAT1 contributes to aberrant STAT3 signaling in pancreatic cancer and is regulated by a metalloprotease-disintegrin ADAM8/miR-181a-5p axis. Cell Oncol (Dordr). 2025;48:391-409.
Feng Y, Gao L, Cui G, Cao Y. LncRNA NEAT1 facilitates pancreatic cancer growth and metastasis through stabilizing ELF3 mRNA. Am J Cancer Res. 2020;10:237-248.
Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, Benfeitas R, Arif M, Liu Z, Edfors F, Sanli K, von Feilitzen K, Oksvold P, Lundberg E, Hober S, Nilsson P, Mattsson J, Schwenk JM, Brunnström H, Glimelius B, Sjöblom T, Edqvist PH, Djureinovic D, Micke P, Lindskog C, Mardinoglu A, Ponten F. A pathology atlas of the human cancer transcriptome. Science. 2017;357:eaan2507.
Peng J, Sun BF, Chen CY, Zhou JY, Chen YS, Chen H, Liu L, Huang D, Jiang J, Cui GS, Yang Y, Wang W, Guo D, Dai M, Guo J, Zhang T, Liao Q, Liu Y, Zhao YL, Han DL, Zhao Y, Yang YG, Wu W. Single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res. 2019;29:725-738.