Retrospective Cohort Study Open Access
Copyright ©The Author(s) 2024. Published by Baishideng Publishing Group Inc. All rights reserved.
World J Gastrointest Oncol. Sep 15, 2024; 16(9): 3839-3850
Published online Sep 15, 2024. doi: 10.4251/wjgo.v16.i9.3839
Construction and evaluation of a liver cancer risk prediction model based on machine learning
Ying-Ying Wang, Wan-Xia Yang, Qia-Jun Du, Zhen-Hua Liu, Ming-Hua Lu, Chong-Ge You, Laboratory Medicine Center, The Second Hospital & Clinical Medical School, Lanzhou University, Lanzhou 730030, Gansu Province, China
ORCID number: Chong-Ge You (0000-0002-3671-596X).
Co-first authors: Ying-Ying Wang and Wan-Xia Yang.
Author contributions: Wang YY and Yang WX served as co-first authors, conceiving and designing the study; Liu ZH, Lu MH, and Du QJ collected the research data; You CG supervised the entire study and revised the manuscript; All authors contributed to the article and approved the submitted version.
Supported by the Cuiying Scientific and Technological Innovation Program of the Second Hospital, No. CY2021-BJ-A16 and No. CY2022-QN-A18; and Clinical Medical School of Lanzhou University and Lanzhou Science and Technology Development Guidance Plan Project, No. 2023-ZD-85.
Institutional review board statement: The study was reviewed and approved for publication by the authors’ Institutional Review Board (Medical Ethics Committee of The Second Hospital & Clinical Medical School, Lanzhou University, China; No. 2024A-075).
Informed consent statement: Informed consent was obtained from all individuals included in this study.
Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.
Data sharing statement: The original anonymous data set is available on request from the corresponding author at youchg@lzu.edu.cn.
STROBE statement: The authors have read the STROBE Statement—checklist of items, and the manuscript was prepared and revised according to the STROBE Statement—checklist of items.
Open-Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/licenses/by-nc/4.0/
Corresponding author: Chong-Ge You, PhD, Chief, Laboratory Medicine Center, The Second Hospital & Clinical Medical School, Lanzhou University, No. 82 Cuiyingmen, Chengguan District, Lanzhou 730030, Gansu Province, China. youchg@lzu.edu.cn
Received: March 29, 2024
Revised: July 31, 2024
Accepted: August 7, 2024
Published online: September 15, 2024
Processing time: 163 Days and 23.7 Hours

Abstract
BACKGROUND

Liver cancer is one of the most prevalent malignant tumors worldwide, and its early detection and treatment are crucial for enhancing patient survival rates and quality of life. However, the early symptoms of liver cancer are often not obvious, resulting in a late-stage diagnosis in many patients, which significantly reduces the effectiveness of treatment. Developing a highly targeted, widely applicable, and practical risk prediction model for liver cancer is crucial for enhancing the early diagnosis and long-term survival rates among affected individuals.

AIM

To develop a liver cancer risk prediction model by employing machine learning techniques, and subsequently assess its performance.

METHODS

In this study, a total of 550 patients were enrolled, with 190 hepatocellular carcinoma (HCC) and 195 cirrhosis patients serving as the training cohort, and 83 HCC and 82 cirrhosis patients forming the validation cohort. Logistic regression (LR), support vector machine (SVM), random forest (RF), and least absolute shrinkage and selection operator (LASSO) regression models were developed in the training cohort. Model performance was assessed in the validation cohort. Additionally, this study conducted a comparative evaluation of the diagnostic efficacy between the ASAP model and the model developed in this study using receiver operating characteristic curve, calibration curve, and decision curve analysis (DCA) to determine the optimal predictive model for assessing liver cancer risk.

RESULTS

Six variables, namely age; white blood cell, red blood cell, and platelet counts; and alpha-fetoprotein and protein induced by vitamin K absence or antagonist II levels, were used to develop LR, SVM, RF, and LASSO regression models. The RF model exhibited superior discrimination, with an area under the curve (AUC) of 0.969 in the training set and 0.858 in the validation set. These values exceeded those of the LR (0.850 and 0.827), SVM (0.860 and 0.803), LASSO regression (0.845 and 0.831), and ASAP (0.866 and 0.813) models. Furthermore, calibration and DCA indicated that the RF model exhibited robust calibration and clinical validity.

CONCLUSION

The RF model demonstrated excellent prediction capabilities for HCC and can facilitate early diagnosis of HCC in clinical practice.

Key Words: Hepatocellular carcinoma; Cirrhosis; Prediction model; Machine learning; Random forest

Core Tip: We constructed a prediction model for hepatocellular carcinoma with reliable and effective clinical diagnostic capacity. In the training cohort (n = 385), machine learning models were developed based on six variables including age; white blood cell, red blood cell, and platelet counts; and alpha-fetoprotein and protein induced by vitamin K absence or antagonist II levels. The performance of these models was assessed in an independent validation cohort of 165 subjects. We compared our model with the ASAP model using receiver operating characteristic curve, calibration, and decision curve analysis. Our findings demonstrated that the random forest model exhibited strong discriminatory power, calibration performance, and clinical utility.



INTRODUCTION

Liver cancer is one of the most prevalent malignancies globally, with hepatocellular carcinoma (HCC) being its primary histological subtype. Characterized by high incidence and mortality rates, HCC poses a significant health burden. China has the heaviest burden of HCC globally[1,2]. According to statistics, there were 400000 new cases of HCC and 390000 deaths due to HCC in China in 2020[3,4]. Reinforced surveillance of high-risk populations for detection and timely intervention of HCC is crucial for enhancing the survival rate and quality of life of patients with HCC. The current guidelines recommend that high-risk groups of HCC undergo ultrasound (US) or serum alpha-fetoprotein (AFP) testing at least every 6 months to monitor the progression of their disease[5]. However, studies have revealed that the sensitivity and specificity of AFP and US detection for HCC are relatively low[6,7]. Moreover, US not only demands proficient operational skills and diagnostic expertise from test personnel but also faces challenges in achieving result standardization. Therefore, a more intelligent, convenient, and accurate method for monitoring HCC using serum biomarkers that can be standardized with test results that are easy to analyze and interpret is needed. In recent years, the impressive data processing capability of machine learning has resulted in its extensive application in clinical practice, facilitating precise predictions, accurate diagnoses, and effective prognosis assessments for diverse cancer types[8].

Machine learning refers to the extensive utilization of advanced statistical algorithms for comprehensively analyzing vast amounts of data, with the aim of extracting concealed information and constructing relevant models accordingly. Ultimately, these models can be used to guide clinical practice[9,10]. Multiple studies have shown that integrating multiple detection indicators with machine learning techniques offers superior diagnostic accuracy and sensitivity compared to relying on a single indicator[11-13]. This study aimed to develop a prediction model for HCC, using highly standardized serum biomarkers with minimal detection requirements and excellent reproducibility, ultimately aiming to furnish clinicians with a more precise and dependable risk prediction and screening tool.

MATERIALS AND METHODS
Study subjects

A training cohort comprising 190 patients with HCC and 195 patients with cirrhosis was retrospectively enrolled from May 2021 to June 2022 at The Second Hospital & Clinical Medical School, Lanzhou University, with the aim of establishing a model for predicting the risk of developing HCC in patients with cirrhosis. Furthermore, a validation cohort consisting of 83 patients with HCC and 82 patients with cirrhosis admitted to the same hospital between January 2023 and December 2023 was used to assess and validate the diagnostic efficacy of this model.

Inclusion criteria: The diagnosis of HCC was established when one or more of the following criteria were met: (1) Pathological examination revealing HCC features including abnormal hepatocyte proliferation, disordered arrangement, enlarged nuclear-cytoplasmic ratio, and mitotic activity and immunohistochemical staining showing positive expression of markers such as glypican-3; (2) ≤ 1 cm liver nodule on at least one imaging modality and gadolinium-ethoxybenzyl diethylenetriamine pentaacetic acid (Gd-EOB-DTPA) magnetic resonance imaging (MRI) showing HCC's "fast-in, fast-out" enhancement pattern; (3) 1-2 cm liver nodule identified on at least two of four imaging modalities (dynamic contrast-enhanced MRI (DCE-MRI), dynamic contrast-enhanced computed tomography (DCE-CT), contrast-enhanced US (CEUS), and Gd-EOB-DTPA MRI) concurrently displaying typical HCC characteristics; (4) Intrahepatic nodule > 2 cm on at least one of four imaging modalities (DCE-MRI, DCE-CT, CEUS, Gd-EOB-DTPA MRI) showing typical HCC features; and (5) No palpable or visible nodules but an AFP level exceeding 400 μg/L and imaging studies showing typical HCC imaging lesions more than once. Cirrhosis was diagnosed when one or more of the following criteria were met: (1) Histopathological diagnosis of liver cirrhosis; and (2) Abdominal US, CT, or MRI showing altered liver volume, right-left lobe disproportion, wavy/serrated capsule, widened fissures, heterogeneous signals, dilated portal vein, and collateral circulation.

Exclusion criteria: Patients were excluded if any of the following were observed: (1) HCC combined with other tumors; (2) Samples that could not be collected due to insufficient quantity or inadequate quality; (3) Anticoagulation therapy such as warfarin initiated within a month; (4) Liver metastases from other tumors; or (5) Therapeutic interventions for HCC, such as surgical procedures, ablation techniques, radiation therapy, or chemotherapy.

Basic information collection and serum sample testing

We collected the basic information and routine blood test data of each subject. Peripheral blood samples (4 mL) were collected in red-topped ordinary vacuum blood collection tubes (Jinxing, Wuhan, Hubei Province, China) and centrifuged at 4000 rpm for 10 minutes to collect serum. A Cobas e 801 analyzer (Roche, Basel, Switzerland) was used to measure the serum concentrations of AFP, protein induced by vitamin K absence or antagonist II (PIVKA-II), carcinoembryonic antigen (CEA), carbohydrate antigen 19-9 (CA199), and carbohydrate antigen 125 (CA125) within 2 hours. All testing reagents met quality control standards.

Statistical analysis

SPSS statistical software (version 25.0) was used for statistical description, normality test, and variance analysis of the data. Normally distributed data were presented as mean ± SD, while non-normally distributed data were expressed as M (P25, P75). Qualitative data underwent χ2 test analysis, and quantitative data were assessed using the Kruskal-Wallis test. The statistical software R (version 4.3.2) was used for constructing logistic regression (LR), support vector machine (SVM), random forest (RF), and least absolute shrinkage and selection operator (LASSO) regression models, and for comparing and assessing the predictive capabilities of these models. The "rms", "e1071", and "randomForest" packages were employed for model construction and calibration curve plotting, whereas the "pROC" package was utilized for receiver operating characteristic (ROC) curve visualization. Additionally, the "rmda" package was applied to conduct decision curve analysis (DCA). All statistical tests were two-tailed, and P < 0.05 was considered statistically significant.

RESULTS
Baseline characteristics of subjects

Data were collected from 550 subjects, with the training set comprising 385 subjects and the validation set 165 subjects (Figure 1). The proportion of patients with HCC was well-balanced between the two sets (49.35% and 50.30% in the training and validation sets, respectively, P = 0.789). Discrepancies among the indicators of the subjects between the training and validation sets are presented in Table 1. The age range of both patient cohorts was predominantly between 47 and 59 years old, with male patients forming a substantial majority (male-to-female ratios were approximately 3:1 in the training set and 2.5:1 in the validation set). Furthermore, there was no statistically significant difference in the indicators between the two sets (P > 0.05).

Figure 1 Study design. HCC: Hepatocellular carcinoma; LR: Logistic regression; SVM: Support vector machine; RF: Random forest; LASSO: Least absolute shrinkage and selection operator; ROC: Receiver operating characteristic; DCA: Decision curve analysis.
Table 1 Discrepancy analysis of patient study metrics between the training and validation cohorts.
| Variables | Total (n = 550) | Training set (n = 385) | Validation set (n = 165) | Z/χ2 | P value |
| --- | --- | --- | --- | --- | --- |
| Age | 53 (47, 59) | 53 (47, 59) | 53 (48, 58) | Z = -0.42 | 0.68 |
| WBC (×10^9/L) | 4.09 (2.98, 5.70) | 4.06 (2.97, 5.65) | 4.21 (3.00, 5.76) | Z = -0.71 | 0.48 |
| RBC (×10^12/L) | 4.43 (3.83, 4.94) | 4.46 (3.84, 4.96) | 4.36 (3.83, 4.91) | Z = -0.43 | 0.67 |
| HB (g/L) | 141.00 (121.00, 156.00) | 142.00 (121.00, 157.00) | 137.00 (121.00, 155.00) | Z = -1.00 | 0.32 |
| PLT (×10^9/L) | 88.00 (55.00, 138.00) | 85.00 (55.00, 136.00) | 92.00 (56.00, 142.00) | Z = -0.88 | 0.38 |
| AFP (ng/mL) | 7.42 (2.96, 110.14) | 7.44 (2.95, 88.46) | 7.27 (3.07, 167.59) | Z = -0.21 | 0.84 |
| PIVKA-II (mAU/mL) | 32.55 (20.48, 1555.89) | 33.85 (20.89, 1839.88) | 28.64 (19.52, 1157.19) | Z = -0.85 | 0.39 |
| CEA (ng/mL) | 2.22 (1.42, 3.40) | 2.21 (1.35, 3.36) | 2.35 (1.54, 3.44) | Z = -1.28 | 0.20 |
| CA199 (ng/mL) | 19.70 (10.43, 38.10) | 19.75 (10.60, 38.92) | 18.60 (9.37, 34.30) | Z = -0.88 | 0.38 |
| CA125 (ng/mL) | 25.41 (13.18, 123.00) | 23.91 (12.77, 103.25) | 27.90 (14.70, 148.90) | Z = -1.24 | 0.22 |
| Sex, n (%) | | | | χ2 = 0.91 | 0.34 |
| Male | 406 (74.22) | 288 (75.39) | 118 (71.52) | | |
| Female | 141 (25.78) | 94 (24.61) | 47 (28.48) | | |
Selection of modeling indicators

The training set was subjected to univariate analysis. The results revealed statistically significant differences in gender, age, white blood cell (WBC) and red blood cell (RBC) counts, hemoglobin level, platelet (PLT) count, and AFP, PIVKA-II and CA125 levels between patients with HCC and those with liver cirrhosis (P < 0.05). Notably, except for CA125, higher values for all indicators were observed in patients with HCC than in patients with cirrhosis (Table 2). The statistically significant indicators identified in the univariate analysis were incorporated into the LASSO regression analysis using 10-fold cross-validation. When the lambda value was 0.000766 [log (λ) = -3.12], six parameters exhibited statistically significant correlations with occurrence of HCC: Age; WBC, RBC, and PLT counts; and AFP and PIVKA-II levels (Figure 2A and B). These indicators were then ranked in terms of their importance for predicting HCC occurrence using the feature importance measure from the RF classification model. When the number of trees in the RF model reached 319, the minimum out-of-bag error estimate was 18.59%, indicating good generalization performance of the model. Based on evaluations of variable importance using mean decrease in accuracy and mean decrease in Gini metrics, age, WBC, RBC, PLT, AFP, and PIVKA-II emerged as the top-ranked indicators (Figure 2C and D). Therefore, this study ultimately selected age, WBC, RBC, PLT, AFP, and PIVKA-II as modeling indicators.
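For orientation, the LASSO step described above can be read against the generic penalized logistic regression objective. This is the standard textbook form rather than a formula taken from the study; the symbols (n subjects, p candidate indicators, outcome y_i coded 1 for HCC and 0 for cirrhosis, predictor vector x_i) are notation introduced here for illustration.

```latex
\hat{\beta} \;=\; \arg\min_{\beta_0,\,\beta}
\left\{
  -\frac{1}{n}\sum_{i=1}^{n}
  \left[ y_i\left(\beta_0 + x_i^{\top}\beta\right)
        - \ln\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\},
\qquad
\log_{10}(0.000766) \approx -3.12
```

Indicators whose coefficients are shrunk exactly to zero at the chosen λ drop out of the model. The reported pair of λ = 0.000766 and log (λ) = -3.12 is numerically consistent with a base-10 logarithm.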

Figure 2 Revising modeling parameters in the training set. A: Ten-fold cross-validation for tuning parameter selection in the least absolute shrinkage and selection operator (LASSO) model; B: LASSO coefficient curve of 9 variables; C: The relationship between the quantity of decision trees and the average out-of-bag evaluation; D: The ranking of variable importance based on differences in univariate analysis. WBC: White blood cell; RBC: Red blood cell; PLT: Platelet; AFP: Alpha-fetoprotein; PIVKA: Protein induced by vitamin K absence or antagonist; CA: Carbohydrate antigen; CEA: Carcinoembryonic antigen; HB: Hemoglobin level.
Table 2 Univariate analysis of study parameters in the hepatocellular carcinoma and cirrhosis groups, n (%).
| Variables | Total (n = 385) | Cirrhosis (n = 195) | HCC (n = 190) | Z/χ2 | P value |
| --- | --- | --- | --- | --- | --- |
| Age | 53.00 (47.00, 59.00) | 51.00 (44.00, 56.00) | 55.00 (49.00, 60.00) | Z = -4.66 | < 0.001 |
| WBC (×10^9/L) | 4.06 (2.97, 5.65) | 3.50 (2.52, 4.54) | 4.65 (3.69, 6.40) | Z = -6.87 | < 0.001 |
| RBC (×10^12/L) | 4.46 (3.84, 4.96) | 4.13 (3.54, 4.79) | 4.70 (4.11, 5.07) | Z = -5.18 | < 0.001 |
| HB (g/L) | 142.00 (121.00, 157.00) | 134.00 (113.00, 151.75) | 149.00 (129.75, 160.00) | Z = -4.76 | < 0.001 |
| PLT (×10^9/L) | 85.00 (55.00, 136.00) | 64.50 (48.25, 100.00) | 122.00 (70.50, 168.25) | Z = -7.95 | < 0.001 |
| AFP (ng/mL) | 7.44 (2.95, 88.46) | 3.79 (2.34, 9.32) | 49.16 (5.84, 1220.00) | Z = -9.35 | < 0.001 |
| PIVKA-II (mAU/mL) | 33.85 (20.89, 1839.88) | 22.59 (16.30, 32.28) | 854.40 (37.98, 14001.70) | Z = -11.58 | < 0.001 |
| CEA (ng/mL) | 2.21 (1.35, 3.36) | 2.08 (1.28, 3.39) | 2.27 (1.42, 3.36) | Z = -1.41 | 0.16 |
| CA199 (ng/mL) | 19.75 (10.60, 38.92) | 18.07 (9.85, 34.82) | 21.25 (12.70, 43.17) | Z = -1.58 | 0.115 |
| CA125 (ng/mL) | 23.91 (12.77, 103.25) | 33.51 (14.52, 143.25) | 19.70 (12.24, 55.62) | Z = -3.05 | 0.002 |
| Sex, n (%) | | | | χ2 = 9.90 | 0.002 |
| Male | 288 (75.39) | 130 (68.42) | 158 (82.29) | | |
| Female | 94 (24.61) | 60 (31.58) | 34 (17.71) | | |
Model construction and validation

Based on the training set data, six variables including age, WBC, RBC, PLT, AFP, and PIVKA-II were used to construct LR, SVM, RF, and LASSO regression models. The ASAP regression model was constructed based on the algorithm proposed by Yang et al[13]: ln [P/(1 - P)] = -7.57711770 + 0.04666357 × age - 0.57611693 × sex + 0.42243533 × ln (AFP) + 1.10518910 × ln (PIVKA-II).
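As an illustration of how the ASAP formula quoted above turns four inputs into a predicted probability, the following minimal Python sketch evaluates the linear predictor and inverts the logit. The coefficients are copied from the formula; the function names, the units (age in years, AFP in ng/mL, PIVKA-II in mAU/mL), and the 0/1 coding of sex are assumptions made here, and the sex coding in particular should be checked against the original ASAP publication before any real use.

```python
import math

# Coefficients copied from the ASAP formula quoted in the text.
INTERCEPT = -7.57711770
B_AGE = 0.04666357
B_SEX = -0.57611693
B_LN_AFP = 0.42243533
B_LN_PIVKA = 1.10518910


def asap_logit(age, sex, afp, pivka):
    """Return the linear predictor ln[P / (1 - P)] of the ASAP model.

    Assumptions (not stated in this article): age in years, AFP in ng/mL,
    PIVKA-II in mAU/mL, and sex passed as a 0/1 indicator whose coding
    must be taken from the original ASAP publication.
    """
    if afp <= 0 or pivka <= 0:
        raise ValueError("AFP and PIVKA-II must be positive to take ln()")
    return (INTERCEPT
            + B_AGE * age
            + B_SEX * sex
            + B_LN_AFP * math.log(afp)
            + B_LN_PIVKA * math.log(pivka))


def asap_probability(age, sex, afp, pivka):
    """Invert the logit to obtain the predicted probability P."""
    return 1.0 / (1.0 + math.exp(-asap_logit(age, sex, afp, pivka)))


if __name__ == "__main__":
    # Hypothetical patient values, for illustration only.
    print(asap_probability(age=55, sex=0, afp=20.0, pivka=100.0))
```

Because AFP and PIVKA-II enter on the natural-log scale, a tenfold rise in either marker shifts the logit by a fixed amount regardless of the starting level.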

The ROC curves showed that the RF model had the best discrimination. Its performance in the training and validation sets, respectively, was as follows: area under the curve (AUC) (95%CI), 0.969 (0.955-0.984) and 0.858 (0.800-0.917); accuracy, 92.15% and 80.00%; sensitivity, 88.02% and 75.90%; specificity, 96.32% and 84.15%; positive predictive value, 96.02% and 82.89%; negative predictive value, 88.83% and 77.53%; and F1 score, 0.92 and 0.79 (Table 3, Figure 3A and B). The calibration curves revealed that all models exhibited excellent calibration. The LR model demonstrated superior accuracy in the training set (Figure 3C). However, in the validation set, the LASSO model performed best, with predicted values closely matching the actual observations (Figure 3D). DCA demonstrated the superior clinical utility of the RF model compared with the other models. In the training set, using the RF model to make intervention decisions yielded a higher net benefit than intervening in all patients or in none when the patient's risk threshold exceeded 2% (Figure 3E). In the validation set, the RF model achieved a greater net benefit across a risk threshold range of 10% to 90% (Figure 3F). After evaluating discrimination, calibration, and clinical effectiveness in both the training and validation sets, the RF model was chosen as the optimal prediction model for HCC in this study (Figure 3G).
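The net benefit plotted in a decision curve can be sketched in a few lines of Python. The formula used is the standard one for decision curve analysis (true positives minus false positives weighted by the odds of the threshold probability, both divided by the sample size); the study itself used an R package, so the function names and the toy data below are hypothetical and serve only to show the calculation.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of intervening in patients whose predicted risk >= threshold.

    y_true: list of 0/1 outcomes (1 = HCC); risk: predicted probabilities.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: intervene in every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


if __name__ == "__main__":
    # Toy data, not taken from the study.
    outcomes = [1, 1, 1, 0, 0, 0]
    predicted = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
    for pt in (0.1, 0.3, 0.5, 0.7):
        print(pt, net_benefit(outcomes, predicted, pt),
              net_benefit_treat_all(outcomes, pt))
```

The "intervene in none" strategy has a net benefit of zero at every threshold, so a model is useful at a given threshold only if its curve lies above both zero and the treat-all curve.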

Figure 3 Construction and evaluation of prediction models for hepatocellular carcinoma. A: The receiver operating characteristic (ROC) curve of the training set; B: The ROC curve of the validation set; C: The calibration curve of the training set; D: The calibration curve of the validation set; E: The decision curve analysis (DCA) curve of the training set; F: The DCA curve of the validation set; G: Comparison of area under the curve, sensitivity, and specificity between the models on both training and validation sets. The black diagonal line in the calibration curves represents the optimal prediction value, while the X-axis and Y-axis of DCA curves respectively represent the threshold probability and net benefit. LR: Logistic regression; SVM: Support vector machine; RF: Random forest; LASSO: Least absolute shrinkage and selection operator; AUC: Area under the curve.
Table 3 Performance of modeling indicators and prediction models in the diagnosis of hepatocellular carcinoma.
| Variables | AUC (95%CI) | Cut-off | Accuracy (%) | Se (%) | Sp (%) | PPV (%) | NPV (%) | F1 Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Age | 0.638 (0.582, 0.693) | 53.50 | 62.57 | 59.90 | 65.26 | 63.54 | 61.69 | 0.62 |
| WBC (×10^9/L) | 0.703 (0.651, 0.755) | 4.10 | 67.54 | 65.63 | 69.47 | 68.48 | 66.67 | 0.67 |
| RBC (×10^12/L) | 0.653 (0.598, 0.708) | 4.22 | 63.09 | 70.83 | 55.26 | 61.54 | 65.22 | 0.66 |
| PLT (×10^9/L) | 0.735 (0.685, 0.785) | 109.00 | 68.32 | 55.73 | 81.05 | 74.83 | 64.44 | 0.64 |
| AFP (ng/mL) | 0.776 (0.729, 0.823) | 19.05 | 73.30 | 59.90 | 86.84 | 82.14 | 68.18 | 0.69 |
| PIVKA-II (mAU/mL) | 0.842 (0.802, 0.882) | 90.91 | 79.32 | 69.79 | 88.95 | 86.45 | 74.45 | 0.77 |
| LR-training set | 0.850 (0.812, 0.888) | 0.47 | 78.01 | 63.02 | 93.16 | 90.30 | 71.37 | 0.74 |
| LR-validation set | 0.827 (0.764, 0.890) | 0.47 | 74.55 | 65.06 | 84.15 | 80.60 | 70.41 | 0.72 |
| SVM-training set | 0.860 (0.822, 0.898) | 0.54 | 80.10 | 71.35 | 88.95 | 86.71 | 75.45 | 0.78 |
| SVM-validation set | 0.803 (0.735, 0.871) | 0.54 | 72.73 | 68.67 | 76.83 | 75.00 | 70.79 | 0.72 |
| RF-training set | 0.969 (0.955, 0.984) | 0.52 | 92.15 | 88.02 | 96.32 | 96.02 | 88.83 | 0.92 |
| RF-validation set | 0.858 (0.800, 0.917) | 0.52 | 80.00 | 75.90 | 84.15 | 82.89 | 77.53 | 0.79 |
| LASSO-training set | 0.845 (0.806, 0.884) | 0.18 | 78.80 | 69.27 | 88.42 | 85.81 | 74.01 | 0.77 |
| LASSO-validation set | 0.831 (0.769, 0.893) | 0.18 | 72.73 | 68.67 | 76.83 | 75.00 | 70.79 | 0.72 |
| ASAP-training set | 0.866 (0.829, 0.903) | 0.93 | 82.20 | 73.96 | 90.53 | 88.75 | 77.48 | 0.81 |
| ASAP-validation set | 0.813 (0.747, 0.879) | 0.93 | 73.94 | 61.45 | 86.59 | 82.26 | 68.93 | 0.70 |
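The accuracy, sensitivity, specificity, predictive values, and F1 score reported for each model all derive from a 2 × 2 confusion matrix at the chosen cut-off. The Python sketch below shows that derivation. The function name is hypothetical, and the example counts (63 true positives, 20 false negatives, 69 true negatives, 13 false positives) are not reported in the article; they are back-calculated here from the validation cohort size (83 HCC, 82 cirrhosis) and the RF validation percentages given in the text, and are therefore an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Derive standard diagnostic metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Counts back-calculated from the reported RF validation percentages
    # (an assumption, not data taken from the article).
    rf_validation = diagnostic_metrics(tp=63, fn=20, tn=69, fp=13)
    for name, value in rf_validation.items():
        print(f"{name}: {value:.4f}")
```

With these counts the sketch reproduces the RF validation values stated in the text: accuracy 80.00%, sensitivity 75.90%, specificity 84.15%, positive predictive value 82.89%, negative predictive value 77.53%, and F1 score 0.79.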
Evaluation of the clinical application value of RF model

To conduct a comprehensive evaluation of the diagnostic capabilities of the RF model, this study compared it with six individual indicators. The RF-ROC curve demonstrated superior discrimination of the RF model, with the AUC reaching 0.969, which far surpassed the AUC values of age (0.638), WBC (0.703), RBC (0.653), PLT (0.735), AFP (0.776), and PIVKA-II (0.842). The diagnostic accuracy of the RF model was 92.15%, significantly outperforming individual markers such as age (62.57%), WBC (67.54%), RBC (63.09%), PLT (68.32%), AFP (73.30%), and PIVKA-II (79.32%). In comparison to the sensitivity and specificity of these individual markers: Age (59.90% and 65.26%), WBC (65.63% and 69.47%), RBC (70.83% and 55.26%), PLT (55.73% and 81.05%), AFP (59.90% and 86.84%), and PIVKA-II (69.79% and 88.95%), the RF model showed superior sensitivity (88.02%) and specificity (96.32%; Table 3, Figure 4A). The RF-DCA illustrated that the RF model, which integrated six indicator characteristics, offered superior clinical feasibility compared to any single marker (Figure 4B and C).
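The AUC values compared above have a direct probabilistic reading: the AUC is the probability that a randomly chosen patient with HCC receives a higher score than a randomly chosen patient with cirrhosis, with ties counted as one half. The study computed AUCs in R; the short Python sketch below only illustrates that pairwise definition, and the function name and toy scores are hypothetical.

```python
def auc_pairwise(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Runs in O(len(pos) * len(neg)) time, which is
    adequate for cohorts of a few hundred subjects.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    # Toy scores, not taken from the study.
    hcc_scores = [0.9, 0.8, 0.3]
    cirrhosis_scores = [0.6, 0.2, 0.1]
    print(auc_pairwise(hcc_scores, cirrhosis_scores))
```

The same function applies to a single marker such as AFP or PIVKA-II by passing raw concentrations instead of model probabilities, which is how a combined model and its individual indicators can be placed on one scale.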

Figure 4 Random forest model validation. A: The receiver operating characteristic curves of the individual variables included in the random forest (RF) model are compared with that of the overall model; B: The decision curve analysis curves for each variable included in the RF model are compared with that of the overall model; C: Comparison of area under the curve, sensitivity, and specificity between the individual variables and RF model. LR: Logistic regression; SVM: Support vector machine; RF: Random forest; LASSO: Least absolute shrinkage and selection operator; AUC: Area under the curve; WBC: White blood cell; RBC: Red blood cell; PLT: Platelet; AFP: Alpha-fetoprotein; PIVKA-II: Protein induced by vitamin K absence or antagonist II.
DISCUSSION

HCC is a liver malignancy characterized by a high incidence rate and a low rate of early diagnosis, which significantly impacts the prognosis of patients. In recent years, a variety of biomarkers, such as serum PIVKA-II, circulating tumor DNA, and microRNA, have been found to be useful in the diagnosis of HCC[14-18]. Although these biomarkers have shown diagnostic potential, the high cost of some of them has restricted their widespread clinical application. To further enhance diagnostic accuracy and specificity for HCC, numerous studies have sought to integrate these biomarkers into models for predicting HCC risk. Nevertheless, the specificity, diagnostic accuracy, and precision of these models still require further validation[11-13,19]. Thus, developing a cost-effective, targeted, accurate, and feasible HCC risk prediction model is crucial for enhancing diagnostic accuracy and minimizing the risk of missed diagnoses.

The present study employed biomarkers characterized by low detection cost and routine clinical applicability. Statistical analysis revealed significant correlations between age, WBC, RBC, PLT, AFP, and PIVKA-II levels and the development of HCC. The RBC count was significantly higher in patients with HCC than in patients with cirrhosis. Previous studies have demonstrated that patients with HCC exhibit upregulation of hypoxia inducible factor expression, which stimulates excessive production and secretion of erythropoietin, thereby leading to an elevated RBC count[20]. These findings provide further evidence supporting the role of RBC count as a risk indicator for HCC. WBCs and PLTs are commonly used as routine inflammatory markers to monitor inflammatory activity and progression in the body. Studies have found that WBC and PLT counts not only signify the extent of inflammation in patients with liver disease, but also indicate the degree of liver fibrosis[21-26]. Therefore, they play a crucial role in evaluating the severity of liver lesions in patients with HCC, offering information that facilitates clinical decision-making. Moreover, AFP and PIVKA-II, two extensively researched biomarkers, have demonstrated superior diagnostic accuracy for HCC across numerous studies[27-29]. The present study further reinforces their crucial role in the diagnosis of HCC.

In this study, we developed and evaluated LR, SVM, RF, and LASSO models based on the aforementioned six biomarkers. Compared with the LR, SVM, and LASSO models, the RF model demonstrated notably superior diagnostic accuracy, sensitivity, and specificity in diagnosing HCC. This performance can be attributed to inherent advantages of the RF algorithm, including high classification accuracy, minimal generalization error, rapid training speed, straightforward interpretability of results, and robustness in handling high-dimensional and imbalanced data[30,31]. In addition, when compared with existing HCC diagnostic methods (AFP, PIVKA-II, and the ASAP model)[27], the RF model exhibited superior diagnostic performance in detecting HCC, with a training-set AUC of 0.969 surpassing those of AFP (0.776), PIVKA-II (0.842), and the ASAP model (0.866). Numerous studies have indicated that RF-based disease prediction or diagnostic models, such as those for cervical cancer risk assessment[32], lung cancer prediction[33], and breast cancer diagnosis[34], exhibited remarkable accuracy in risk prediction outcomes. Consequently, these models have established their reliability for practical applications[35-38]. Therefore, integrating highly correlated indicators with the RF algorithm in a liver cancer prediction model will not only enable a more accurate and comprehensive analysis of liver cancer risk but also provide clinicians with more effective and precise tools for clinical judgment, thus contributing to early detection and treatment strategies for liver cancer.

The HCC prediction model based on the RF algorithm in this study possesses several advantages, including cost-effectiveness, high diagnostic accuracy, robust data processing capabilities, and ease of result analysis. By offering a dynamic monitoring and real-time warning tool, this model enables the assessment of the conditions of high-risk populations and facilitates timely interventions for HCC. Moreover, it facilitates efficient utilization of medical resources, assists clinicians in conducting diverse assessments of the conditions of patients with HCC, and has significant implications for early diagnosis and treatment to improve both survival rates and quality of life. Nonetheless, it is important to note that this study has certain limitations. Firstly, this was a retrospective study with inherent biases and confounding factors. Therefore, the prediction model requires further validation through large-scale cohort studies and prospective investigations to ascertain its clinical applicability. Secondly, given the limited sample size and the fact that this study was conducted at a single center, it is imperative to incorporate multi-center data for a comprehensive evaluation of the diagnostic efficacy of the proposed model.

CONCLUSION

In summary, the liver cancer risk prediction model developed in this study using machine learning exhibited exceptional diagnostic accuracy and holds significant clinical applicability, thereby offering novel approaches and avenues for the early diagnosis and risk assessment of liver cancer.

Footnotes

Provenance and peer review: Unsolicited article; Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Oncology

Country of origin: China

Peer-review report’s classification

Scientific Quality: Grade C

Novelty: Grade C

Creativity or Innovation: Grade C

Scientific Significance: Grade B

P-Reviewer: Karagiannakis DS S-Editor: Li L L-Editor: A P-Editor: Zhao S

References
1.  Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71:209-249.
2.  Vogel A, Meyer T, Sapisochin G, Salem R, Saborowski A. Hepatocellular carcinoma. Lancet. 2022;400:1345-1362.
3.  Organisation mondiale de la Santé. Latest global cancer data: Cancer burden rises to 19.3 million new cases and 10.0 million cancer deaths in 2020. Dec 15, 2020. [cited 3 August 2024]. Available from: https://www.iarc.fr/fr/news-events/latest-global-cancer-data-cancer-burden-rises-to-19-3-million-new-cases-and-10-0-million-cancer-deaths-in-2020.
4.  Rumgay H, Arnold M, Ferlay J, Lesi O, Cabasag CJ, Vignat J, Laversanne M, McGlynn KA, Soerjomataram I. Global burden of primary liver cancer in 2020 and predictions to 2040. J Hepatol. 2022;77:1598-1606.
5.  Ginès P, Krag A, Abraldes JG, Solà E, Fabrellas N, Kamath PS. Liver cirrhosis. Lancet. 2021;398:1359-1376.
6.  Lu Q, Li J, Cao H, Lv C, Wang X, Cao S. Comparison of diagnostic accuracy of Midkine and AFP for detecting hepatocellular carcinoma: a systematic review and meta-analysis. Biosci Rep. 2020;40.
7.  Colli A, Nadarevic T, Miletic D, Giljaca V, Fraquelli M, Štimac D, Casazza G. Abdominal ultrasound and alpha-foetoprotein for the diagnosis of hepatocellular carcinoma in adults with chronic liver disease. Cochrane Database Syst Rev. 2021;4:CD013346.
8.  Swanson K, Wu E, Zhang A, Alizadeh AA, Zou J. From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell. 2023;186:1772-1791.
9.  Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol. 2022;23:40-55.
10.  Choi RY, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis Sci Technol. 2020;9:14.
11.  Fan R, Papatheodoridis G, Sun J, Innes H, Toyoda H, Xie Q, Mo S, Sypsa V, Guha IN, Kumada T, Niu J, Dalekos G, Yasuda S, Barnes E, Lian J, Suri V, Idilman R, Barclay ST, Dou X, Berg T, Hayes PC, Flaherty JF, Zhou Y, Zhang Z, Buti M, Hutchinson SJ, Guo Y, Calleja JL, Lin L, Zhao L, Chen Y, Janssen HLA, Zhu C, Shi L, Tang X, Gaggar A, Wei L, Jia J, Irving WL, Johnson PJ, Lampertico P, Hou J. aMAP risk score predicts hepatocellular carcinoma development in patients with chronic hepatitis. J Hepatol. 2020;73:1368-1378.
12.  Johnson PJ, Pirrie SJ, Cox TF, Berhane S, Teng M, Palmer D, Morse J, Hull D, Patman G, Kagebayashi C, Hussain S, Graham J, Reeves H, Satomura S. The detection of hepatocellular carcinoma using a prospectively developed and validated model based on serological biomarkers. Cancer Epidemiol Biomarkers Prev. 2014;23:144-153.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 133]  [Cited by in F6Publishing: 200]  [Article Influence: 18.2]  [Reference Citation Analysis (0)]
13.  Yang T, Xing H, Wang G, Wang N, Liu M, Yan C, Li H, Wei L, Li S, Fan Z, Shi M, Chen W, Cai S, Pawlik TM, Soh A, Beshiri A, Lau WY, Wu M, Zheng Y, Shen F. A Novel Online Calculator Based on Serum Biomarkers to Detect Hepatocellular Carcinoma among Patients with Hepatitis B. Clin Chem. 2019;65:1543-1553.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 25]  [Cited by in F6Publishing: 54]  [Article Influence: 10.8]  [Reference Citation Analysis (0)]
14.  Johnson P, Zhou Q, Dao DY, Lo YMD. Circulating biomarkers in the diagnosis and management of hepatocellular carcinoma. Nat Rev Gastroenterol Hepatol. 2022;19:670-681.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 8]  [Cited by in F6Publishing: 111]  [Article Influence: 55.5]  [Reference Citation Analysis (0)]
15.  Piñero F, Dirchwolf M, Pessôa MG. Biomarkers in Hepatocellular Carcinoma: Diagnosis, Prognosis and Treatment Response Assessment. Cells. 2020;9.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 192]  [Cited by in F6Publishing: 260]  [Article Influence: 65.0]  [Reference Citation Analysis (0)]
16.  Pinto Marques H, Gomes da Silva S, De Martin E, Agopian VG, Martins PN. Emerging biomarkers in HCC patients: Current status. Int J Surg. 2020;82S:70-76.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 19]  [Cited by in F6Publishing: 34]  [Article Influence: 8.5]  [Reference Citation Analysis (0)]
17.  Ye Q, Ling S, Zheng S, Xu X. Liquid biopsy in hepatocellular carcinoma: circulating tumor cells and circulating tumor DNA. Mol Cancer. 2019;18:114.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 116]  [Cited by in F6Publishing: 213]  [Article Influence: 42.6]  [Reference Citation Analysis (0)]
18.  Zhou Y, Liu F, Ma C, Cheng Q. Involvement of microRNAs and their potential diagnostic, therapeutic, and prognostic role in hepatocellular carcinoma. J Clin Lab Anal. 2022;36:e24673.  [PubMed]  [DOI]  [Cited in This Article: ]  [Reference Citation Analysis (0)]
19.  Cai J, Chen L, Zhang Z, Zhang X, Lu X, Liu W, Shi G, Ge Y, Gao P, Yang Y, Ke A, Xiao L, Dong R, Zhu Y, Yang X, Wang J, Zhu T, Yang D, Huang X, Sui C, Qiu S, Shen F, Sun H, Zhou W, Zhou J, Nie J, Zeng C, Stroup EK, Zhang X, Chiu BC, Lau WY, He C, Wang H, Zhang W, Fan J. Genome-wide mapping of 5-hydroxymethylcytosines in circulating cell-free DNA as a non-invasive approach for early detection of hepatocellular carcinoma. Gut. 2019;68:2195-2205.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 181]  [Cited by in F6Publishing: 167]  [Article Influence: 33.4]  [Reference Citation Analysis (0)]
20.  Schödel J, Ratcliffe PJ. Mechanisms of hypoxia signalling: new implications for nephrology. Nat Rev Nephrol. 2019;15:641-659.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 127]  [Cited by in F6Publishing: 192]  [Article Influence: 38.4]  [Reference Citation Analysis (0)]
21.  Li S, Hong M, Tan HY, Wang N, Feng Y. Insights into the Role and Interdependence of Oxidative Stress and Inflammation in Liver Diseases. Oxid Med Cell Longev. 2016;2016:4234061.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 140]  [Cited by in F6Publishing: 207]  [Article Influence: 25.9]  [Reference Citation Analysis (0)]
22.  Pan GQ, Yang CC, Shang XL, Dong ZR, Li T. The causal relationship between white blood cell counts and hepatocellular carcinoma: a Mendelian randomization study. Eur J Med Res. 2022;27:278.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in F6Publishing: 7]  [Reference Citation Analysis (0)]
23.  Ren L, Chen D, Xu W, Xu T, Wei R, Suo L, Huang Y, Chen H, Liao W. Predictive potential of Nomogram based on GMWG for patients with hepatocellular carcinoma after radical resection. BMC Cancer. 2021;21:817.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 1]  [Cited by in F6Publishing: 1]  [Article Influence: 0.3]  [Reference Citation Analysis (0)]
24.  Zanetto A, Campello E, Bulato C, Gavasso S, Farinati F, Russo FP, Tormene D, Burra P, Senzolo M, Simioni P. Increased platelet aggregation in patients with decompensated cirrhosis indicates higher risk of further decompensation and death. J Hepatol. 2022;77:660-669.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 16]  [Cited by in F6Publishing: 48]  [Article Influence: 24.0]  [Reference Citation Analysis (0)]
25.  Luo J, Du Z, Liang D, Li M, Yin Y, Chen M, Yang L. Gamma-Glutamyl Transpeptidase-to-Platelet ratio predicts liver fibrosis in patients with concomitant chronic hepatitis B and nonalcoholic fatty liver disease. J Clin Lab Anal. 2022;36:e24596.  [PubMed]  [DOI]  [Cited in This Article: ]  [Reference Citation Analysis (0)]
26.  Zhou H, Long J, Hu H, Tian CY, Lin SD. Liver stiffness and serum markers for excluding high-risk varices in patients who do not meet Baveno VI criteria. World J Gastroenterol. 2019;25:5323-5333.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in CrossRef: 9]  [Cited by in F6Publishing: 9]  [Article Influence: 1.8]  [Reference Citation Analysis (0)]
27.  Feng H, Li B, Li Z, Wei Q, Ren L. PIVKA-II serves as a potential biomarker that complements AFP for the diagnosis of hepatocellular carcinoma. BMC Cancer. 2021;21:401.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 13]  [Cited by in F6Publishing: 72]  [Article Influence: 24.0]  [Reference Citation Analysis (0)]
28.  Kim DY, Toan BN, Tan CK, Hasan I, Setiawan L, Yu ML, Izumi N, Huyen NN, Chow PK, Mohamed R, Chan SL, Tanwandee T, Lee TY, Hai TTN, Yang T, Lee WC, Chan HLY. Utility of combining PIVKA-II and AFP in the surveillance and monitoring of hepatocellular carcinoma in the Asia-Pacific region. Clin Mol Hepatol. 2023;29:277-292.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in F6Publishing: 8]  [Reference Citation Analysis (0)]
29.  Tian S, Chen Y, Zhang Y, Xu X. Clinical value of serum AFP and PIVKA-II for diagnosis, treatment and prognosis of hepatocellular carcinoma. J Clin Lab Anal. 2023;37:e24823.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 1]  [Cited by in F6Publishing: 12]  [Article Influence: 6.0]  [Reference Citation Analysis (0)]
30.  Capitaine L, Genuer R, Thiébaut R. Random forests for high-dimensional longitudinal data. Stat Methods Med Res. 2021;30:166-184.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 12]  [Cited by in F6Publishing: 20]  [Article Influence: 5.0]  [Reference Citation Analysis (0)]
31.  Fox EW, Ver Hoef JM, Olsen AR. Comparing spatial regression to random forests for large environmental data sets. PLoS One. 2020;15:e0229509.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 15]  [Cited by in F6Publishing: 17]  [Article Influence: 4.3]  [Reference Citation Analysis (0)]
32.  Ijaz MF, Attique M, Son Y. Data-Driven Cervical Cancer Prediction Model with Outlier Detection and Over-Sampling Methods. Sensors (Basel). 2020;20.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 99]  [Cited by in F6Publishing: 76]  [Article Influence: 19.0]  [Reference Citation Analysis (0)]
33.  Wu Z, Huang T, Zhang S, Cheng D, Li W, Chen B. A prediction model to evaluate the pretest risk of malignancy in solitary pulmonary nodules: evidence from a large Chinese southwestern population. J Cancer Res Clin Oncol. 2021;147:275-285.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 7]  [Cited by in F6Publishing: 7]  [Article Influence: 1.8]  [Reference Citation Analysis (0)]
34.  Bian K, Zhou M, Hu F, Lai W. RF-PCA: A New Solution for Rapid Identification of Breast Cancer Categorical Data Based on Attribute Selection and Feature Extraction. Front Genet. 2020;11:566057.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 7]  [Cited by in F6Publishing: 7]  [Article Influence: 1.8]  [Reference Citation Analysis (0)]
35.  Pellegrino E, Jacques C, Beaufils N, Nanni I, Carlioz A, Metellus P, Ouafik L. Machine learning random forest for predicting oncosomatic variant NGS analysis. Sci Rep. 2021;11:21820.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 9]  [Cited by in F6Publishing: 23]  [Article Influence: 7.7]  [Reference Citation Analysis (0)]
36.  Briceño J, Ayllón MD, Ciria R. Machine-learning algorithms for predicting results in liver transplantation: the problem of donor-recipient matching. Curr Opin Organ Transplant. 2020;25:406-411.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 7]  [Cited by in F6Publishing: 9]  [Article Influence: 2.3]  [Reference Citation Analysis (0)]
37.  Ooka T, Johno H, Nakamoto K, Yoda Y, Yokomichi H, Yamagata Z. Random forest approach for determining risk prediction and predictive factors of type 2 diabetes: large-scale health check-up data in Japan. BMJ Nutr Prev Health. 2021;4:140-148.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 8]  [Cited by in F6Publishing: 8]  [Article Influence: 2.7]  [Reference Citation Analysis (0)]
38.  Bayramli I, Castro V, Barak-Corren Y, Madsen EM, Nock MK, Smoller JW, Reis BY. Temporally informed random forests for suicide risk prediction. J Am Med Inform Assoc. 2021;29:62-71.  [PubMed]  [DOI]  [Cited in This Article: ]  [Cited by in Crossref: 6]  [Cited by in F6Publishing: 3]  [Article Influence: 1.0]  [Reference Citation Analysis (0)]