Retrospective Study Open Access
Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.
World J Gastroenterol. May 28, 2026; 32(20): 112559
Published online May 28, 2026. doi: 10.3748/wjg.v32.i20.112559
Construction and efficacy validation of a multi-parametric contrast-enhanced ultrasound nomogram model for discriminating hepatic inflammatory lesions from malignancies
Si-Jie Mou, San-Mei Yu, Yan-Ni Xiang, Bo Tang, Department of Ultrasound, Taizhou First People’s Hospital, Taizhou 318020, Zhejiang Province, China
ORCID number: Bo Tang (0009-0000-1509-3797).
Author contributions: Mou SJ designed the study, collected and analyzed data, and wrote the manuscript; Mou SJ, Yu SM, Xiang YN and Tang B participated in the study’s conception and data collection; Mou SJ and Tang B participated in study design and provided guidance; all authors read and approved the final version.
Institutional review board statement: This study was approved by the Ethic Committee of Taizhou First People’s Hospital (Approval No. 2025-KY108-01).
Informed consent statement: Due to the retrospective and de-identified nature of this study, written informed consent was waived.
Conflict-of-interest statement: The authors have no financial relationships to disclose.
Data sharing statement: No additional data are available.
Corresponding author: Bo Tang, Department of Ultrasound, Taizhou First People’s Hospital, No. 218 Hengjie Road, Huangyan District, Taizhou 318020, Zhejiang Province, China. tangbo19852025@163.com
Received: December 5, 2025
Revised: January 20, 2026
Accepted: February 25, 2026
Published online: May 28, 2026
Processing time: 166 Days and 0.5 Hours

Abstract
BACKGROUND

Liver inflammatory lesions and malignancies often have overlapping manifestations and contrast-enhanced ultrasound (CEUS) enhancement modes, particularly in hepatitis or cirrhosis settings. Conventional biomarkers, such as alpha-fetoprotein (AFP) and the Model for End-Stage Liver Disease (MELD) score, have limited diagnostic accuracy in this setting. Therefore, an interpretable, multi-parameter CEUS-based machine learning (ML) tool may improve differential diagnosis and support clinical decision-making.

AIM

To construct a ML model based on multi-parameter features and clinical variables of CEUS for differential diagnosis of inflammatory liver lesions and malignancies, and to evaluate its diagnostic efficiency and interpretability.

METHODS

From January 2018 to November 2023, 621 cases of liver lesions diagnosed by pathology or clinical follow-up were retrospectively evaluated, including 306 cases of inflammatory lesions and 315 cases of malignancies. Based on a 6:2:2 ratio, the cases were assigned to training, validation, and test sets. Their basic information, disease history, CEUS parameters, and laboratory indicators were collected. We constructed five ML models, namely, Logistic regression (LR), decision tree, random forest, XGBoost, and support vector machine (SVM), based on the mlr3 framework in R. Hyperparameters were optimized by cross-validation and grid search, with area under the curve (AUC) as the major optimization goal. Accuracy, AUC, sensitivity, and specificity were determined for model performance evaluation, and interpretability was analyzed using the SHapley Additive exPlanations (SHAP) method. Moreover, model performance was compared with that of traditional diagnostic indexes (AFP score and MELD score).

RESULTS

Malignant and inflammatory lesion groups were markedly different in lesion morphology, uniform enhancement, cirrhosis, hepatitis, blood flow signals, calcification, age, lesion size, wash-in/out time, AFP score, and MELD score (all P < 0.001). During testing, LR and SVM performed best, with accuracy reaching 94.35%. The AUCs of LR on the training, validation, and test sets were 0.957, 0.958, and 0.965, respectively, superior to the AFP score (AUC = 0.908, 0.890, and 0.917, respectively) and MELD score (AUC = 0.844, 0.855, and 0.840, respectively; all P < 0.05). SHAP analysis showed that blood flow signals and wash-in time had the most significant influence on model prediction. The model stability evaluation indicated that LR had the best generalization ability (smallest overall stability difference: 0.00738).

CONCLUSION

The CEUS-based multi-parametric nomogram model shows excellent performance in the differential diagnosis of liver inflammatory lesions and malignant tumors, which is significantly superior to that of traditional biomarkers.

Key Words: Hepatic; Contrast-enhanced ultrasound; Machine learning; Differential diagnosis; SHapley Additive exPlanations; Inflammatory lesions; Malignant tumors

Core Tip: We developed and tested, on an independent hold-out test set, a multi-parametric contrast-enhanced ultrasound-based, SHapley Additive exPlanations-interpretable machine learning nomogram to differentiate hepatic inflammatory lesions from malignant tumors. Logistic regression showed the best overall generalization (area under the curve = 0.965; accuracy = 94.35%) and significantly outperformed alpha-fetoprotein and Model for End-Stage Liver Disease scores. Blood flow signals and wash-in time, as the key driving factors of model decision-making, can offer an intuitive and clinically relevant diagnostic reference basis.



INTRODUCTION

Liver diseases have shown a rising incidence rate and become a major public health problem worldwide[1]. Liver inflammatory lesions and malignancies are the most common types of liver lesions, showing high heterogeneity. Because their clinical manifestations and routine imaging features overlap in the early stage, these lesions are often difficult to distinguish clinically, which may lead to misdiagnosis or missed diagnosis[2]. This diagnostic dilemma is further exacerbated in patients with concomitant diseases, such as hepatitis, liver cirrhosis, or other underlying liver disease. As a result, determining the nature of a lesion becomes more complicated, which ultimately has a direct impact on clinical decision-making and treatment outcomes[3]. Thus, effectively distinguishing inflammatory from malignant liver lesions remains a critical clinical challenge. At present, the alpha-fetoprotein (AFP) score and the Model for End-Stage Liver Disease (MELD) score are commonly used as markers with certain diagnostic value. However, these markers have limited sensitivity and specificity; thus, they cannot meet the requirements of precision medicine[4]. Therefore, developing an artificial intelligence (AI) auxiliary diagnosis model that integrates multiple types of diagnostic information and possesses good generalization ability and interpretability is of great significance for improving the efficiency and accuracy of clinical diagnosis.

Among the many imaging methods, contrast-enhanced ultrasound (CEUS) is widely used in the dynamic evaluation of liver diseases because it offers real-time imaging, involves no ionizing radiation, and facilitates follow-up[5]. Compared with traditional B-mode ultrasound, CEUS can depict lesion perfusion through contrast agent enhancement and capture the enhancement pattern and timing of the lesion in different phases (e.g., wash-in/out time, enhancement uniformity), thereby reflecting the biological behavior of the lesion more accurately[6]. CEUS has unique value in differentiating liver tumors from some focal inflammatory lesions. However, the interpretation of CEUS images is highly dependent on the subjective experience of physicians; in addition, current clinical practice lacks systematic, multi-dimensional, multi-parameter quantitative evaluation standards[7]. This subjectivity weakens the reliability of diagnostic results and significantly limits the application of CEUS in large-scale clinical screening and AI-assisted diagnosis and treatment.

The application of AI technology in the medical field continues to expand, and machine learning (ML) has also been widely introduced into medical image analysis, clinical assisted diagnosis and disease risk prediction[8]. In the diagnosis of liver diseases, feature extraction and modeling are realized through algorithms. This approach optimizes model performance and helps analyze the association law of complex variables; classic models such as random forest (RF), support vector machine (SVM), and XGBoost can markedly improve the accuracy of predictions. However, limited by their complex structure and unexplained internal mechanisms, such models are regarded as “black boxes”. Thus, it is difficult to promote the use and application of these models in clinical practice[9]. This highlights the core role of enhancing model interpretability in the AI field. With the help of interpretability algorithms, such as SHapley Additive exPlanations (SHAP), this problem can be effectively resolved[10]. Based on game theory, SHAP quantifies the positive and negative contributions of each feature to a single prediction result and clarifies the model decision logic, thereby enhancing trust in model output and achieving transparency in clinical decisions.

In this study, we explored the diagnostic methods needed to distinguish inflammatory lesions from malignant lesions in liver disease. We constructed Logistic regression (LR), decision tree (DT), RF, XGBoost, and SVM models using multi-parameter CEUS and clinical characteristics. Training, validation, and test sets were utilized to test the classification performance and stability of these models. At the same time, the SHAP method was introduced to analyze the global and local explanatory features of the model, intuitively presenting the influence direction and contribution intensity of key variables (e.g., blood flow signals, enhancement time) in model decision-making. On this basis, the model was further compared with traditional diagnostic indicators (AFP score, MELD score) to verify its potential advantages in diagnostic accuracy and generalization ability. The purpose of this study was to provide an AI-driven diagnostic tool with both accuracy and interpretability, and to promote the practical application of AI in the classification and diagnosis of liver diseases.

MATERIALS AND METHODS
Calculation of sample size

According to the method described by Liao et al[11], we used the pmsampsize package in R to calculate the sample size required by LR. The model contained three candidate variables, with an expected area under the curve (AUC) of 0.989 and an anticipated event rate of 70%. In this study, 0.5 was used as the acceptable difference between adjusted and unadjusted R², and 0.05 was used as the allowable error range for intercept estimation. The calculation showed that a sample size of at least 323 cases was required to construct a robust model. With an event rate of 0.7, this sample size corresponded to 227 events and 75.37 events per predictor parameter (Cox-Snell R² = 0.6249, maximum R² = 0.705, Nagelkerke R² = 0.886). An estimated shrinkage factor of 0.947 was used to minimize over-fitting. These parameters met the requirements of model development and provided sufficient accuracy and stability for LR.
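The derived quantities in this paragraph can be cross-checked by hand. The following is a minimal Python sketch (the study itself used the pmsampsize package in R; this is not that package's code) that recomputes the number of events, the events per predictor parameter, the maximum Cox-Snell R² for a 70% event rate, and the Nagelkerke R² from the reported values. The variable names are illustrative assumptions.

```python
import math

n = 323            # minimum sample size reported in the text
prevalence = 0.7   # anticipated event rate reported in the text
n_params = 3       # candidate predictor parameters
r2_cs = 0.6249     # Cox-Snell R^2 reported in the text

# Expected number of events and events per predictor parameter (EPP).
events = n * prevalence
epp = events / n_params

# Maximum attainable Cox-Snell R^2 for a binary outcome with this prevalence:
# 1 - exp(2 * lnL_null / n), where lnL_null / n = p*ln(p) + (1-p)*ln(1-p).
lnl_null_per_obs = (prevalence * math.log(prevalence)
                    + (1 - prevalence) * math.log(1 - prevalence))
max_r2_cs = 1 - math.exp(2 * lnl_null_per_obs)

# Nagelkerke R^2 rescales the Cox-Snell R^2 by its maximum.
r2_nagelkerke = r2_cs / max_r2_cs

print(math.ceil(events), round(epp, 2), round(max_r2_cs, 3), round(r2_nagelkerke, 3))
# 227 75.37 0.705 0.886
```

The recomputed values agree with the figures quoted above, which suggests that 75.37 was derived from the unrounded event count (323 × 0.7 = 226.1).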

General data

From January 2018 to November 2023, 621 cases of liver lesions diagnosed by pathology or clinical follow-up were retrospectively evaluated, including 306 cases of inflammatory lesions and 315 cases of malignant lesions. The patients were divided into groups using a 6:2:2 ratio (training, validation, and test sets) for subsequent data analysis. This research was approved by the institutional ethics committee. Of note, malignant hepatic lesions, exemplified by hepatocellular carcinoma (HCC) and other neoplastic transformations, refer to cancerous growths in the liver, confirmed through pathological examinations and imaging evidence of tumorigenesis. Inflammatory hepatic conditions, such as acute or chronic hepatitis, liver inflammation, or hepatic injury, are inflammatory changes resulting from infections or autoimmune triggers.
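As an illustration of the 6:2:2 partition, the sketch below shuffles 621 case indices and cuts them into three sets. It is a Python sketch under stated assumptions: the seed, the rounding rule, and the use of a simple (unstratified) random split are not given in the text, and the function name `split_indices` is hypothetical. The resulting set sizes match those reported in Table 1.

```python
import random

def split_indices(n, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle 0..n-1 and cut into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # seed is an arbitrary assumption
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    # The test set takes whatever remains, so the three sets always cover n.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(621)
print(len(train), len(val), len(test))  # 373 124 124
```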

Case selection criteria

To be included, cases satisfied the following criteria: (1) Liver CEUS tests with satisfactory image quality and complete documentation of key parameters; (2) Histopathologically or radiologically (≥ 2 modalities) verified inflammatory lesions or malignancies; (3) Age ≥ 18 years; (4) Full access to clinical information, CEUS parameters, and laboratory indicators; and (5) Treatment-naïve lesions (no anticancer therapies, antibiotics, or interventional surgeries).

Criteria for exclusion were: (1) Presence of malignancy in other parts or metastatic spread; (2) Indeterminate lesion characterization (inflammatory vs cancerous) or uncertain diagnoses; (3) Severe cardiopulmonary dysfunction preventing safe contrast injection; (4) Poor-quality CEUS images (artifacts, obstructions, or incomplete data) that hindered modeling; or (5) Pregnant or breastfeeding women.

Clinical data collection

The clinical data used in this study were retrieved retrospectively from the electronic medical record system of our hospital. The collected basic information included age, sex, and body mass index (BMI) of adult patients (aged ≥ 18 years) who had undergone CEUS procedures. Disease histories (diabetes, hypertension, chronic hepatitis, and cirrhosis) were recorded. Clinicians made the diagnosis after systematic analysis of the medical history and a comprehensive review of the medical records. Radiology practitioners interpreted the CEUS images, recording lesion shape, enhancement mode, wash-in/out time, boundary characteristics, calcification, blood flow signal classification, anatomical position, maximum diameter, and echo features. Laboratory parameters were collected at first admission, including the AFP score and the liver function biochemical parameters used to calculate the MELD score[12]. Two researchers independently verified the data to ensure their integrity and accuracy.

ML

Using the mlr3 framework in R, five classic ML models (LR, DT[13], RF[14], SVM[15], XGBoost[16]) were constructed on the training set for binary classification of hepatic lesions (inflammatory vs malignant). Except for LR, all models were tuned by fivefold cross-validation combined with grid search, with AUC as the core optimization goal. After training, prediction performance was evaluated on the training, validation, and test sets using accuracy, AUC, sensitivity, specificity, precision, recall, F1 score, and area under the precision-recall curve (PR-AUC). Heat maps, receiver operating characteristic (ROC) curves, and PR curves were used to visualize the performance of the different models. Model performance differed across the datasets, which supports the use of multi-model comparative analysis in binary classification tasks.
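The tuning strategy described here (grid search scored by cross-validated AUC) can be sketched in a few lines. The example below is a standard-library Python illustration, not the mlr3 code used in the study: it tunes the neighbourhood size of a toy one-feature k-nearest-neighbour scorer, which is a stand-in and not one of the five study models. The synthetic data, the grid values, and all function names are assumptions.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def knn_score(x, train_x, train_y, k):
    """Fraction of positives among the k nearest training points (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def cv_auc(xs, ys, k, n_folds=5, seed=0):
    """Mean held-out AUC of the k-NN scorer over n_folds folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    aucs = []
    for held_out in folds:
        held = set(held_out)
        tr = [i for i in idx if i not in held]
        tr_x, tr_y = [xs[i] for i in tr], [ys[i] for i in tr]
        scores = [knn_score(xs[i], tr_x, tr_y, k) for i in held_out]
        aucs.append(auc(scores, [ys[i] for i in held_out]))
    return sum(aucs) / n_folds

# Synthetic data (not study data): one feature, shifted upward for the positive class.
rng = random.Random(1)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(1.0 if y else 0.0, 1.0) for y in ys]

grid = [1, 5, 15, 31]                         # candidate hyperparameter values
results = {k: cv_auc(xs, ys, k) for k in grid}
best_k = max(results, key=results.get)        # value with the highest mean CV AUC
print(best_k, round(results[best_k], 3))
```

The same loop generalizes to several hyperparameters by iterating over the Cartesian product of their candidate values.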

Endpoints

Primary endpoints: In the process of model training and efficacy evaluation, the output results of the five algorithm models on different datasets were significantly different; SHAP further revealed the positive driving and negative inhibitory effects of single variables on model predictions. Moreover, the performance comparison between the optimal model and traditional biomarkers was completed simultaneously.

Secondary endpoints: After analyzing differences in clinical characteristics in different groups, the model calibration and decision curve analysis (DCA) were further used to verify the clinical application potential and reliability of the models.

Statistical analysis

All statistical analyses in this study were conducted in R (v 4.3.1). First, all variables were tested for normality [shapiro.test () function], and appropriate descriptive and inferential statistical methods were selected accordingly. Continuous variables are expressed as the mean ± SD (normal distribution) or median (quartiles; skewed distribution); categorical variables are shown as n (%). For comparisons between groups, independent sample t-tests [t.test () function] were used for normally distributed continuous variables; otherwise, Mann-Whitney U tests [wilcox.test () function] were used. Categorical variables were compared using the χ² test [chisq.test ()] or, when the expected frequency was < 5, Fisher’s exact test [fisher.test ()]. To further analyze outcome determinants, a multivariate LR model [glm () function] was fitted, and the odds ratio and 95%CI of each variable were calculated. Discrimination was evaluated using the ROC curve [roc () function in pROC] and AUC [auc () function in pROC]. Model calibration was assessed with the Brier score [validate () function in rms] and calibration curves. DCA was conducted with the dca () function in the dcurves package to evaluate the clinical practical value of the models under different thresholds. SHAP was used to quantify the importance of selected features in model predictions and thereby improve model interpretability; visualization was performed using two functions of the SHAPforxgboost package [shap.plot.summary (), shap.plot.dependence ()]. All tests were two-tailed, and a P value < 0.05 denoted statistical significance. Results are presented as charts and model evaluation metrics.
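As a worked example of the χ² comparison, the sketch below computes the Pearson statistic for a 2 × 2 table using the cirrhosis counts reported in Table 2 (107 and 83 in one group, 27 and 156 in the other). It is a Python illustration rather than the R code used in the study, and the function name is hypothetical. The result reproduces the tabulated statistic without a continuity correction; in R this would appear to correspond to calling chisq.test () with correct = FALSE, which is an inference from the numbers and is not stated in the text.

```python
def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Cirrhosis row of Table 2: yes/no counts in the two training-set lesion groups.
chi2 = pearson_chi2_2x2(107, 83, 27, 156)
print(round(chi2, 3))  # 69.952
```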

RESULTS
Baseline information

We first analyzed baseline characteristics (hepatic inflammatory lesions vs malignant tumors). Sex, diabetes/hypertension history, echogenicity, lesion location/boundaries, and BMI were comparable between the groups (P > 0.05). However, we identified marked inter-group differences in lesion morphology, uniform enhancement, liver cirrhosis, hepatitis, blood flow signals, calcification, age, lesion size, wash-in/out time, AFP score, and MELD score. Specifically, the malignant tumor cohort exhibited markedly irregular lesion morphology (P < 0.001), heterogeneous enhancement, higher prevalence of cirrhosis and hepatitis, distinct blood flow signals, and differing calcification patterns (all P < 0.001). Moreover, older age, larger lesions, shorter wash-in/out times, elevated AFP score, and higher MELD score (all P < 0.001) characterized the malignant group (Supplementary Table 1).

Baseline patient parameter comparisons after dataset segmentation (training, validation, and test sets)

Post-split demographic and clinical comparisons revealed balanced distributions in sex, diabetes/hypertension history, echogenicity, lesion location, border clarity, lesion morphology, homogeneous enhancement, cirrhosis, hepatitis, blood flow signals, calcification, age, BMI, lesion size, wash-in/out time, AFP score, and MELD score across the training, validation, and test sets (all P > 0.05). Specifically, no significant multi-group differences were observed for sex (P = 0.793), diabetes history (P = 0.089), hypertension (P = 0.114), echogenicity (P = 0.582), lesion location (P = 0.995), border clarity (P = 0.590), lesion morphology (P = 0.284), homogeneous enhancement (P = 0.701), cirrhosis (P = 0.924), hepatitis status (P = 0.181), blood flow signals (P = 0.663), calcification status (P = 0.660), age (P = 0.351), BMI (P = 0.545), lesion size (P = 0.139), wash-in time (P = 0.245), wash-out time (P = 0.717), AFP score (P = 0.945), or MELD score (P = 0.664; Table 1).

Table 1 Comparative baseline measurements across datasets.

| Variable | Total | Training set (n = 373) | Validation set (n = 124) | Test set (n = 124) | Statistic | P value |
| --- | --- | --- | --- | --- | --- | --- |
| Sex | | | | | 0.464 | 0.793 |
|     Male | 417 (67.15) | 250 (67.02) | 81 (65.32) | 86 (69.35) | | |
|     Female | 204 (32.85) | 123 (32.98) | 43 (34.68) | 38 (30.65) | | |
| Diabetes history | | | | | 4.847 | 0.089 |
|     Yes | 69 (11.11) | 33 (8.85) | 18 (14.52) | 18 (14.52) | | |
|     No | 552 (88.89) | 340 (91.15) | 106 (85.48) | 106 (85.48) | | |
| Hypertension history | | | | | 4.351 | 0.114 |
|     Yes | 100 (16.10) | 51 (13.67) | 23 (18.55) | 26 (20.97) | | |
|     No | 521 (83.90) | 322 (86.33) | 101 (81.45) | 98 (79.03) | | |
| Echogenicity | | | | | 1.083 | 0.582 |
|     Mixed | 358 (57.65) | 216 (57.91) | 67 (54.03) | 75 (60.48) | | |
|     Other | 263 (42.35) | 157 (42.09) | 57 (45.97) | 49 (39.52) | | |
| Lesion location | | | | | 0.010 | 0.995 |
|     Left lobe | 497 (80.03) | 299 (80.16) | 99 (79.84) | 99 (79.84) | | |
|     Right lobe | 124 (19.97) | 74 (19.84) | 25 (20.16) | 25 (20.16) | | |
| Border clarity | | | | | 1.055 | 0.590 |
|     Ill-defined | 456 (73.43) | 275 (73.73) | 87 (70.16) | 94 (75.81) | | |
|     Well-defined | 165 (26.57) | 98 (26.27) | 37 (29.84) | 30 (24.19) | | |
| Morphology | | | | | 2.515 | 0.284 |
|     Irregular | 361 (58.13) | 219 (58.71) | 77 (62.10) | 65 (52.42) | | |
|     Regular | 260 (41.87) | 154 (41.29) | 47 (37.90) | 59 (47.58) | | |
| Homogeneous enhancement | | | | | 0.709 | 0.701 |
|     Yes | 390 (62.80) | 236 (63.27) | 80 (64.52) | 74 (59.68) | | |
|     No | 231 (37.20) | 137 (36.73) | 44 (35.48) | 50 (40.32) | | |
| Cirrhosis | | | | | 0.158 | 0.924 |
|     Yes | 226 (36.39) | 134 (35.92) | 45 (36.29) | 47 (37.90) | | |
|     No | 395 (63.61) | 239 (64.08) | 79 (63.71) | 77 (62.10) | | |
| Hepatitis | | | | | 3.419 | 0.181 |
|     Yes | 477 (76.81) | 296 (79.36) | 91 (73.39) | 90 (72.58) | | |
|     No | 144 (23.19) | 77 (20.64) | 33 (26.61) | 34 (27.42) | | |
| Blood flow signals | | | | | 4.104 | 0.663 |
|     0 | 221 (35.59) | 132 (35.39) | 41 (33.06) | 48 (38.71) | | |
|     1 | 153 (24.64) | 100 (26.81) | 27 (21.77) | 26 (20.97) | | |
|     2 | 157 (25.28) | 90 (24.13) | 37 (29.84) | 30 (24.19) | | |
|     3 | 90 (14.49) | 51 (13.67) | 19 (15.32) | 20 (16.13) | | |
| Calcification | | | | | 0.832 | 0.660 |
|     Yes | 303 (48.79) | 178 (47.72) | 65 (52.42) | 60 (48.39) | | |
|     No | 318 (51.21) | 195 (52.28) | 59 (47.58) | 64 (51.61) | | |
| Age (year) | 57.00 (51.00, 62.00) | 57.00 (51.00, 63.00) | 57.00 (52.00, 63.00) | 56.50 (51.00, 61.00) | 2.096 | 0.351 |
| Body mass index (kg/m²) | 23.01 ± 1.99 | 22.96 ± 2.01 | 22.96 ± 1.92 | 23.18 ± 2.00 | 0.607 | 0.545 |
| Lesion size (cm) | 4.10 (2.30, 6.00) | 3.80 (2.10, 5.90) | 4.50 (2.70, 6.03) | 4.05 (2.60, 6.05) | 3.948 | 0.139 |
| Wash-in time (second) | 23.00 (19.00, 27.00) | 23.00 (19.00, 27.00) | 22.50 (20.00, 25.00) | 22.00 (19.00, 27.00) | 2.816 | 0.245 |
| Wash-out time (second) | 135.00 (72.00, 212.00) | 137.00 (72.00, 210.00) | 135.50 (88.00, 211.75) | 127.00 (64.50, 215.50) | 0.666 | 0.717 |
| AFP | 9.55 (6.17, 17.97) | 9.70 (6.20, 17.27) | 9.13 (5.74, 18.25) | 9.36 (6.14, 18.41) | 0.114 | 0.945 |
| MELD score | 18.79 (15.78, 22.45) | 18.85 (15.98, 22.33) | 19.05 (15.14, 23.01) | 18.59 (15.68, 22.23) | 0.818 | 0.664 |

Comparative evaluation of baseline features of patients with cancerous vs inflammatory liver conditions in the training dataset

In the training cohort, no significant inter-group differences were observed for sex, diabetes/hypertension history, echogenicity, lesion location, border clarity, or BMI between patients with malignant lesions and those with inflammatory hepatic lesions (all P > 0.05). However, lesion morphology (more often irregular in malignancies; P = 0.016), homogeneous enhancement, comorbidities (cirrhosis and hepatitis), imaging markers (blood flow signals, calcification, wash-in/out time), demographic/clinical factors (age, lesion size), and biomarkers (AFP, MELD) differed significantly between the groups (all P < 0.001 unless otherwise stated; Table 2).

Table 2 Baseline characteristics of patients in the training group.

| Variable | Total (n = 373) | Malignant lesions (n = 190) | Inflammatory lesions (n = 183) | Statistic | P value |
| --- | --- | --- | --- | --- | --- |
| Sex | | | | 2.149 | 0.143 |
|     Male | 250 (67.02) | 134 (70.53) | 116 (63.39) | | |
|     Female | 123 (32.98) | 56 (29.47) | 67 (36.61) | | |
| Diabetes history | | | | 1.050 | 0.305 |
|     Yes | 33 (8.85) | 14 (7.37) | 19 (10.38) | | |
|     No | 340 (91.15) | 176 (92.63) | 164 (89.62) | | |
| Hypertension history | | | | 0.087 | 0.768 |
|     Yes | 51 (13.67) | 25 (13.16) | 26 (14.21) | | |
|     No | 322 (86.33) | 165 (86.84) | 157 (85.79) | | |
| Echogenicity | | | | 1.112 | 0.292 |
|     Mixed | 216 (57.91) | 105 (55.26) | 111 (60.66) | | |
|     Other | 157 (42.09) | 85 (44.74) | 72 (39.34) | | |
| Lesion location | | | | 1.487 | 0.223 |
|     Left lobe | 299 (80.16) | 157 (82.63) | 142 (77.60) | | |
|     Right lobe | 74 (19.84) | 33 (17.37) | 41 (22.40) | | |
| Border clarity | | | | 0.472 | 0.492 |
|     Ill-defined | 275 (73.73) | 143 (75.26) | 132 (72.13) | | |
|     Well-defined | 98 (26.27) | 47 (24.74) | 51 (27.87) | | |
| Morphology | | | | 5.797 | 0.016 |
|     Irregular | 219 (58.71) | 123 (64.74) | 96 (52.46) | | |
|     Regular | 154 (41.29) | 67 (35.26) | 87 (47.54) | | |
| Homogeneous enhancement | | | | 55.858 | < 0.001 |
|     Yes | 236 (63.27) | 155 (81.58) | 81 (44.26) | | |
|     No | 137 (36.73) | 35 (18.42) | 102 (55.74) | | |
| Cirrhosis | | | | 69.952 | < 0.001 |
|     Yes | 134 (35.92) | 107 (56.32) | 27 (14.75) | | |
|     No | 239 (64.08) | 83 (43.68) | 156 (85.25) | | |
| Hepatitis | | | | 35.315 | < 0.001 |
|     Yes | 296 (79.36) | 174 (91.58) | 122 (66.67) | | |
|     No | 77 (20.64) | 16 (8.42) | 61 (33.33) | | |
| Blood flow signals | | | | 111.642 | < 0.001 |
|     0 | 132 (35.39) | 20 (10.53) | 112 (61.20) | | |
|     1 | 100 (26.81) | 61 (32.11) | 39 (21.31) | | |
|     2 | 90 (24.13) | 72 (37.89) | 18 (9.84) | | |
|     3 | 51 (13.67) | 37 (19.47) | 14 (7.65) | | |
| Calcification | | | | 42.207 | < 0.001 |
|     Yes | 178 (47.72) | 122 (64.21) | 56 (30.60) | | |
|     No | 195 (52.28) | 68 (35.79) | 127 (69.40) | | |
| Age (year) | 57.41 ± 8.26 | 59.34 ± 9.16 | 55.41 ± 6.68 | -4.718 | < 0.001 |
| Body mass index (kg/m²) | 22.96 ± 2.01 | 22.97 ± 1.98 | 22.96 ± 2.04 | -0.032 | 0.974 |
| Lesion size (cm) | 3.80 (2.10, 5.90) | 4.95 (2.50, 7.38) | 3.20 (1.90, 4.70) | 5.105 | < 0.001 |
| Wash-in time (second) | 22.99 ± 5.25 | 21.24 ± 5.27 | 24.81 ± 4.59 | 6.963 | < 0.001 |
| Wash-out time (second) | 137.00 (72.00, 210.00) | 117.00 (64.00, 162.75) | 183.00 (95.50, 252.00) | 5.512 | < 0.001 |
| AFP | 9.70 (6.20, 17.27) | 17.07 (10.87, 23.52) | 6.50 (4.38, 8.82) | 13.616 | < 0.001 |
| MELD score | 18.85 (15.98, 22.33) | 21.88 (18.77, 25.53) | 16.34 (13.64, 18.98) | 11.503 | < 0.001 |

Hyperparameter tuning visualization analysis

Figure 1 illustrates the hyperparameter tuning outcomes for the four tuned ML algorithms (DT, RF, SVM, and XGBoost).

Figure 1 Visualization analysis of hyperparameter tuning results. A: Visualization diagram of decision tree hyperparameter tuning results; B: Graphical representation of random forest hyperparameter optimization; C: Support vector machine hyperparameter adjustment visualization; D: XGBoost hyperparameter adjustment diagram.
Performance evaluation of different classification models on training, validation, and test sets

In this study, the classification performance of the five models was evaluated. During training (Figure 2), XGBoost had the best classification accuracy (classif.acc = 0.94) and PR-AUC (classif.prauc = 0.99), whereas DT showed the worst Brier score (classif.bbrier = 0.09). Upon validation, the classification accuracy (classif.acc = 0.90) and AUC value (classif.auc = 0.95) of RF were the best, whereas the Brier score (classif.bbrier = 0.16) and AUC value (classif.auc = 0.84) of DT were the worst. In the test set, SVM and LR exhibited the best classification accuracy (classif.acc = 0.94) and PR-AUC (classif.prauc = 0.97), whereas DT had the worst Brier score (classif.bbrier = 0.13) and precision (classif.precision = 0.83).

Figure 2 Heatmap visualization of classification models’ performance across training, validation, and test sets. A: Training set; B: Validation set; C: Test set. SVM: Support vector machine; RF: Random forest; DT: Decision tree.
AUC-based performance evaluation of different classification models across datasets

Figure 3A shows the ROC curves of each model on the training set, where XGBoost showed the highest AUC (0.991), whereas DT had the lowest (0.925). Upon validation (Figure 3B), the AUC value of the LR was the highest at 0.958, whereas that of DT was the lowest at 0.844. In the independent test dataset (Figure 3C), the highest and lowest AUCs were determined in XGBoost (0.977) and DT (0.886), respectively. Model stability was evaluated by the differences between the training and validation datasets (train_val_diff) and validation and test datasets (val_test_diff), as well as the overall stability. The smallest difference was found in LR (0.00738), indicating the most stable performance among different datasets; DT exhibited the largest overall stability difference (0.0808). The order of stability was LR > SVM > RF > XGBoost > DT.
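For readers less familiar with how an ROC curve and its AUC are obtained from predicted probabilities, the following Python sketch sweeps the decision threshold and integrates the resulting curve with the trapezoidal rule. The scores and labels are hypothetical toy values, not study data, and the study itself used the pROC package in R.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy predicted probabilities of malignancy (hypothetical, not study data).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
roc_auc = trapezoid_auc(roc_points(scores, labels))
print(roc_auc)  # 0.875
```

In this toy example, 14 of the 16 positive-negative pairs are ranked correctly, which gives the same value (14/16 = 0.875).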

Figure 3 Receiver operating characteristic curves depicting model performance across different data partitions. A: Training dataset receiver operating characteristic (ROC) analysis; B: Validation dataset ROC analysis; C: Test dataset ROC analysis. AUC: Area under the curve; SVM: Support vector machine; RF: Random forest; DT: Decision tree.
Confusion matrices of different classification models on each dataset

To comprehensively evaluate the classification performance of different ML algorithms, we conducted systematic comparisons using the training, validation, and test sets (Supplementary Tables 2-4). During training, XGBoost showed the best performance (accuracy = 93.57%), followed successively by RF (92.49%), SVM (90.35%), DT (89.54%), and LR (89.28%). For each model, consistent accuracy across the validation and training sets was observed, indicating good model stability. In the independent test set, however, model performance changed to varying degrees: Both LR and SVM reached the highest accuracy of 94.35%, demonstrating favorable generalization ability; the accuracy of RF and XGBoost was 93.55% and 92.74%, respectively; DT displayed reduced accuracy (84.68%), with many misjudgments in inflammatory lesion prediction.
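The accuracy figures above follow directly from confusion-matrix counts; for example, 94.35% on the 124-case test set is consistent with 117 correct classifications. The Python sketch below shows how accuracy, sensitivity, specificity, precision, and F1 score are derived from such counts. The per-class split used here is hypothetical (the actual matrices are in Supplementary Tables 2-4), and the function name is an assumption.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary metrics, treating malignancy as the positive class."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical test-set confusion matrix (124 cases, 117 correct); the real
# per-class counts are in the Supplementary Tables and are not reproduced here.
m = classification_metrics(tp=60, fn=3, fp=4, tn=57)
print({k: round(v, 4) for k, v in m.items()})
```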

PR curve-based performance evaluation of different classification models on various datasets

Figure 4 shows the PR curves of the models. Upon training, the highest and lowest PR-AUC values were determined for XGBoost (0.9915) and LR (0.9610), respectively. In the validation set, LR had the most favorable PR-AUC value (0.965), whereas DT had the worst (0.8077). During testing, the PR-AUC value of XGBoost was the highest (0.9761), whereas that of DT was the lowest (0.8899). XGBoost and RF performed outstandingly in the training and test datasets, whereas DT demonstrated relatively poor performance during validation and testing.

Figure 4 Precision-recall performance of the classifiers on multiple datasets. A: Training dataset precision-recall (PR) curve; B: Validation dataset PR curve; C: Test dataset PR curve. AUC: Area under the curve; SVM: Support vector machine; RF: Random forest; DT: Decision tree.
Calibration curve-based model performance evaluation across datasets

Figure 5A shows the calibration curves of the models. On the training set, the Brier score of XGBoost was the lowest (0.0400), indicating the best calibration; the score of DT was 0.0892, indicating the worst calibration. During validation (Figure 5B), the lowest and highest Brier scores were found in XGBoost (0.0843) and DT (0.1589), respectively. Upon testing (Figure 5C), the Brier score of SVM was the lowest (0.0612), showing the best calibration performance, whereas the highest score (0.1303) was for DT, indicating the worst calibration. Overall, XGBoost was the best calibrated on the training and validation sets and SVM on the test set, whereas DT was the worst calibrated on all datasets.
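The Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome, so lower values are better and a constant prediction of 0.5 scores 0.25. A minimal Python sketch with hypothetical predictions (not study data):

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

labels = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.2, 0.1]   # hypothetical sharp, well-separated predictions
hedged = [0.6, 0.6, 0.4, 0.4]      # hypothetical less informative predictions
print(round(brier_score(confident, labels), 3), round(brier_score(hedged, labels), 3))
# 0.025 0.16
```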

Figure 5 Model calibration plots for different data partitions. A: Calibration performance on training data; B: Validation data calibration results; C: Test set calibration analysis. SVM: Support vector machine; RF: Random forest; DT: Decision tree.
DCA of different classification models on different sets

Figure 6 presents the DCA of LR, DT, RF, XGBoost, and SVM across datasets. All models showed a net benefit that varied with the threshold probability on the different datasets. During training, the DCA curves of all models showed relatively high net benefit (the highest exceeded 50%); the lowest net benefit was 17.2% for SVM and 5% for RF. On the validation set, the net benefit of LR, SVM, and RF was relatively stable within the 0%-100% threshold range, with the highest net benefit exceeding 46%; DT was effective within the 0%-87.6% threshold range, and XGBoost within the 0%-96% range. Upon testing, all models demonstrated good generalization ability, with net benefit exceeding 47% for all models. These results indicate that the decision-making performance of each model varies with the threshold; overall, however, all models showed good clinical utility, particularly in the test set.
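Decision curve analysis plots net benefit against the threshold probability, where net benefit at threshold pt is TP/n − FP/n × pt/(1 − pt) and is compared with the "treat all" reference strategy. The Python sketch below uses ten hypothetical cases (not study data); the study itself used the dcurves package in R, and the function names here are assumptions.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of treating every case whose predicted risk is >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat everyone regardless of the model."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical predictions for ten cases (not study data).
probs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
for t in (0.2, 0.5, 0.8):
    print(t, round(net_benefit(probs, labels, t), 3),
          round(net_benefit_treat_all(labels, t), 3))
```

At a threshold of 0.5, the toy model yields a net benefit of 0.3 compared with 0 for treating all cases. Because net benefit cannot exceed the event prevalence, values near 50% in a roughly balanced cohort lie close to the attainable maximum.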

Figure 6 Decision curve analysis of different classifiers. A: Training set decision curve analysis (DCA); B: Validation set DCA; C: Test set DCA. SVM: Support vector machine; RF: Random forest; DT: Decision tree.
SHAP value analysis and individual prediction example

Figure 7 shows the SHAP-based analysis of feature importance. The SHAP summary plot (Figure 7A) presents the relative importance of each clinical feature and the direction of its influence on the model output. In the plot, the feature value is encoded by a color gradient, with purple representing low values and yellow representing high values; the horizontal axis shows the SHAP value, reflecting the contribution of each feature to the prediction. The blood flow signal (Blood_Flow_Signal) and the wash-in time (Start_Increase_Time) had the greatest impact on the model prediction, with large positive or negative SHAP values; in contrast, age and boundary features had a relatively small impact. The single-sample analysis of sample 1 (Figure 7B) showed that each variable contributed differently to the prediction for that individual. The SHAP values of features such as Regression_Time and Morphology are visualized as a bar graph, in which the color depth corresponds to the strength of a contribution and the sign of the value indicates its direction. The detailed view (Figure 7C) lists each feature value and its corresponding SHAP value (e.g., a negative contribution of -0.203 for the blood flow signal and a positive contribution of +0.132 for the wash-in time). Together, these contributions determine the final prediction for this sample. This helps clinicians understand the decision-making logic of the model and offers a transparent assessment of feature importance in AI-based diagnosis.
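The additive logic behind a single-sample SHAP explanation can be illustrated with a small sketch. It assumes a model that is linear in log-odds (as LR is) and treats features as independent, in which case the SHAP value of a feature reduces to its coefficient multiplied by the deviation of the feature from its background mean. The coefficients, intercept, feature values, and the `linear_shap` helper are hypothetical; the article does not state which SHAP explainer was used, so this is not a reproduction of the authors' computation.

```python
def linear_shap(weights, x, background_means):
    """SHAP values of a linear model under feature independence:
    phi_i = w_i * (x_i - mean_i), expressed in log-odds units."""
    return {name: weights[name] * (x[name] - background_means[name]) for name in weights}


def linear_output(weights, intercept, x):
    """Log-odds output of the linear model for one sample."""
    return intercept + sum(weights[name] * x[name] for name in weights)


# Hypothetical coefficients and values, for illustration only.
weights = {"Blood_Flow_Signal": 1.2, "Start_Increase_Time": -0.4, "Age": 0.01}
intercept = -0.5
means = {"Blood_Flow_Signal": 1.0, "Start_Increase_Time": 15.0, "Age": 55.0}
sample = {"Blood_Flow_Signal": 2.0, "Start_Increase_Time": 12.0, "Age": 60.0}

phi = linear_shap(weights, sample, means)
baseline = linear_output(weights, intercept, means)
prediction = linear_output(weights, intercept, sample)

# Additivity: the contributions sum to the gap between this prediction and the baseline.
print(phi, round(prediction - baseline, 6))
```

The additivity check is the property that makes bar-style individual explanations readable: each bar is one term of the sum that moves the prediction away from the average case.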

Figure 7 SHapley Additive exPlanations analysis and prediction examples. A: SHapley Additive exPlanations summary plot; B: Example of individual prediction; C: Detailed information presentation. SHAP: SHapley Additive exPlanations.
ROC analysis of LR-based predictions and diagnostic accuracy

Figure 8 presents the ROC curves of the LR-based predictions across datasets. In the training set, the model achieved an AUC of 0.957, outperforming the AFP score (AUC = 0.908, P = 0.014) and MELD score (AUC = 0.844, P < 0.001). In the validation set, the model retained good discriminatory performance with an AUC of 0.958, surpassing the AFP score (AUC = 0.890, P = 0.026) and MELD score (AUC = 0.855, P < 0.001). When evaluated on an independent test set, the LR model achieved an AUC of 0.965, showing statistically significant improvements compared with the AFP score (AUC = 0.917, P = 0.047) and MELD score (AUC = 0.840, P < 0.001). The characteristic convex shape of all ROC curves confirms the excellent diagnostic power and stable performance of the model across diverse patient cohorts, making it clinically valuable for decision support.
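As a reminder of what the AUC values above measure, the sketch below computes the AUC as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen inflammatory case, counting ties as one half. The data are hypothetical, and the function does not reproduce the significance testing used for the reported P values.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = malignant) and model scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC: {auc:.2f}")
```

The pairwise form is quadratic in sample size, which is acceptable for cohorts of this scale; rank-based implementations give the same value more efficiently.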

Figure 8 Comparative assessment of diagnostic performance between Logistic Regression model and conventional biomarkers using receiver operating characteristic analysis. A: Receiver operating characteristic (ROC) curve evaluation in the training cohort; B: Validation set ROC curve analysis; C: Test set ROC curve evaluation. ROC: Receiver operating characteristic; AUC: Area under the curve; AFP: Alpha-fetoprotein; MELD: Model for End-Stage Liver Disease.
Subgroup analysis

We evaluated model performance in different patient populations through subgroup analysis. As shown in Table 3, all models had good discriminative power in the cirrhosis subgroup, with RF ranking best (AUC = 0.993, 95%CI: 0.977-1.000). XGBoost showed the smallest difference in AUC between the cirrhosis (AUC = 0.974) and non-cirrhosis (AUC = 0.972) subgroups. The performance of DT was markedly lower in the non-cirrhosis population (AUC = 0.774), reflecting limited generalizability. Table 4 presents the subgroup analysis by lesion size. Using a cut-off value of 4.0 cm, all models discriminated larger lesions slightly better than smaller ones. XGBoost achieved the best performance in the large lesion group (AUC = 0.988) and also performed well in the small lesion group (AUC = 0.971).

Table 3 Subgroup analysis by cirrhosis status (test set).
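| Model | Subgroup | n | AUC (95%CI) | Sensitivity | Specificity | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| Logistic | Cirrhosis | 47 | 0.979 (0.947-1.000) | 0.886 | 1.000 | 0.915 |
| Logistic | Non-cirrhosis | 77 | 0.943 (0.876-1.000) | 0.875 | 0.981 | 0.948 |
| DT | Cirrhosis | 47 | 0.937 (0.843-1.000) | 0.914 | 0.917 | 0.915 |
| DT | Non-cirrhosis | 77 | 0.774 (0.662-0.885) | 0.750 | 0.830 | 0.805 |
| RF | Cirrhosis | 47 | 0.993 (0.977-1.000) | 0.971 | 1.000 | 0.979 |
| RF | Non-cirrhosis | 77 | 0.938 (0.878-0.997) | 0.750 | 1.000 | 0.922 |
| XGBoost | Cirrhosis | 47 | 0.974 (0.932-1.000) | 0.971 | 0.917 | 0.957 |
| XGBoost | Non-cirrhosis | 77 | 0.972 (0.943-1.000) | 0.958 | 0.887 | 0.909 |
| SVM | Cirrhosis | 47 | 0.979 (0.944-1.000) | 0.971 | 0.917 | 0.957 |
| SVM | Non-cirrhosis | 77 | 0.940 (0.868-1.000) | 0.875 | 0.981 | 0.948 |
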
Table 4 Subgroup analysis by lesion size (test set).
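| Model | Subgroup (cm) | n | AUC (95%CI) | Sensitivity | Specificity | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| Logistic | Large (≥ 4.0) | 62 | 0.983 (0.955-1.000) | 0.906 | 1.000 | 0.952 |
| Logistic | Small (< 4.0) | 62 | 0.950 (0.891-1.000) | 0.889 | 0.971 | 0.936 |
| DT | Large (≥ 4.0) | 62 | 0.887 (0.805-0.969) | 0.875 | 0.833 | 0.855 |
| DT | Small (< 4.0) | 62 | 0.882 (0.795-0.968) | 0.815 | 0.857 | 0.839 |
| RF | Large (≥ 4.0) | 62 | 0.984 (0.963-1.000) | 0.938 | 0.967 | 0.952 |
| RF | Small (< 4.0) | 62 | 0.956 (0.906-1.000) | 0.815 | 1.000 | 0.919 |
| XGBoost | Large (≥ 4.0) | 62 | 0.988 (0.968-1.000) | 0.969 | 0.967 | 0.968 |
| XGBoost | Small (< 4.0) | 62 | 0.971 (0.940-1.000) | 0.963 | 0.886 | 0.919 |
| SVM | Large (≥ 4.0) | 62 | 0.983 (0.959-1.000) | 0.938 | 0.967 | 0.952 |
| SVM | Small (< 4.0) | 62 | 0.950 (0.886-1.000) | 0.889 | 0.971 | 0.936 |
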
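The sensitivity, specificity, and accuracy columns in Tables 3 and 4 follow from a confusion matrix within each subgroup. The sketch below shows the arithmetic. The counts are an assumption: they were chosen because they are consistent with the logistic-regression row for the cirrhosis subgroup (n = 47), but the article does not report the underlying confusion matrices.

```python
def subgroup_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the subgroup")
    total = tp + fn + tn + fp
    return {
        "n": total,
        "sensitivity": tp / (tp + fn),   # malignant lesions correctly flagged
        "specificity": tn / (tn + fp),   # inflammatory lesions correctly cleared
        "accuracy": (tp + tn) / total,
    }


# Assumed counts (not reported in the source) that match the published rounded values.
metrics = subgroup_metrics(tp=31, fn=4, tn=12, fp=0)
print({k: round(v, 3) for k, v in metrics.items()})
```

With subgroups this small, a single reclassified case shifts sensitivity or specificity by several percentage points, which is worth keeping in mind when comparing rows.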
Statistical comparison of models

AUCs were compared between models using the DeLong test (Supplementary Table 2). Compared with DT, LR had a higher AUC (difference = 0.079, Z = 2.926, P = 0.003). However, there was no statistically significant difference in AUC between LR and RF, XGBoost, or SVM (all P > 0.05), indicating that the discrimination ability of these models was similar. The McNemar test was used to compare the agreement of model classifications (Supplementary Table 3), and its results were consistent with those of the DeLong test. The classifications of LR and DT differed significantly (χ² = 5.500, P = 0.019); nevertheless, there were no significant differences between the classification performance of RF, XGBoost, and SVM (P > 0.05).
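The McNemar comparison can be sketched as follows. The test uses only the discordant pairs: cases that one model classified correctly and the other did not. The counts `b` and `c` below are hypothetical and were picked only so that the continuity-corrected statistic equals 5.5; the article does not report the discordant counts, and whether the authors applied the continuity correction is an assumption here. The P value conversion uses the chi-square distribution with one degree of freedom.

```python
import math


def mcnemar_chi2(b, c, continuity=True):
    """McNemar statistic from discordant counts.
    b: cases model A got right and model B got wrong; c: the reverse."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction, floored at zero
    return diff ** 2 / (b + c)


def chi2_p_value_df1(stat):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical discordant counts (not from the source).
b, c = 17, 5
stat = mcnemar_chi2(b, c)
p_value = chi2_p_value_df1(stat)
print(f"chi2 = {stat:.3f}, P = {p_value:.3f}")
```

A statistic of 5.500 corresponds to P of about 0.019 under this distribution, which agrees with the value reported for LR versus DT.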

Feature importance analysis

The cross-model comparison of feature importance covered three algorithms, namely LR, RF, and XGBoost (Supplementary Table 4). Blood_Flow_Signal was the top-ranked feature for LR and RF, and Regression_Time was the top-ranked feature for XGBoost. Crucially, Blood_Flow_Signal, Lesion_Size, and Liver_Cirrhosis ranked highly in all three models, indicating that they have stable predictive value in distinguishing malignant lesions from inflammatory liver lesions.

DCA

The complete DCA results are shown in Supplementary Table 5. Each model reached its maximum net benefit at the 0% threshold and maintained a positive net benefit across the 0%-99% threshold range. XGBoost showed the highest average net benefit in the 0%-50% threshold range (0.428), whereas LR performed best in the higher threshold range (50%-100%; average net benefit: 0.350). The average net benefit of DT was the lowest in the higher threshold range (0.176), in line with its relatively poor discrimination.
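The net benefit underlying these DCA figures is conventionally defined as the true-positive rate per patient minus the false-positive rate per patient weighted by the odds of the threshold probability. The sketch below implements that standard definition and a simple average over a grid of thresholds. The data are hypothetical, and the averaging scheme is an assumption, since the article does not describe how average net benefit was computed.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit at one threshold: TP/n - FP/n * pt / (1 - pt)."""
    if not 0 <= threshold < 1:
        raise ValueError("threshold must be in [0, 1)")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def average_net_benefit(y_true, y_prob, thresholds):
    """Unweighted mean of net benefit over a list of thresholds (assumed scheme)."""
    return sum(net_benefit(y_true, y_prob, t) for t in thresholds) / len(thresholds)


# Hypothetical labels (1 = malignant) and predicted probabilities.
labels = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.4, 0.1]
nb_at_zero = net_benefit(labels, probs, 0.0)
nb_at_half = net_benefit(labels, probs, 0.5)
nb_low_range = average_net_benefit(labels, probs, [0.0, 0.2, 0.5])
print(nb_at_zero, nb_at_half, round(nb_low_range, 4))
```

At a threshold of 0%, every patient is treated as positive and false positives carry no weight, so the net benefit equals the prevalence of malignancy in the dataset.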

DISCUSSION

ML models based on multi-parametric CEUS features were successfully established in this study. They were effective in distinguishing inflammatory liver lesions from malignant tumors and may be useful in clinical practice. On the independent test set, LR achieved an AUC of 0.965 and an accuracy of 94.35%, exceeding the traditional indexes AFP score (AUC = 0.917) and MELD score (AUC = 0.840). Hence, by integrating multi-parametric CEUS signatures, ML can enhance liver disease diagnostic accuracy to better inform clinical decision-making. Hu et al[17] also confirmed the superiority of CEUS-derived ultrasonomics in differentiating benign from malignant liver lesions, with the established nomogram model achieving an AUC of 91.4% (equivalent to the expert level). Liao et al[11] further verified the value of CEUS in liver lesion differential diagnosis. Their CEUS feature-based nomogram excelled in distinguishing HCC from hepatic inflammatory pseudotumors, with AUCs in the training (0.989) and validation (0.984) sets far surpassing those of the simple ultrasound biomarker score (0.938 and 0.958) and the ultrasound physician’s diagnosis (0.794 and 0.832).

Notably, we identified the key role played by blood flow signals and wash-in time in model decision-making through SHAP analysis. As an important indicator of angiogenesis and blood perfusion, blood flow signals are usually rich and disordered in malignant tumors, but relatively rare in inflammatory lesions. The wash-in time difference, essentially reflecting distinct contrast-agent enhancement kinetics of different lesion types, provides clinicians with important differential diagnosis clues. Color parametric imaging of CEUS is capable of detecting and presenting subtle features of liver lesions; a nomogram constructed based on it significantly improves the performance of CEUS for liver lesion diagnosis (AUC = 0.937)[18].

Traditional diagnosis of liver diseases mainly depends on serological markers and routine imaging examinations, methods that have obvious limitations in distinguishing inflammatory diseases from malignant tumors. Although the AFP score is a classic liver cancer (LC) marker, it is limited in terms of sensitivity and specificity, especially for certain pathological subtypes and early-stage lesions[19]. The GADSAH model developed by Long et al[20] for AFP-negative liver space-occupying lesions further verified the limitations of traditional markers. This model constructed based on various indicators is effective in diagnosing AFP-negative cases (AUC = 0.905). Literature shows that using traditional tumor markers (e.g., AFP, carbohydrate antigen 19-9) separately has limitations in LC diagnosis; instead, the comprehensive detection of AFP and carbohydrate antigen 19-9 plus CEUS and enhanced computed tomography (CT) is highly effective in improving the sensitivity and accuracy of LC diagnosis, providing an important means for early diagnosis[21]. Although the MELD score is valuable in evaluating liver function, it lacks sufficient discriminant power in distinguishing lesion nature. The present study is innovative for the following reasons. First, the multi-dimensional parameter information of CEUS is systematically integrated, covering morphological characteristics, hemodynamic parameters, and time-intensity characteristics, thereby enabling the construction of a more comprehensive feature system. Second, model stability and generalization ability are ensured by adopting the comparative analysis strategies of five classical ML algorithms and strict hyperparameter tuning and cross-validation. Third, the SHAP method is introduced to solve the “black box” problem of ML models, which provides an interpretable decision-making basis and enhances the trust of clinicians in AI-driven diagnoses. 
Sonazoid-enhanced Kupffer phase imaging has been indicated to excel in distinguishing well-differentiated HCC from atypical benign liver lesions (AUC = 0.912), while also significantly reducing unnecessary biopsies[22]. Yao et al[23] showed the potential of radiomics analysis based on multi-modal ultrasound images in liver lesion diagnosis. They constructed a radiomics system based on sparse representation algorithms and SVM, obtaining an AUC value of 0.94 in the classification of benign and malignant liver lesions and an AUC value of 0.97 in identifying malignant subtypes. This evidence confirms the effectiveness of the multi-parameter integration method. Compared with previous investigations, the present study revealed a higher level of diagnostic efficiency. The model constructed by Wang et al[24] using CT radiomics provides an important reference for this field. Compared with that model, the diagnostic performance of our model is more outstanding, which further demonstrates its innovative value and application prospect.

The important advantages of this study in technical methods lie in the scientific construction of a multi-model integration strategy and the innovative application of SHAP interpretable analysis. Through systematic comparison of LR, DT, RF, XGBoost and SVM models, we verified the robustness of the results and provided a basis for model selection in different clinical scenarios. Ma et al[25] also adopted the research paradigm of multi-ML algorithms, covering SVM, K-nearest neighbors, RF, XGBoost, light gradient boosting machine, and multilayer perceptron. In the differential diagnosis of HCC and non-HCC, it was found that K-nearest neighbors performed best in the radiomics model, while SVM had the highest diagnostic accuracy (0.824) in the combined model. This research idea is similar to the multi-algorithm comparison method used in this study. In this study, the difference in the discriminant performance of each model was quantified by the DeLong test, laying a statistical foundation for comparison between models[26]. The analysis results showed that the AUC of LR is significantly higher than that of DT (difference = 0.079, Z = 2.926, P = 0.003). In contrast, there was no statistically significant difference in AUC between LR and RF, XGBoost and SVM (all P > 0.05). The McNemar test also showed statistically significant differences between LR and DT (χ2 = 5.500, P = 0.019).

LR performed significantly better than DT and was not inferior to the more complex RF and XGBoost models. This is mainly attributed to the following three reasons. First, in this diagnostic task, key predictors such as wash-in time and blood flow grading show a relatively high linear correlation with the malignant nature of the lesion, which aligns well with the linear decision boundary of LR. Second, in the analysis of medium-sized clinical samples, complex non-linear models like RF or XGBoost are prone to capturing noise, leading to overfitting. In contrast, LR is inherently more resistant to overfitting in such scenarios[27]. Third, the L1/L2 regularization applied to LR penalizes excessive complexity, improving model stability across training and test datasets. Although the structure of LR is simple, its excellent performance, high degree of generalizability, and the best stability measured in this study (overall stability difference = 0.00738) make it the preferred model for clinical application.

Shen et al[28] also demonstrated the advantages of multi-modal integration; the prediction model (AUC = 0.978) constructed by the team included sonazoid CEUS, acoustic radiation force pulse imaging, and clinical features, notably outperforming single-modality models. The combination of CEUS characteristics and AFP score in the classification of HCC subtypes is able to effectively predict the tumor cluster pattern surrounded by blood vessels and the subtype of trabecular-massive HCC[29], in which AFP levels, no enhancement in tumor, and CEUS blood flow pattern are independent predictors of vessels that encapsulate tumor clusters-HCC.

To further evaluate clinical applicability of these models in different patient populations, we conducted cirrhosis status- and lesion size-based subgroup analyses. In the cirrhosis subgroup, RF showed the best performance (AUC = 0.993, 95%CI: 0.977-1.000), while XGBoost exhibited the most consistent performance (AUC difference = 0.002). Of note, DT showed a significant decrease in patients with non-cirrhosis (AUC = 0.774), indicating that its universality in different patient groups is limited. This finding echoes previous research results suggesting that in heterogeneous clinical populations, the integrated method often performs better than a single DT[30]. Subgroup analysis was carried out based on a lesion size cut-off value of 4.0 cm. The results showed that the efficiency of each model is slightly superior in larger lesions, which is closely related to the more significant imaging features of large tumors. Beyond the overall AUC, our model maintains a balanced sensitivity and specificity across subgroups. As shown in Table 3, the LR model achieved a high sensitivity (0.886) in the cirrhosis subgroup, which is critical for reducing the missed diagnosis rate of early-stage malignancies. Meanwhile, its high specificity in non-cirrhotic patients helps avoid unnecessary invasive biopsies for inflammatory lesions, thereby optimizing the balance between diagnostic accuracy and clinical intervention risks.

A recent meta-analysis of ML algorithms focusing on focal liver lesion classification also reached a similar conclusion, confirming that the performance of the model was affected by differences in lesion characteristics[31]. The subgroup analyses carried out in this study provide instructive evidence-based insights for the clinical application of related technologies. Moreover, they show that the selection strategy of models should be adapted to the specific characteristics of patients.

SHAP-interpretable analysis provided important technical support for the model interpretation of this study. Zhou et al[32] developed an ultrasound-driven deep learning model to distinguish HCC from other malignant tumors in liver cirrhosis cases. This model achieved an AUC value of 0.74 on the test set. Though with non-inferior diagnostic efficacy to magnetic resonance imaging (MRI) liver imaging reporting and data system category M, the ultrasonography-based deep learning models with clinical features model incorporating clinical factors demonstrates the benefit of multi-parameter integration, evidenced by an AUC increase to 0.81.

When differentiating liver lesions, imaging clues are critical in clinical diagnosis. For example, centrifugal perfusion, blood supply arteries, and mosaic-like alterations are more characteristic of malignancies compared with more centripetal perfusion and peripheral nodular enhancement of benign lesions[18]. To clarify the key predictors in different models, we compared the feature importance of the LR, RF, and XGBoost models horizontally. Blood_Flow_Signal is the most important feature in LR and RF, in contrast to Regression_Time in XGBoost. Although the three algorithms are based on different principles, Blood_Flow_Signal, Lesion_Size, and Liver_Cirrhosis remain at the top of the importance list, showing that they have stable and strong prediction efficiency. This consistent rule among different models strongly confirms the biological significance of related characteristics in distinguishing benign from malignant liver lesions. The prominent manifestations of blood flow signals are consistent with pathophysiological cognition; specifically, there are many chaotic angiogenesis phenomena in malignant tumors, and the heterogeneity of their vascular structures eventually leads to unique blood perfusion characteristics[33]. The same trend is noted in lesion size, as larger lesions are often more likely to show malignant characteristics. The consistency of feature ranking among different algorithms improves the clinical interpretability and credibility of the results of this study, as well as solves the controversial “black box” problem of ML in medical application[34].

The AI-aided diagnosis models constructed in this study show important clinical application value. First, the models can improve the diagnostic accuracy for liver lesions, reducing misdiagnosis and missed diagnosis while retaining high recognition efficiency, especially for cases with early onset and atypical symptoms. Second, the evaluation of feature importance based on SHAP can point out the core diagnostic basis for clinicians and help optimize the decision-making process of diagnosis and treatment. To further verify the clinical utility of the models, this study introduced DCA to systematically evaluate their net benefits under different threshold probabilities[35]. Across the 0%-99% threshold range, all models achieved positive net benefits, highlighting their clinical application value. The clinical significance of these results lies in providing a strategic roadmap for model selection based on specific medical priorities. In the low threshold range (0%-50%), XGBoost led with an average net benefit of 0.428, which is particularly suitable for screening scenarios that prioritize sensitivity in order to capture all potential malignant cases. In the high threshold range (50%-100%), LR was the best choice (average net benefit = 0.350), which supports its use in settings where high diagnostic confidence is required before initiating invasive treatments such as surgery or ablation. 
The average net benefit of DT in the high threshold range was only 0.176, the lowest among all models, which is consistent with the limited discrimination ability of the model itself. The above DCA results can provide a practical basis for model selection in specific clinical scenarios, and help clinicians achieve a reasonable balance between false positive and false negative risks[36]. With the help of VueBox, Lian et al[37] carried out quantitative research on CEUS perfusion-related parameters, and constructed a robust LR model capable of predicting HCC pathological classification. The AUC values of the training and validation sets reached 0.831 and 0.811, respectively, which further verified the application value of quantitative CEUS parameter analysis in improving diagnostic efficiency. From the perspective of optimal allocation of medical resources, the application of AI-aided diagnosis is expected to reduce the proportion of unnecessary invasive examinations, as well as reduce the medical expenses of patients, thus alleviating their economic burden. The implementation of standardized AI diagnostic methods will help improve the overall level of diagnostic work, and narrow the gap in diagnosis and treatment faced by primary medical institutions and resource-poor areas.

As previously reported[38], the customized Chat Generative Pre-Trained Transformer (ChatGPT) model integrating CEUS characteristics and clinical risk factors reaches 90.3% accuracy in diagnosing HCC in non-high-risk populations (AUC = 0.89), which is significantly superior to the general ChatGPT model. This study shows the potential of AI technology in the standardization and popularization of medical diagnosis. In addition, the high model accuracy and interpretability provide a valuable tool for training and accumulation of clinical experience by young physicians. As a real-time, non-radiation, and repeatable examination method, CEUS has a good clinical promotion prospect when combined with ML. Compared with CT and MRI, CEUS features low cost, convenient operation, and high patient acceptance, rendering it more suitable for routine screening and follow-up monitoring. Sonazoid-enhanced ultrasound has been shown to possess important supplementary diagnostic value in patients with uncertain liver lesions on gadoxetic acid-enhanced MRI[39]. Although it does not markedly enhance the overall diagnostic efficiency, it can increase the confidence of radiologists in classifying liver lesions, with a net reclassification improvement of 0.473. Contrast vector imaging developed by Yoo et al[40] better displayed tumor vascular structure and blood flow characteristics by post-processing arterial-phase CEUS images. It displays an AUC range of 0.851-0.963 in the probability judgment of HCC, which is far superior to traditional CEUS (AUC = 0.853), providing a new direction for the further development of CEUS technology. Compared to emerging diagnostic tools based on proteomics or genomics, which offer high molecular precision but are limited by high costs and long processing times, our CEUS-based ML model provides a more cost-effective, real-time, and repeatable alternative. 
While multi-omics integration is the future of precision medicine, the accessibility and non-invasive nature of CEUS-AI make it a more practical tool for routine clinical screening and long-term follow-up, particularly in resource-limited primary healthcare settings.

This study verified the superior performance of LR based on multi-parameter characteristics of CEUS in the differential diagnosis of hepatic inflammatory lesions and malignant tumors (AUC = 0.957 on the training set and 0.965 on the test set). Combined with the SHAP method, it provided high accuracy and interpretability. Furthermore, it was shown that features such as blood flow signals and wash-in time provide important clues for clinical diagnosis.

Nevertheless, the limitations of this study should be acknowledged. First, due to the single-center retrospective design of this investigation, it is difficult to ensure the representativeness of the samples; consequently, the extrapolation range of the research conclusions is limited. The differences in equipment configuration and operating specifications among medical institutions may also weaken the universality of the model. In addition, the imaging quality of CEUS and the consistency of feature extraction are closely related to the professional ability of the operator. CEUS is highly operator-dependent; different contrast agent injection speeds, choices of imaging parameters, and subjective interpretations of semi-quantitative features (such as blood flow grading or enhancement uniformity) can easily lead to differences in evaluation among observers, thus affecting the reproducibility of the model in clinical practice[41]. To mitigate this, future research should integrate automated quantitative perfusion analysis software to minimize human subjectivity. Furthermore, although our subgroup analysis demonstrated robust performance across different clinical backgrounds, the model’s sensitivity and specificity showed slight fluctuations in patients with liver cirrhosis. This suggests that in complex hemodynamic environments, the diagnostic weight of certain CEUS parameters may need dynamic adjustment. While we performed internal validation using a training-validation-test split, and used the DeLong and McNemar tests to statistically compare the LR model with the others (it was significantly superior to DT), external validation using data from independent institutions would further strengthen the generalizability of our findings. 
As demonstrated by Yu et al[42], the vast majority of deep learning algorithms for radiologic diagnosis show diminished performance on external datasets, with some reporting substantial decreases in AUC values. Therefore, multi-center prospective studies are essential before widespread clinical implementation. The application of traditional ML algorithms has not yet fully exploited the potential of deep learning. Future research should focus on multi-center prospective verification, increasing the number of centers to verify the stability and generalization of the model across heterogeneous patient populations.

CONCLUSION

In this study, the LR model constructed based on multi-parametric CEUS signatures showed excellent performance in the differential diagnosis of hepatic inflammatory lesions and malignant tumors, far outperforming the traditional biomarkers AFP score and MELD score. Through SHAP interpretability analysis, the key role of the blood flow signal and wash-in time is revealed, which provides reliable diagnostic clues for clinical practice. The results of subgroup verification showed that the models constructed in this study exhibit stable efficiency in different patient subgroups based on cirrhosis status and lesion size; XGBoost has the best consistency among subgroups. DeLong and McNemar tests showed that LR is significantly superior to DT in performance (P = 0.003 and 0.019, respectively), with its efficiency being non-inferior to those of RF, XGBoost, and SVM. Subsequent validation data from DCA showed that the models can exert stable effectiveness for clinical application within a wide threshold probability interval, laying a reliable foundation for model selection in different clinical applications.

References
1. Leung PB, Davis AM, Kumar S. Diagnosis and Management of Nonalcoholic Fatty Liver Disease. JAMA. 2023;330:1687-1688.
2. Nguyen TB, Do DN, Nguyen TTP, Nguyen TL, Nguyen-Thanh T, Nguyen HT. Immune-related biomarkers shared by inflammatory bowel disease and liver cancer. PLoS One. 2022;17:e0267358.
3. Ginès P, Krag A, Abraldes JG, Solà E, Fabrellas N, Kamath PS. Liver cirrhosis. Lancet. 2021;398:1359-1376.
4. Kawashima J, Akabane M, Khalil M, Woldesenbet S, Endo Y, Sahara K, Ruzzenente A, Ratti F, Marques HP, Oliveira S, Balaia J, Cauchy F, Lam V, Poultsides GA, Kitago M, Popescu I, Martel G, Gleisner A, Hugh T, Weiss M, Aucejo F, Aldrighetti L, Endo I, Pawlik TM. Model of End-Stage Liver Disease-alpha-fetoprotein-tumor burden (MELD-AFP-TBS) score to stratify prognosis after liver resection for hepatocellular carcinoma. Surgery. 2025;183:109388.
5. Minami Y, Sugimoto K, Kuroda H, Kamiyama N, Ogawa C, Kudo M. Differentiating between hepatocellular carcinoma and its mimickers using contrast-enhanced ultrasound with perflubutane microbubbles. Expert Rev Med Devices. 2025;22:817-826.
6. Fan PL, Ding H, Mao F, Chen LL, Dong Y, Wang WP. Enhancement patterns of small hepatocellular carcinoma (≤ 30 mm) on contrast-enhanced ultrasound: Correlation with clinicopathologic characteristics. Eur J Radiol. 2020;132:109341.
7. Cerrito L, Ainora ME, Di Francesco S, Galasso L, Gasbarrini A, Zocco MA. The Role of Contrast-Enhanced Ultrasound (CEUS) in the Detection of Neoplastic Portal Vein Thrombosis in Patients with Hepatocellular Carcinoma. Tomography. 2023;9:1976-1986.
8. Xu W, Zhang H, Zhang R, Zhong X, Li X, Zhou W, Xie X, Wang K, Xu M. Deep learning model based on contrast-enhanced ultrasound for predicting vessels encapsulating tumor clusters in hepatocellular carcinoma. Eur Radiol. 2025;35:989-1000.
9. Qin X, Hu X, Xiao W, Zhu C, Ma Q, Zhang C. Preoperative Evaluation of Hepatocellular Carcinoma Differentiation Using Contrast-Enhanced Ultrasound-Based Deep-Learning Radiomics Model. J Hepatocell Carcinoma. 2023;10:157-168.
10. Zhang Y, Wei Q, Huang Y, Yao Z, Yan C, Zou X, Han J, Li Q, Mao R, Liao Y, Cao L, Lin M, Zhou X, Tang X, Hu Y, Li L, Wang Y, Yu J, Zhou J. Deep Learning of Liver Contrast-Enhanced Ultrasound to Predict Microvascular Invasion and Prognosis in Hepatocellular Carcinoma. Front Oncol. 2022;12:878061.
11. Liao M, Wang C, Zhang B, Jiang Q, Liu J, Liao J. Distinguishing Hepatocellular Carcinoma From Hepatic Inflammatory Pseudotumor Using a Nomogram Based on Contrast-Enhanced Ultrasound. Front Oncol. 2021;11:737099.
12. Alsebaey A, Sabry A, Rashed HS, Elsabaawy MM, Ragab A, Aly RA, Badran H. MELD-Sarcopenia is Better than ALBI and MELD Score in Patients with Hepatocellular Carcinoma Awaiting Liver Transplantation. Asian Pac J Cancer Prev. 2021;22:2005-2009.
13.  Arabameri A, Yamani M, Pradhan B, Melesse A, Shirani K, Tien Bui D. Novel ensembles of COPRAS multi-criteria decision-making with logistic regression, boosted regression tree, and random forest for spatial prediction of gully erosion susceptibility. Sci Total Environ. 2019;688:903-916.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 80]  [Cited by in RCA: 40]  [Article Influence: 5.7]  [Reference Citation Analysis (0)]
14.  Hu J, Szymczak S. A review on longitudinal data analysis with random forest. Brief Bioinform. 2023;24:bbad002.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in RCA: 325]  [Reference Citation Analysis (0)]
15.  de Lima MD, de Oliveira Roque E Lima J, Barbosa RM. Medical data set classification using a new feature selection algorithm combined with twin-bounded support vector machine. Med Biol Eng Comput. 2020;58:519-528.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 24]  [Cited by in RCA: 17]  [Article Influence: 2.8]  [Reference Citation Analysis (0)]
16.  Sharma K, Tiwari PK, Sinha SK. Estimation of Hematocrit Volume Using Blood Glucose Concentration through Extreme Gradient Boosting Regressor Machine Learning Model. J Chem Inf Model. 2025;65:1736-1746.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 6]  [Reference Citation Analysis (0)]
17.  Hu HT, Li MD, Zhang JC, Ruan SM, Wu SS, Lin XX, Kang HY, Xie XY, Lu MD, Kuang M, Xu EJ, Wang W. Ultrasomics differentiation of malignant and benign focal liver lesions based on contrast-enhanced ultrasound. BMC Med Imaging. 2024;24:242.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in RCA: 2]  [Reference Citation Analysis (0)]
18.  Liang ZN, Wang S, Yang W, Wang H, Zhao K, Bai XM, Zhang ZY, Wu W, Yan K. The added value of color parameter imaging for the evaluation of focal liver lesions with "homogenous hyperenhancement and no wash out" on contrast enhanced ultrasound. Front Oncol. 2023;13:1207902.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 1]  [Reference Citation Analysis (0)]
19.  Yuan JJ, Xu YD, Li H, Guo QJ, Li GC, Chai W, Zhang ZQ, Liu RH. Magnetic Resonance Imaging and Serum AFP-L3 and GP-73 in the Diagnosis of Primary Liver Cancer. J Oncol. 2022;2022:1192368.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in RCA: 6]  [Reference Citation Analysis (0)]
20.  Long X, Zeng H, Zhang Y, Lu Q, Cao Z, Shu H. Development of a Reliable GADSAH Model for Differentiating AFP-negative Hepatic Benign and Malignant Occupying Lesions. J Hepatocell Carcinoma. 2024;11:607-618.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in RCA: 2]  [Reference Citation Analysis (0)]
21.  Kong Y, Jing Y, Sun H, Zhou S. The Diagnostic Value of Contrast-Enhanced Ultrasound and Enhanced CT Combined with Tumor Markers AFP and CA199 in Liver Cancer. J Healthc Eng. 2022;2022:5074571.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in RCA: 9]  [Reference Citation Analysis (0)]
22.  Wang Z, Yao J, Jing X, Li K, Lu S, Yang H, Ding H, Li K, Cheng W, He G, Jiang T, Liu F, Yu J, Han Z, Cheng Z, Tan S, Wang Z, Qi E, Wang S, Zhang Y, Li L, Dong X, Liang P, Yu X. A combined model based on radiomics features of Sonazoid contrast-enhanced ultrasound in the Kupffer phase for the diagnosis of well-differentiated hepatocellular carcinoma and atypical focal liver lesions: a prospective, multicenter study. Abdom Radiol (NY). 2024;49:3427-3437.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 8]  [Cited by in RCA: 7]  [Article Influence: 3.5]  [Reference Citation Analysis (0)]
23.  Yao Z, Dong Y, Wu G, Zhang Q, Yang D, Yu JH, Wang WP. Preoperative diagnosis and prediction of hepatocellular carcinoma: Radiomics analysis based on multi-modal ultrasound images. BMC Cancer. 2018;18:1089.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 113]  [Cited by in RCA: 101]  [Article Influence: 12.6]  [Reference Citation Analysis (2)]
24.  Wang G, Kang B, Cui J, Deng Y, Zhao Y, Ji C, Wang X. Two nomograms based on radiomics models using triphasic CT for differentiation of adrenal lipid-poor benign lesions and metastases in a cancer population: an exploratory study. Eur Radiol. 2023;33:1873-1883.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 7]  [Cited by in RCA: 9]  [Article Influence: 3.0]  [Reference Citation Analysis (0)]
25.  Ma Y, Gong Y, Qiu Q, Ma C, Yu S. Research on multi-model imaging machine learning for distinguishing early hepatocellular carcinoma. BMC Cancer. 2024;24:363.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in RCA: 10]  [Reference Citation Analysis (0)]
26.  DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837-845.  [PubMed]  [DOI]
27.  Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12-22.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 1585]  [Cited by in RCA: 1198]  [Article Influence: 171.1]  [Reference Citation Analysis (2)]
28.  Shen Q, Wu W, Wang R, Zhang J, Liu L. A non-invasive predictive model based on multimodality ultrasonography images to differentiate malignant from benign focal liver lesions. Sci Rep. 2024;14:23996.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 3]  [Reference Citation Analysis (1)]
29.  Xu W, Huang B, Zhang R, Zhong X, Zhou W, Zhuang S, Xie X, Fang J, Xu M. Diagnostic and Prognostic Ability of Contrast-Enhanced Unltrasound and Biomarkers in Hepatocellular Carcinoma Subtypes. Ultrasound Med Biol. 2024;50:617-626.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 6]  [Reference Citation Analysis (0)]
30.  Turco S, Tiyarattanachai T, Ebrahimkheil K, Eisenbrey J, Kamaya A, Mischi M, Lyshchik A, Kaffas AE. Interpretable Machine Learning for Characterization of Focal Liver Lesions by Contrast-Enhanced Ultrasound. IEEE Trans Ultrason Ferroelectr Freq Control. 2022;69:1670-1681.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 21]  [Cited by in RCA: 28]  [Article Influence: 7.0]  [Reference Citation Analysis (0)]
31.  Campello CA, Castanha EB, Vilardo M, Staziaki PV, Francisco MZ, Mohajer B, Watte G, Moraes FY, Hochhegger B, Altmayer S. Machine learning for malignant versus benign focal liver lesions on US and CEUS: a meta-analysis. Abdom Radiol (NY). 2023;48:3114-3126.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 3]  [Reference Citation Analysis (0)]
32.  Zhou H, Jiang T, Li Q, Zhang C, Zhang C, Liu Y, Cao J, Sun Y, Jin P, Luo J, Pan M, Huang P. US-Based Deep Learning Model for Differentiating Hepatocellular Carcinoma (HCC) From Other Malignancy in Cirrhotic Patients. Front Oncol. 2021;11:672055.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 2]  [Cited by in RCA: 7]  [Article Influence: 1.4]  [Reference Citation Analysis (0)]
33.  Schwarz S, Clevert DA, Ingrisch M, Geyer T, Schwarze V, Rübenthaler J, Armbruster M. Quantitative Analysis of the Time-Intensity Curve of Contrast-Enhanced Ultrasound of the Liver: Differentiation of Benign and Malignant Liver Lesions. Diagnostics (Basel). 2021;11:1244.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 1]  [Cited by in RCA: 14]  [Article Influence: 2.8]  [Reference Citation Analysis (0)]
34.  Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts HJWL. Artificial intelligence in radiology. Nat Rev Cancer. 2018;18:500-510.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 3148]  [Cited by in RCA: 2105]  [Article Influence: 263.1]  [Reference Citation Analysis (9)]
35.  Vickers AJ, Woo S. Decision curve analysis in the evaluation of radiology research. Eur Radiol. 2022;32:5787-5789.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 7]  [Cited by in RCA: 27]  [Article Influence: 6.8]  [Reference Citation Analysis (0)]
36.  Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3:18.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 901]  [Cited by in RCA: 792]  [Article Influence: 113.1]  [Reference Citation Analysis (0)]
37.  Lian SM, Cheng HJ, Li HJ, Wang H. Construction of nomogram model based on contrast-enhanced ultrasound parameters to predict the degree of pathological differentiation of hepatocellular carcinoma. Front Oncol. 2025;15:1519703.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 1]  [Reference Citation Analysis (0)]
38.  Xian MF, Lan WT, Zhang Z, Li MD, Lin XX, Huang Y, Huang H, Chen LD, Huang QH, Wang W. Enhancing hepatocellular carcinoma diagnosis in non-high-risk patients: a customized ChatGPT model integrating contrast-enhanced ultrasound. Radiol Med. 2025;130:1013-1023.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 3]  [Reference Citation Analysis (0)]
39.  Ban JY, Kang TW, Jeong WK, Lee MW, Park B, Song KD. Value of Sonazoid-enhanced ultrasonography in characterizing indeterminate focal liver lesions on gadoxetic acid-enhanced liver MRI in patients without risk factors for hepatocellular carcinoma. PLoS One. 2024;19:e0304352.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 3]  [Reference Citation Analysis (0)]
40.  Yoo J, Lee JM, Joo I, Yoon JH. Contrast vector imaging for differential diagnosis of focal liver lesions: Analysis of tumoral vascular structures and flow characteristics. PLoS One. 2024;19:e0314263.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in RCA: 2]  [Reference Citation Analysis (0)]
41.  Kelly BS, Judge C, Bollard SM, Clifford SM, Healy GM, Aziz A, Mathur P, Islam S, Yeom KW, Lawlor A, Killeen RP. Radiology artificial intelligence: a systematic review and evaluation of methods (RAISE). Eur Radiol. 2022;32:7998-8007.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 216]  [Cited by in RCA: 153]  [Article Influence: 38.3]  [Reference Citation Analysis (0)]
42.  Yu AC, Mohajer B, Eng J. External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review. Radiol Artif Intell. 2022;4:e210064.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 3]  [Cited by in RCA: 238]  [Article Influence: 59.5]  [Reference Citation Analysis (0)]
Footnotes

Peer review: Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Gastroenterology and hepatology

Country of origin: China

Peer-review report’s classification

Scientific quality: Grade B, Grade B

Novelty: Grade C, Grade C

Creativity or innovation: Grade B, Grade C

Scientific significance: Grade B, Grade C

P-Reviewer: Nakajima K, PhD, Japan; Poelmann FB, PhD, United States; S-Editor: Lin C; L-Editor: A; P-Editor: Zhang L
