Copyright ©The Author(s) 2021.
World J Hepatol. Oct 27, 2021; 13(10): 1417-1427
Published online Oct 27, 2021. doi: 10.4254/wjh.v13.i10.1417
Table 1 Baseline characteristics of participants in training and testing data
| | Training data (n = 2265) | Testing data (n = 970) | P value |
| Demographic | | | |
| Age (yr) | 43 (29) | 43.5 (28) | 0.328 |
| Gender (male) (%) | 944 (41.68) | 428 (44.12) | 0.197 |
| Race/ethnicity | | | |
| White (non-Hispanic) (%) | 959 (42.34) | 392 (40.41) | 0.308 |
| Black (non-Hispanic) (%) | 627 (27.68) | 271 (27.94) | 0.882 |
| Mexican American (%) | 576 (25.43) | 254 (26.19) | 0.652 |
| Others (%) | 103 (4.55) | 53 (5.46) | 0.265 |
| Body measurement | | | |
| Body mass index (kg/m²) | 26.4 (7.2) | 26.7 (7.4) | 0.120 |
| Waist circumference (cm) | 93 (20.5) | 93.5 (20.8) | 0.182 |
| Biochemistry tests | | | |
| Iron (μg/dL) | 73 (39) | 74 (39) | 0.098 |
| Total iron-binding capacity (μg/dL) | 355 (72) | 356 (72) | 0.450 |
| Transferrin saturation (%) | 20.5 (11.1) | 20.8 (11.8) | 0.329 |
| Ferritin (ng/mL) | 87 (125) | 84.5 (124) | 0.508 |
| Cholesterol (mg/dL) | 201 (57) | 204 (59) | 0.155 |
| Triglyceride (mg/dL) | 120 (100.25) | 122.5 (102) | 0.562 |
| HDL cholesterol (mg/dL) | 48 (18) | 48.5 (18) | 0.585 |
| C-reactive protein (mg/dL) | 0.21 (0.29) | 0.21 (0.23) | 0.686 |
| Uric acid (mg/dL) | 5 (1.9) | 5.1 (2) | 0.427 |
| Liver chemistry | | | |
| Aspartate aminotransferase (U/L) | 19 (8) | 19 (7) | 0.908 |
| Alanine aminotransferase (U/L) | 14 (10) | 14 (10) | 0.581 |
| Gamma glutamyl transferase (U/L) | 21 (18) | 21 (18) | 0.787 |
| Alkaline phosphatase (U/L) | 83 (33) | 81 (32) | 0.524 |
| Total bilirubin (mg/dL) | 0.5 (0.2) | 0.5 (0.2) | 0.855 |
| Total protein (g/dL) | 7.4 (0.6) | 7.4 (0.6) | 0.559 |
| Albumin (g/dL) | 4.1 (0.5) | 4.1 (0.4) | 0.543 |
| Serum globulin (g/dL) | 3.3 (0.6) | 3.3 (0.7) | 0.941 |
| Diabetes testing profile | | | |
| Glycated hemoglobin (%) | 5.4 (0.8) | 5.4 (0.7) | 0.075 |
| Fasting plasma glucose (mg/dL) | 91.6 (12.52) | 92.05 (12.2) | 0.726 |
| Fasting C-peptide (pmol/mL) | 0.65 (0.68) | 0.66 (0.69) | 0.746 |
| Fasting insulin (μU/mL) | 9.36 (9.51) | 9.73 (10.04) | 0.378 |
| Diabetes medication (%) | 165 (7.28) | 68 (7.01) | 0.782 |
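
The P values for the categorical rows of Table 1 can be sanity-checked from the counts alone. The sketch below assumes an uncorrected Pearson chi-square test on a 2x2 table; the page does not state which test the authors used, so that choice and the helper name `pearson_chi2_2x2` are illustrative assumptions. Applied to the gender row, it gives a P value of roughly 0.20, in line with the reported 0.197.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for observed, row, col in ((a, row1, col1), (b, row1, col2),
                               (c, row2, col1), (d, row2, col2)):
        expected = row * col / n  # expected count under independence
        stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Gender row of Table 1: male counts versus the remaining participants.
train_n, test_n = 2265, 970
train_male, test_male = 944, 428
stat, p = pearson_chi2_2x2(train_male, train_n - train_male,
                           test_male, test_n - test_male)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

The same helper works for any other count row (for example, diabetes medication) by swapping in that row's counts.
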
Table 2 The performance comparison of machine learning models on training data
| No. | Description | Accuracy (%) | AUC | PPV/precision (%) | NPV (%) | Sensitivity/recall (%) | Specificity (%) | F1 |
| 1 | Fine tree | 71.6 | 0.64 | 42.9 | 79.8 | 37.8 | 83.0 | 0.40 |
| 2 | Medium tree | 74.4 | 0.70 | 48.9 | 79.1 | 30.1 | 89.4 | 0.37 |
| 3 | Coarse tree | 76.0 | 0.68 | 55.1 | 78.9 | 26.4 | 92.7 | 0.36 |
| 4 | Linear discriminant | 78.0 | 0.75 | 61.1 | 80.9 | 35.5 | 92.4 | 0.45 |
| 5 | Logistic regression | 78.1 | 0.75 | 62.2 | 80.6 | 33.9 | 93.0 | 0.44 |
| 6 | Gaussian naïve Bayes | 75.1 | 0.74 | 50.8 | 81.1 | 40.2 | 86.8 | 0.45 |
| 7 | Kernel naïve Bayes | 72.7 | 0.73 | 46.8 | 85.1 | 60.1 | 76.9 | 0.53 |
| 8 | Linear SVM | 77.0 | 0.74 | 64.4 | 78.1 | 19.9 | 96.3 | 0.30 |
| 9 | Quadratic SVM | 77.4 | 0.70 | 59.9 | 80.1 | 31.8 | 92.8 | 0.42 |
| 10 | Cubic SVM | 72.8 | 0.64 | 45.1 | 79.6 | 35.3 | 85.5 | 0.40 |
| 11 | Fine Gaussian SVM | 74.7 | 0.67 | | 74.7 | | 100.0 | |
| 12 | Medium Gaussian SVM | 77.5 | 0.74 | 63.9 | 79.0 | 25.3 | 95.2 | 0.36 |
| 13 | Coarse Gaussian SVM | 75.7 | 0.74 | 66.2 | 76.0 | 7.9 | 98.6 | 0.14 |
| 14 | Fine KNN | 68.9 | 0.58 | 38.0 | 78.9 | 36.9 | 79.7 | 0.37 |
| 15 | Medium KNN | 76.5 | 0.71 | 59.7 | 78.1 | 21.0 | 95.2 | 0.31 |
| 16 | Coarse KNN | 76.6 | 0.75 | 78.1 | 76.5 | 10.0 | 99.1 | 0.18 |
| 17 | Cosine KNN | 76.6 | 0.72 | 57.9 | 79.2 | 27.6 | 93.2 | 0.37 |
| 18 | Cubic KNN | 77.0 | 0.72 | 62.0 | 78.5 | 22.6 | 95.3 | 0.33 |
| 19 | Weighted KNN | 76.5 | 0.71 | 56.7 | 79.4 | 28.8 | 92.6 | 0.38 |
| 20 | Ensemble of boosted trees | 76.9 | 0.74 | 57.3 | 80.3 | 33.6 | 91.6 | 0.42 |
| 21 | Ensemble of bagged trees | 77.2 | 0.74 | 58.9 | 80.2 | 32.5 | 92.3 | 0.42 |
| 22 | Ensemble of subspace discriminant | 78.3 | 0.76 | 66.7 | 79.7 | 28.3 | 95.2 | 0.40 |
| 23 | Ensemble of subspace KNN | 75.5 | 0.69 | 54.7 | 77.2 | 16.4 | 95.4 | 0.25 |
| 24 | Ensemble of RUS boosted trees | 70.4 | 0.76 | 44.2 | 86.3 | 66.4 | 71.7 | 0.53 |
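
The F1 column in Table 2 is the harmonic mean of the PPV/precision and sensitivity/recall columns. As a minimal sketch (the function name `f1_from_percent` is hypothetical, and the three rows are simply copied from the table), the relationship can be checked like this:

```python
def f1_from_percent(precision_pct, recall_pct):
    """Harmonic mean of precision and recall given in percent; returns a 0-1 score."""
    p, r = precision_pct / 100.0, recall_pct / 100.0
    if p + r == 0:
        return 0.0  # convention used here when both inputs are zero
    return 2 * p * r / (p + r)

# (PPV %, sensitivity %, reported F1) copied from Table 2.
rows = {
    "Fine tree": (42.9, 37.8, 0.40),
    "Kernel naive Bayes": (46.8, 60.1, 0.53),
    "Ensemble of RUS boosted trees": (44.2, 66.4, 0.53),
}
for name, (ppv, recall, reported) in rows.items():
    f1 = f1_from_percent(ppv, recall)
    print(f"{name}: computed F1 = {f1:.2f}, reported F1 = {reported:.2f}")
```

Because the table reports rounded percentages, a recomputed F1 can differ from the printed one in the last digit for some rows.
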
Table 3 The performance of machine learning models and other non-alcoholic fatty liver disease indices on testing data
| No. | Description | Accuracy (%) | AUC | PPV/precision (%) | NPV (%) | Sensitivity/recall (%) | Specificity (%) | F1 |
| Machine learning models | | | | | | | | |
| 1 | Ensemble of subspace discriminant | 77.7 | 0.78 | 66.7 | 78.8 | 23.7 | 96 | 0.35 |
| 2 | Coarse trees | 74.9 | 0.72 | 50.8 | 78.3 | 24.5 | 92 | 0.33 |
| 3 | Ensemble of RUS boosted trees | 71.1 | 0.79 | 45.5 | 88.4 | 72.7 | 70.6 | 0.56 |
| NAFLD indices | | | | | | | | |
| 4 | Fatty liver index | 68.6 | 0.74 | 42.4 | 86.6 | 68.6 | 68.6 | 0.52 |
| 5 | Hepatic steatosis index | 65.1 | 0.70 | 37.9 | 83.3 | 60.4 | 66.6 | 0.47 |
| 6 | Triglyceride and glucose index | 56.9 | 0.69 | 34.8 | 88.3 | 80.8 | 48.8 | 0.49 |
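
For orientation, the three NAFLD indices in Table 3 are closed-form scores built from routine measurements. The sketch below uses their commonly published definitions; this page does not reproduce the formulas or the cutoffs used in the study, so treat the coefficients and function names as assumptions rather than the authors' exact implementation. The inputs are the training-data medians from Table 1, combined into one hypothetical profile purely for illustration.

```python
import math

def fatty_liver_index(tg_mg_dl, bmi, ggt_u_l, waist_cm):
    """Fatty liver index (0-100), commonly cited logistic form."""
    z = (0.953 * math.log(tg_mg_dl) + 0.139 * bmi
         + 0.718 * math.log(ggt_u_l) + 0.053 * waist_cm - 15.745)
    return 100.0 / (1.0 + math.exp(-z))

def hepatic_steatosis_index(alt_u_l, ast_u_l, bmi, female, diabetes):
    """Hepatic steatosis index: 8 * ALT/AST + BMI, +2 if female, +2 if diabetic."""
    return (8.0 * alt_u_l / ast_u_l + bmi
            + (2.0 if female else 0.0) + (2.0 if diabetes else 0.0))

def tyg_index(tg_mg_dl, glucose_mg_dl):
    """Triglyceride and glucose index: ln(fasting TG * fasting glucose / 2)."""
    return math.log(tg_mg_dl * glucose_mg_dl / 2.0)

# Hypothetical profile assembled from Table 1 training medians (not a real participant).
fli = fatty_liver_index(tg_mg_dl=120, bmi=26.4, ggt_u_l=21, waist_cm=93)
hsi = hepatic_steatosis_index(alt_u_l=14, ast_u_l=19, bmi=26.4,
                              female=False, diabetes=False)
tyg = tyg_index(tg_mg_dl=120, glucose_mg_dl=91.6)
print(f"FLI = {fli:.1f}, HSI = {hsi:.1f}, TyG = {tyg:.2f}")
```

Turning each score into a yes/no prediction, as the accuracy and sensitivity columns require, also needs a cutoff, which is not given here and is therefore left out of the sketch.
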
Table 4 The performance comparison of published machine learning models on non-alcoholic fatty liver disease prediction
| Ref. | Type of data/country or territory of data | Number of training/external testing data | Model | Accuracy (%) | AUC | Sensitivity (%) | Specificity (%) | F1 |
| Sorino et al[33], 2020 | Population/Italy | 2920/50 | Support vector machine | 68¹ | N/A | 98.5 | 100 | N/A |
| Wu et al[13], 2019 | Hospital/Taiwan | 577/NA | Random forest | 86.5¹ | | 87.2¹ | 85.9¹ | N/A |
| Islam et al[36], 2018 | Hospital/Taiwan | 994/NA | Logistic regression | 70¹ | | 74.1¹ | 64.9¹ | N/A |
| Ma et al[12], 2018 | Hospital/China | 10508/NA | Bayesian network | 82.92¹ | N/A | 67.5¹ | 87.8¹ | 0.655¹ |
| Perveen et al[14], 2018 | Primary care network/Canada | 64%/34% of | Decision trees | N/A | 0.73 | 73 | N/A | 0.67 |
| Yip et al[15], 2017 | Hospital/Hong Kong | 500/442 | Ridge regression | 87 | 0.87 | 92 | 90 | N/A |
| Birjandi et al[37], 2016 | Hospital/Iran | 359/1241 | Decision trees | 75 | 0.75 | 73 | 77 | N/A |
| Our study | Population based/United States | 2265/970 | Ensemble of RUS boosted trees | 71.1 | 0.79 | 72.7 | 70.6 | 0.56 |
| | | | Coarse trees | 74.9 | 0.72 | 24.5 | 92 | 0.33 |
- Citation: Atsawarungruangkit A, Laoveeravat P, Promrat K. Machine learning models for predicting non-alcoholic fatty liver disease in the general United States population: NHANES database. World J Hepatol 2021; 13(10): 1417-1427
- URL: https://www.wjgnet.com/1948-5182/full/v13/i10/1417.htm
- DOI: https://dx.doi.org/10.4254/wjh.v13.i10.1417
