Retrospective Study Open Access
Copyright: ©Author(s) 2026. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license. No commercial re-use. See permissions. Published by Baishideng Publishing Group Inc.
World J Nephrol. Mar 25, 2026; 15(1): 116879
Published online Mar 25, 2026. doi: 10.5527/wjn.v15.i1.116879
Prediction of graft outcomes after kidney transplantation: When standard statistics compare to machine learning techniques
Carolina Salgado, Francisca Gonzalez Cohens, Felipe A Vera, Rocío Ruiz, Juan D Velasquez, Web Intelligence Centre, Faculty of Physics and Mathematical Sciences, Universidad de Chile, Santiago 7500922, Chile
Fernando M Gonzalez, Department of Nephrology, Faculty of Medicine, Universidad de Chile, Santiago 7500922, Chile
ORCID number: Carolina Salgado (0009-0005-4535-3461); Francisca Gonzalez Cohens (0000-0002-7703-4730); Felipe A Vera (0000-0001-8042-2542); Rocío Ruiz (0000-0001-7730-174X); Juan D Velasquez (0000-0003-4819-9680); Fernando M Gonzalez (0000-0003-2742-5220).
Author contributions: Salgado C built and ran all the models; Gonzalez Cohens F and Gonzalez FM revised the methodology, model performance, and interpretability; Salgado C, Gonzalez Cohens F, and Gonzalez FM wrote the manuscript; Vera FA, Ruiz R, and Velasquez JD reviewed, edited, and approved the manuscript.
Supported by a public grant from Agencia Nacional De Investigación Y Desarrollo, No. ID23I10232.
Institutional review board statement: The Comité Ético Científico del Servicio de Salud Metropolitano Oriente reviewed and approved the study (No. 29092020).
Informed consent statement: Written informed consent was obtained from all kidney transplant recipients upon their inclusion on the transplant centers’ waiting lists, authorizing the use of their anonymized clinical data.
Conflict-of-interest statement: All the authors report no relevant conflicts of interest for this article.
Data sharing statement: Data can be shared upon request.
Corresponding author: Fernando M Gonzalez, MD, Department of Nephrology, Faculty of Medicine, Universidad de Chile, Avenida Salvador 486, Providencia, Santiago 7500922, Chile. fgonzalf@uc.cl
Received: November 24, 2025
Revised: January 12, 2026
Accepted: February 9, 2026
Published online: March 25, 2026
Processing time: 111 Days and 7.3 Hours

Abstract
BACKGROUND

Over the last decade, the use of machine learning (ML) techniques in problem modeling and solving has increased significantly, including in kidney transplantation. Numerous studies have used ML to predict outcomes such as delayed graft function (DGF). This study compares various ML models with logistic regression (LR) in predicting DGF, focusing on donor characteristics.

AIM

To compare various ML models with LR in predicting DGF, focusing on donor characteristics.

METHODS

We analyzed 523 deceased donor kidney transplants performed between 2010 and 2020 across three transplant centers. The dataset included 14 donor, 3 transplant, and 64 recipient features. Four problem types were defined based on variable combinations: Donor-only, donor + transplant, donor + recipient, and donor + transplant + recipient. The dataset comprised 43.5% DGF-positive and 56.5% DGF-negative patients, split into 80% for training and 20% for validation/testing. Six ML models - support vector machine, decision trees, random forest (RF), gradient boost (GB), extreme gradient boost (XGB), and multilayer perceptron - were compared with LR. Hyperparameters were optimized using random search and 10-fold cross-validation. Accuracy was the primary performance metric.

RESULTS

The best-performing model for each problem type achieved accuracies of 70% (RF), 70% (RF), 58% (RF), and 61% (XGB) for donor-only, donor + transplant, donor + recipient, and donor + transplant + recipient, respectively. LR achieved accuracies of 57%, 66%, 52%, and 66%; however, these models generally showed low sensitivity and high specificity. Across most models, significant predictors included donor creatinine, age, and mean blood pressure; cold ischemia time (a transplant variable); and recipient smoking status.

CONCLUSION

While most ML models outperformed LR, the differences were not substantial. This may be attributed to the small dataset size, which likely contributed to the overall poor performance. We recommend using these complex models with high-quality datasets that include a sufficient number of variables and observations to fully leverage their potential. The key question for future research is determining the dataset size required for ML to become the primary analytic tool for predicting kidney transplant outcomes.

Key Words: Delayed graft function; Prediction; Logistic regression; Machine learning; Artificial intelligence

Core Tip: Machine learning (ML) is increasingly used in kidney transplantation research, including for predicting delayed graft function (DGF). This study compares six ML models with logistic regression across four combinations of donor, transplant, and recipient variables. The dataset comprises 43.5% DGF-positive cases. All methods performed similarly, with accuracies between 58% and 70%. Important predictors included donor creatinine, age, and mean blood pressure, cold ischemia time, and recipient smoking status. Although the ML approaches slightly outperformed logistic regression, overall performance remained modest, likely due to the limited sample size. Further research should define the dataset scale and quality needed for ML to become a primary analytic tool for predicting kidney transplant outcomes.



INTRODUCTION

Kidney transplantation is the standard treatment for end-stage renal disease. Its superior outcomes compared with hemodialysis and peritoneal dialysis have led to growing waiting lists worldwide, where new patients are added faster than they can receive a donor kidney[1]. As waiting lists grow and donor organs remain limited, selecting the most appropriate recipients and optimizing graft survival become increasingly important[2]. Accurately predicting key clinical outcomes following kidney transplantation is therefore a central goal for treating physicians. Such predictive capabilities are essential for improving patient prognosis, optimizing kidney allocation (particularly for organs from extended-criteria donors), and supporting clinical decision-making[1].

To meet this need, statistical models have become essential tools for predicting post-transplant outcomes. However, many commonly used models rely on linear assumptions and may fail to capture the complex, nonlinear interactions between clinical variables[3]. In the past decade, advances in computational power and data availability have enabled the development of more sophisticated analytical tools capable of analyzing large datasets and uncovering patterns often missed by traditional statistical models. These techniques, broadly referred to as machine learning (ML), can handle high-dimensional data, model nonlinear relationships, and improve predictive accuracy, making them particularly suitable for clinical outcome prediction and survival analysis[3].

In this study, we evaluate the ability of several ML algorithms to predict delayed graft function (DGF) after deceased-donor kidney transplantation. Specifically, we compare their performance with that of traditional logistic regression (LR) using donor, transplant, and recipient characteristics as predictor variables.

MATERIALS AND METHODS
Methods

This study draws on registry data from three kidney transplant centers in Chile, covering 765 deceased donor kidney transplants performed between 1989 and 2020. However, because significant differences were found in clinical variables across decades (Kruskal-Wallis tests at the 95% confidence level), we restricted the analysis to the 523 observations from the 2010-2020 decade. This study was approved by the Institutional Review Board (Comité Ético Científico del Servicio de Salud Metropolitano Oriente). Written informed consent was obtained from all kidney transplant recipients upon their inclusion on the transplant centers’ waiting lists, authorizing the use of their anonymized clinical data.

The dataset included 14 donor, 3 transplant, and 64 recipient features (predictors, or independent variables). Nevertheless, most of them had a high number of missing values. To balance data imputation against variable reduction, we included only variables with 50% or fewer missing values[4]. For those, we imputed missing values with the IterativeImputer from the scikit-learn library. The outcome variable, DGF, was defined as any dialysis procedure performed during the first week after the kidney transplantation surgery. We applied this strict definition to minimize bias arising from the treating physicians’ judgment regarding whether the procedure was performed for ultrafiltration only, depuration only, or both, and to avoid more flexible definitions based on the number of dialysis sessions performed. Four predictor combinations were used to model the occurrence of DGF: Donor-only (D), donor + transplant (DT), donor + recipient (DR), and donor + transplant + recipient (DTR). We chose this approach to determine which feature sets were sufficient to achieve good results.
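The variable-selection and imputation steps above can be sketched as follows; the column names and toy values are hypothetical, since the registry data are not public:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for part of the donor feature matrix (hypothetical names).
df = pd.DataFrame({
    "donor_age": [42, 55, np.nan, 31, 60],
    "donor_creatinine": [0.9, np.nan, 1.2, 0.7, np.nan],
    "donor_mbp": [np.nan, np.nan, np.nan, np.nan, 85.0],  # 80% missing
})

# Keep only variables with 50% or fewer missing values, as in the study.
keep = df.columns[df.isna().mean() <= 0.5]
df = df[keep]  # donor_mbp is dropped here

# Impute the remaining gaps with scikit-learn's IterativeImputer.
imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df),
    columns=df.columns,
)
print(imputed.isna().sum().sum())  # no missing values remain
```

The 50% threshold is applied before imputation so that columns dominated by imputed values never enter the models.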

ML-based model development

The dataset comprised 43.5% DGF-positive and 56.5% DGF-negative patients. It was split into 80% for training and 20% for validation/testing. We used LR[5] together with six ML models to predict DGF after kidney transplantation: support vector machine[6], decision trees[7], random forest (RF)[8], gradient boosting (GB)[9], extreme gradient boost (XGB)[10], and multilayer perceptron[11]. For LR, we followed a univariate-multivariate approach: we first fit univariate LRs, then built multivariate LRs starting from the predictors that were significant (at the 95% level) in the univariate models. We then applied iterative reduction, removing non-significant variables one by one while checking for confounding effects, until the model contained only significant explanatory variables.
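The model comparison can be sketched with scikit-learn as below; synthetic data stands in for the registry, with class proportions and a feature count (14 donor + 3 transplant) mirroring the DT problem. XGBoost is omitted to keep the sketch free of third-party dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 523 samples, 17 features, ~56.5% negative class.
X, y = make_classification(n_samples=523, n_features=17,
                           weights=[0.565], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "DET": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
}

# Fit each model on the 80% split and score accuracy on the 20% hold-out.
accs = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
        for name, m in models.items()}
for name, acc in accs.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

The same loop structure applies to each of the four predictor combinations (D, DT, DR, DTR), changing only the feature matrix.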

For all models, hyperparameters were optimized using random search and 10-fold cross-validation. Model performance was primarily assessed by the area under the receiver operating characteristic curve (AUC-ROC), as this metric is robust to class imbalance and has been used in previous studies on DGF prediction, enabling direct comparison with prior results. Additional performance metrics included accuracy, defined as the proportion of correctly classified instances, as well as sensitivity (also known as recall) and specificity, chosen for their straightforward interpretability. To identify influential variables, we used permutation feature importance and Shapley values for the ML models. Differences in performance between models (or model types) and datasets were assessed with Kruskal-Wallis tests at 95% confidence, because the limited sample size of this experiment may violate the normality assumptions required by parametric tests (such as analysis of variance).
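The tuning and feature-importance steps can be sketched as follows; the search grid is illustrative (not the one actually used), the data are synthetic, and Shapley values are omitted because they require the third-party shap package.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=523, n_features=17, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

# Random search over an illustrative hyperparameter grid, scored by
# AUC-ROC with 10-fold cross-validation on the training split.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 10)},
    n_iter=10, cv=10, scoring="roc_auc", random_state=0)
search.fit(X_tr, y_tr)

# Permutation feature importance of the tuned model on the hold-out split.
pfi = permutation_importance(search.best_estimator_, X_te, y_te,
                             n_repeats=10, random_state=0)
top = np.argsort(pfi.importances_mean)[::-1][:5]
print("Best CV AUC-ROC:", round(search.best_score_, 2))
print("Top 5 feature indices:", top.tolist())
```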

RESULTS

The following sections describe the clinical characteristics of each predictor set and present the results of the LR and ML models.

Donor independent variables

The donors were mostly male (59.1%) and 42.6 ± 14.6 years old; they died mainly of stroke (39.2%) or trauma (33.4%), 16.8% were considered extended-criteria donors, and most did not have hypertension (77.1%). Table 1 shows a summary of all the available characteristics, together with their completeness in the dataset (i.e., the proportion of observations with an available value).

Table 1 Clinical characteristics of kidney donors.
Donor feature | Summary statistics, mean ± SD (range) | Completeness (%)
Sex | Female: 40.9%; male: 59.1% | 94
Age (years) | 42.6 ± 14.6 (4-73) | 98
Weight (kg) | 74.5 ± 17.3 (15-180) | 19
Height (cm) | 166.7 ± 10.8 (97-190) | 19
Body mass index (kg/m2) | 26.6 ± 4.8 (16-56) | 19
Cause of death | Hemorrhagic stroke: 39.2%, trauma: 33.4%, ischemic stroke: 11.7%, other: 11.0%, anoxia: 4.7% | 100
Extended criteria donor (ECD) | Yes: 16.8%, no: 83.2% | 100
Blood type | O: 55.8%, A: 31.8%, B: 11.4%, AB: 1.1% | 98
Hypertension | Yes: 22.9%, no: 77.1% | 83
Diabetes mellitus | Type 1: 0.8%, type 2: 4.4%, no: 94.8% | 80
Serum creatinine (mg/dL) | 0.88 ± 0.37 (0.20-2.82) | 89
Mean blood pressure (mmHg) | 83.9 ± 13.1 (50-120) | 57
Diuresis (mL/hour) | 154.6 ± 123.6 (0-700) | 56
Kidney transplant independent variables

Table 2 shows the characteristics related to the transplant procedure itself, with their corresponding statistical metrics and data completeness percentages.

Table 2 Clinical characteristics of kidney transplants.
Transplant feature | Summary statistics, mean ± SD (range) | Completeness (%)
Origin of the kidney | Local: 38.6%, national: 61.4% | 83
Cold ischemia time (hours) | 18.99 ± 6.05 (3.50-40.15) | 93
Warm ischemia time (minutes) | 39.49 ± 11.57 (10-90) | 71
Recipient independent variables

The recipients were also mostly male (54.3%) and 46.4 ± 12.9 years old; most had no or few comorbidities (average Charlson score below 3), although most had hypertension (84.9%) and showed laboratory values consistent with their end-stage renal disease. Table 3 shows a summary of all the available characteristics, and Table 4 shows laboratory parameters, both with their completeness within the dataset.

Table 3 Clinical characteristics of kidney recipients.
Recipient feature | Summary statistics, mean ± SD (range) | Completeness (%)
Recipient characteristics
Sex | Female: 45.7%, male: 54.3% | 100
Age (years) | 46.4 ± 12.9 (2-76) | 98
Weight (kg) | 66.0 ± 11.8 (28.2-115) | 92
Height (m) | 1.63 ± 0.09 (1.00-1.87) | 74
Body mass index (kg/m2) | 24.6 ± 3.1 (16.6-38.9) | 74
Blood type | O: 53.0%, A: 32.1%, B: 12.3%, AB: 2.51% | 99
Number of transplants (n) | 1: 83.5%, 2: 15%, 3: 1.5% | 51
Time on the waiting list (months) | 40.2 ± 31.7 (1-191) | 57
Pre-transplant dialysis time (months) | 69.3 ± 44.2 (0-384) | 95
Residual diuresis (mL/day) | 324.3 ± 491.5 (0-2500) | 63
Comorbidities
Hypertension | Yes: 84.9%, no: 15.1% | 94
Coronary artery disease | Yes: 5.9%, no: 94.1% | 90
Congestive heart failure | Yes: 3.2%, no: 96.8% | 90
Arrhythmias | Yes: 2.6%, no: 97.4% | 90
Peripheral vascular disease | Symptomatic: 2.8%, asymptomatic: 0.6%, no: 96.6% | 89
DM | Yes: 9.7%, no: 90.3% | 91
DM type | Type 1: 16.2%, type 2: 82.4% | 90
Cancer | Yes: 2.3%, no: 97.7% | 90
Uropathy | Yes: 2.6%, no: 97.4% | 90
HIV | Yes: 1.7%, no: 98.3% | 90
Other physical | Yes: 5.0%, no: 95.0% | 90
Other psychiatric | Yes: 2.6%, no: 97.4% | 90
Charlson score | 2.92 ± 1.23 (2-10) | 42
Clinical history
Transfusions | Yes: 30.3%, no: 69.7% | 64
Previous organ transplant | Yes: 0.56%, no: 99.44% | 70
Tobacco use | Yes: 47.5%, no: 52.5% | 69
Alcohol use | Yes: 26.1%, no: 73.9% | 72
Other drugs | Yes: 0.2%, no: 99.8% | 72
Cause of chronic kidney disease | Unknown: 44.5%, non-diabetes mellitus glomerulopathy: 27.1%, congenital and cystic: 8.0%, diabetic kidney disease: 6.9%, other: 6.8%, hypertensive or vascular: 3.8%, tubulointerstitial: 3.0% | 99
Dialysis | Yes: 99.7%, no: 0.3% | 97
Dialysis type | HD: 91.0%, PD: 5.1%, combination: 3.9% | 90
Table 4 Laboratory features of kidney recipients.
Laboratory feature of the recipient | Summary statistics, mean ± SD (range) | Completeness (%)
Serum creatinine (mg/dL) | 8.5 ± 2.6 (1.04-18.7) | 55
Proteinuria (g) | 12.5 ± 36.6 (0-148) | 4
Cholesterol (mg/dL) | 182.0 ± 44.0 (100-320) | 11
Phosphorus (mg/dL) | 5.0 ± 1.6 (1.4-9.9) | 48
Calcium (mg/dL) | 9.1 ± 1.0 (5.1-12.0) | 48
PTH (pg/mL) | 387.1 ± 371.5 (2.5-2292) | 43
Albumin (g/dL) | 4.3 ± 0.3 (3.1-5.5) | 38
Hb (g/dL) | 10.9 ± 1.6 (5.9-17.0) | 37
CMV | Positive: 78.5%, negative: 21.5% | 60
Chagas | Positive: 2.8%, negative: 97.2% | 61
Toxoplasma | Positive: 31.9%, negative: 68.1% | 61
HTLV-1 | Positive: 23.1%, negative: 76.9% | 2
PPD | Positive: 43.2%, negative: 56.8% | 5
Model results

Both model types exhibited variable but similar performance. While LR obtained an AUC-ROC ranging from 0.49 to 0.68 and an accuracy between 0.51 and 0.62, the six ML models obtained an AUC-ROC ranging from 0.35 to 0.81 and an accuracy between 0.48 and 0.70. The best-performing model for each dataset achieved AUC-ROCs of 0.81 (GB), 0.71 (RF), 0.67 (GB), and 0.62 (XGB and GB), and accuracies of 0.70 (RF), 0.70 (RF), 0.58 (decision trees, RF, and XGB), and 0.61 (XGB), respectively, for D, DT, DR, and DTR. Nevertheless, as shown in Tables 5 and 6, none of the ML models showed consistently and substantially better results than LR, and some performed only slightly better (or even worse) than tossing a coin. Indeed, Kruskal-Wallis tests found no significant differences between the models (P = 0.14 for AUC-ROC and P = 0.22 for accuracy), nor between LR and each model separately, as shown in Table 7.

Table 5 Area under the receiver operating characteristic curve and accuracy metrics for the different delayed graft function classification models and combinations of predictor variables.
Model | AUC-ROC (D / DT / DR / DTR) | Accuracy (D / DT / DR / DTR)
LR | 0.49 / 0.68 / 0.53 / 0.67 | 0.51 / 0.58 / 0.58 / 0.62
SVM | 0.35 / 0.62 / 0.51 / 0.51 | 0.57 / 0.57 / 0.53 / 0.53
DET | 0.67 / 0.45 / 0.58 / 0.51 | 0.58 / 0.49 / 0.58 / 0.48
RF | 0.78 / 0.71 / 0.57 / 0.52 | 0.70 / 0.70 / 0.58 / 0.50
GB | 0.81 / 0.70 / 0.67 / 0.62 | 0.63 / 0.63 / 0.56 / 0.60
XGB | 0.75 / 0.66 / 0.60 / 0.62 | 0.60 / 0.63 / 0.58 / 0.61
MLP | 0.68 / 0.70 / 0.50 / 0.47 | 0.61 / 0.61 / 0.53 / 0.49
Table 6 Sensitivity and specificity metrics for the different delayed graft function classification models and combinations of predictor variables.
Model | Sensitivity (D / DT / DR / DTR) | Specificity (D / DT / DR / DTR)
LR | 0.33 / 0.44 / 0.18 / 0.56 | 0.67 / 0.71 / 0.95 / 0.67
SVM | 0 / 0 / 0 / 0 | 1 / 1 / 0.93 / 0.93
DET | 0.44 / 0.29 / 0.38 / 0.50 | 0.69 / 0.64 / 0.73 / 0.47
RF | 0.44 / 0.50 / 0.32 / 0.52 | 0.89 / 0.84 / 0.78 / 0.49
GB | 0.21 / 0.50 / 0 / 0.47 | 0.96 / 0.73 / 0.98 / 0.69
XGB | 0.09 / 0.50 / 0.41 / 0.61 | 0.98 / 0.73 / 0.71 / 0.64
MLP | 0.53 / 0.56 / 0.56 / 0.44 | 0.67 / 0.64 / 0.51 / 0.53

Table 6 shows the sensitivity and specificity of the corresponding models. The results are highly heterogeneous across datasets and models; sensitivity values are consistently low, indicating an overall poor ability to predict DGF (the positive class of the target variable). The only models achieving sensitivities above 0.5 were the multilayer perceptron models trained on the D, DT, and DR datasets, with scores ranging from 0.53 to 0.56. For the DTR dataset, LR, RF, and XGB also exceeded this threshold, with values of 0.56, 0.52, and 0.61, respectively. As shown in Table 7, there were no statistically significant differences in these metrics between LR and each of the ML models. Regarding interpretability, Table 8 shows the significant variables from the final multivariate LRs, while Table 9 presents permutation feature importance and Shapley values for some of the top-performing non-LR models (GB with D, RF with DT, and GB with DR data). Across most models, both LR and ML, significant predictors included donor creatinine, donor age, cold ischemia time, and recipient smoking status. In the LR models, donor age was the predictor found relevant most often; in Table 9, donor age was relevant in all models, and donor creatinine and mean blood pressure in most.

Table 7 Kruskal-Wallis P values comparing logistic regression to each machine learning model independently.
Comparison | AUC-ROC P value | Accuracy P value | Sensitivity P value | Specificity P value
LR vs SVM | 0.2454 | 0.2396 | 0.0139 | 0.0778
LR vs DET | 0.4678 | 0.2186 | 0.8845 | 0.3836
LR vs RF | 0.3865 | 0.5516 | 0.6631 | 0.7715
LR vs GB | 0.1913 | 0.2425 | 0.7728 | 0.1465
LR vs XGB | 0.5637 | 0.2367 | 0.7728 | 0.6612
LR vs MLP | 0.8845 | 0.7702 | 0.1804 | 0.0384
Table 8 Statistically significant variables at the 95% significance level from multivariate logistic regressions.
Data | Variable | Odds ratio | P value
DTR | Donor (D): Age | 82.4 | < 0.01
DTR | Transplant (T): Cold ischemia time (hours) | 30.8 | 0.001
DTR | Recipient (R): Residual diuresis (mL/day) | 0.1 | 0.006
DTR | Recipient (R): Smoking (BV1 = no smoking) | 15.5 | 0.02
DR | Recipient (R): Residual diuresis (mL/day) | 0.1 | 0.008
DR | Recipient (R): Smoking (BV1 = no smoking) | 10.4 | 0.02
DT | Transplant (T): Cold ischemia time (hours) | 28.9 | 0.001
DT | Donor (D): Age | 60.8 | < 0.01
D | Donor (D): Age | 35.9 | < 0.01
Table 9 Most important features according to permutation feature importance (PFI) and Shapley additive explanations (SHAP) methods.
Model and data | Relevant predictor (data set) | Mean error increase (PFI) | SD of error increase (PFI) | Mean and direction of SHAP values
GB with D | Creatinine (D) | 0.05 | 0.02 | 0.07 (+/-)
GB with D | Age (D) | 0.03 | 0.01 | 0.09 (+)
GB with D | Stroke death (D) | 0.02 | 0.01 | 0.03 (+)
GB with D | ECD (D) | 0.02 | 0.02 | 0.03 (+)
GB with D | MBP (D) | 0.01 | 0.01 | 0.01 (-)
RF with DT | Age (D) | 0.06 | 0.02 | 0.03 (+)
RF with DT | MBP (D) | 0.04 | 0.01 | 0.02 (-)
RF with DT | Cold ischemia time (T) | 0.03 | 0.04 | 0.05 (+)
RF with DT | Creatinine (D) | 0.01 | 0.01 | 0.02 (+/-)
RF with DT | Warm ischemia time (T) | 0.01 | 0.02 | 0.05 (+)
GB with DR | Age (D) | 0.04 | 0.01 | 0.03 (+)
GB with DR | Smoking (R) | 0.03 | 0.02 | 0.02 (+)
GB with DR | MBP (D) | 0.02 | 0.01 | 0.01 (-)
GB with DR | Creatinine (D) | 0.02 | 0.01 | 0.01 (+/-)
DISCUSSION

DGF is a critical adverse event following kidney transplantation, with implications for both early and potentially long-term graft function and survival[1,12]. Most physicians involved in post-transplant care are keen to predict DGF, as doing so enables more precise tailoring of immunosuppressive regimens and other pharmacological interventions. Early identification also facilitates timely diagnosis and management of complications associated with reduced glomerular filtration rate[1]. Traditional prediction tools rely primarily on statistical methods, such as linear and logistic regression. Our results show that LR performs comparably to the six ML models evaluated. The top-performing models incorporate clinical variables from the donor, the transplant procedure, and the recipient, achieving average accuracy and AUC-ROC values of 0.59 ± 0.087, with the best model (GB) reaching 0.81. These results, in terms of both predictive power and relevant variables, are comparable to (and even better than) the United States’ Kidney Donor Risk Index, which was developed using the extensive United Network for Organ Sharing dataset and reported a performance of 0.62[13]. Nonetheless, there was a consistent trend of high specificity but low sensitivity, indicating that the models predicted the absence of DGF well while struggling to identify the patterns associated with its occurrence. In other words, despite exhibiting AUC-ROC values and accuracies comparable to those reported in the literature, both the ML and LR models demonstrate limited ability to correctly predict DGF, which is a major limitation. The imbalance between specificity and sensitivity likely reflects the multifactorial decision-making involved when a physician prescribes immediate post-transplant dialysis, which may not be fully captured by the available predictors and which diminishes the clinical impact of the models.

At first glance, it may seem paradoxical that ML models did not outperform LR, given the prevailing trend toward artificial intelligence tools. However, this finding is not novel. Prior studies have shown that the AUC values of LR and ML models for clinical risk prediction are often comparable when the comparisons are conducted with low risk of bias. Interestingly, ML models tend to demonstrate superior performance in studies where the risk of bias is higher[3]. Another critical consideration before implementing ML models in clinical medicine is the quality and completeness of the data being analyzed. As shown in Tables 1, 2, 3, and 4, many predictors exhibited low or insufficient levels of completeness. Moreover, several clinically relevant features related to graft quality, such as the administration of vasoactive and vasopressor drugs to the donor, were absent from the available data[14].

ML models are designed to learn directly and automatically from data[15]. In contrast, regression models rely on predefined theoretical frameworks and assumptions, and their performance can be enhanced through human intervention and domain expertise during model specification. For instance, ML can more effortlessly incorporate nonlinear associations and interaction terms without manual adjustment[16]. In this sense, ML may be considered more flexible than classical regression analysis, as it adapts to the data to optimize outcome prediction without requiring a predefined model structure. From another perspective, the primary goal of ML is to predict outcomes from input data, whereas linear and logistic regression aim to ensure that predicted outputs do not deviate significantly from observed values within the constraints of the specified model. This raises a new and practical concern: Interpretability. Will clinicians trust the findings produced by ML models? Unlike LR, which yields parameter estimates that are straightforward to interpret (as presented in Table 8), ML models express each predictor’s contribution as an importance score and a direction of effect rather than a specific coefficient, which can appear opaque or even “magical” to users, who may struggle to understand objectively how predictions are generated.

In our case, we believe that the similar results between LR and ML models are driven by a “small” dataset with a large amount of missing data for some potentially relevant variables. The ML models could not leverage their full potential, since the explanatory variables included were already those that traditional models have shown to be relevant, and no new nonlinear interactions could be extracted. Nonetheless, despite the overall moderate performance, the models revealed several interesting patterns in the predictor variables. As expected from the literature, most LR and ML models identified donor age and donor serum creatinine as important predictors. It is noteworthy, however, that the donor’s mean blood pressure and whether the recipient was an active smoker stood out as consistently relevant variables. These are interactions worth exploring in future work.

CONCLUSION

Although the predictive performance was lower than expected, and the dataset lacked some of the quality typically required for ML models to excel, our results achieved performance comparable to the Kidney Donor Risk Index while using only about one-tenth of the original sample size. Moreover, the identification of a potentially novel association suggests a promising avenue for future investigation.

References
1.  Hariharan S, Israni AK, Danovitch G. Long-Term Survival after Kidney Transplantation. N Engl J Med. 2021;385:729-743.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 115]  [Cited by in RCA: 443]  [Article Influence: 88.6]  [Reference Citation Analysis (0)]
2.  Lentine KL, Smith JM, Lyden GR, Miller JM, Booker SE, Dolan TG, Temple KR, Weiss S, Handarova D, Israni AK, Snyder JJ. OPTN/SRTR 2023 Annual Data Report: Kidney. Am J Transplant. 2025;25:S22-S137.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 12]  [Cited by in RCA: 79]  [Article Influence: 79.0]  [Reference Citation Analysis (0)]
3.  Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12-22.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 935]  [Cited by in RCA: 1155]  [Article Influence: 165.0]  [Reference Citation Analysis (0)]
4.  Widaman KF. III. Missing data: What to do with or without them. Monogr Soc Res Child Dev. 2006;71:42-64.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 79]  [Cited by in RCA: 72]  [Article Influence: 3.6]  [Reference Citation Analysis (0)]
5.  Lai TL, Robbins H, Wei CZ. Strong consistency of least squares estimates in multiple regression. Proc Natl Acad Sci U S A. 1978;75:3034-3036.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 60]  [Cited by in RCA: 28]  [Article Influence: 1.8]  [Reference Citation Analysis (0)]
6.  Cortes C, Vapnik VN. Support-vector networks. Mach Learn. 1995;20:273-297.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 22481]  [Cited by in RCA: 10958]  [Article Influence: 644.6]  [Reference Citation Analysis (0)]
7.  von Winterfeldt D, Edwards W.   Decision Analysis and Behavioral Research. Cambridge University Press, 1986.  [PubMed]  [DOI]
8.  Rebala G, Ravi A, Churiwala S.   An Introduction to Machine Learning. Springer, 2019.  [PubMed]  [DOI]  [Full Text]
9.  Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Statist. 2001;29.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 8988]  [Cited by in RCA: 9244]  [Article Influence: 369.8]  [Reference Citation Analysis (0)]
10.  Chen TQ, Guestrin C.   XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 785-794.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 12755]  [Cited by in RCA: 9651]  [Article Influence: 965.1]  [Reference Citation Analysis (1)]
11.  Popescu MC, Balas VE, Perescu-Popescu L, Mastorakis N. Multilayer perceptron and neural networks. WSEAS Trans Cir Sys. 2009;8:579-588.  [PubMed]  [DOI]
12.  Zens TJ, Danobeitia JS, Leverson G, Chlebeck PJ, Zitur LJ, Redfield RR, D'Alessandro AM, Odorico S, Kaufman DB, Fernandez LA. The impact of kidney donor profile index on delayed graft function and transplant outcomes: A single-center analysis. Clin Transplant. 2018;32:e13190.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 59]  [Cited by in RCA: 103]  [Article Influence: 14.7]  [Reference Citation Analysis (0)]
13.  Rao PS, Schaubel DE, Guidinger MK, Andreoni KA, Wolfe RA, Merion RM, Port FK, Sung RS. A comprehensive risk quantification score for deceased donor kidneys: the kidney donor risk index. Transplantation. 2009;88:231-236.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 686]  [Cited by in RCA: 837]  [Article Influence: 49.2]  [Reference Citation Analysis (1)]
14.  Almasri J, Tello M, Benkhadra R, Morrow AS, Hasan B, Farah W, Alvarez Villalobos N, Mohammed K, Allen JP, Prokop LJ, Wang Z, Kasiske BL, Israni AK, Murad MH. A Systematic Review for Variables to Be Collected in a Transplant Database for Improving Risk Prediction. Transplantation. 2019;103:2591-2601.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 5]  [Cited by in RCA: 12]  [Article Influence: 2.0]  [Reference Citation Analysis (0)]
15.  Mitchell JB. Machine learning methods in chemoinformatics. Wiley Interdiscip Rev Comput Mol Sci. 2014;4:468-481.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Full Text (PDF)]  [Cited by in Crossref: 243]  [Cited by in RCA: 270]  [Article Influence: 22.5]  [Reference Citation Analysis (0)]
16.  Boulesteix AL, Schmid M. Machine learning versus statistical modeling. Biom J. 2014;56:588-593.  [RCA]  [PubMed]  [DOI]  [Full Text]  [Cited by in Crossref: 87]  [Cited by in RCA: 63]  [Article Influence: 5.3]  [Reference Citation Analysis (0)]
Footnotes

Peer review: Externally peer reviewed.

Peer-review model: Single blind

Specialty type: Urology and nephrology

Country of origin: Chile

Peer-review report’s classification

Scientific quality: Grade B, Grade C

Novelty: Grade B, Grade B

Creativity or innovation: Grade B, Grade C

Scientific significance: Grade B, Grade B

P-Reviewer: Zerem E, MD, PhD, Professor, Bosnia and Herzegovina S-Editor: Bai SR L-Editor: A P-Editor: Zhang L