Kidney Week

Abstract: TH-PO386

Optimizing Machine Learning Methods for Clinical Outcome Prediction

Session Information

Category: CKD (Non-Dialysis)

  • 2101 CKD (Non-Dialysis): Epidemiology, Risk Factors, and Prevention


  • Liu, Qian, Arbor Research Collaborative for Health, Ann Arbor, Michigan, United States
  • Smith, Abigail R., Arbor Research Collaborative for Health, Ann Arbor, Michigan, United States
  • Mariani, Laura H., University of Michigan, Ann Arbor, Michigan, United States
  • Zee, Jarcy, Arbor Research Collaborative for Health, Ann Arbor, Michigan, United States

Machine learning (ML) is useful for identifying novel biomarkers and predicting clinical outcomes, especially when predictors outnumber patients, but model building procedures are underutilized. We compared two ML methods and the impact of pre-specifying covariate functional forms on predictive accuracy and variable importance using data from NEPTUNE, a prospective cohort study of glomerular disease patients.


The sample was split into training (70%) and validation (30%) sets. Ridge regression and random forest models were developed in the training set to predict time to two clinical outcomes: disease progression (ESRD or ≥40% eGFR decline with last eGFR <60) and complete remission of proteinuria (UPCR <0.3), with and without categorizing continuous covariates to accommodate non-linear associations with outcomes. Predictors included 56 demographic/clinical characteristics, which were ranked by variable importance. Discrimination was estimated in the validation set using integrated area under the curve (iAUC).
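The split and the categorization step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the random seed, and the eGFR cutpoints are assumptions made for the example, and the abstract does not state which cutpoints or software were used.

```python
import random


def split_train_validation(ids, train_fraction=0.7, seed=0):
    """Randomly split patient ids into training and validation lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def categorize(value, cutpoints):
    """One-hot encode a continuous value into len(cutpoints) + 1 bins.

    Bin k holds values with cutpoints[k-1] <= value < cutpoints[k];
    cutpoints must be sorted in increasing order.
    """
    index = sum(value >= c for c in cutpoints)
    return [1 if k == index else 0 for k in range(len(cutpoints) + 1)]


# Hypothetical cutpoints, chosen only for illustration.
EGFR_CUTPOINTS = [30, 60, 90]

train_ids, validation_ids = split_train_validation(range(100))
egfr_indicators = categorize(45.0, EGFR_CUTPOINTS)  # falls in the 30-60 bin
```

Replacing one continuous column with a set of indicator columns lets a linear model such as ridge regression fit a separate effect per bin, which is one way to represent a non-linear association. In a regression model one indicator would usually be dropped as the reference category.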


Using pre-specified covariate functional forms in ridge regression increased iAUC from 0.68 to 0.74 for the progression outcome, but had little impact for remission (0.79 vs. 0.78; Fig) or for the random forest method with either outcome. iAUCs from random forest were higher than those from ridge for progression but not for remission. After pre-specifying functional forms in ridge regression, variable importance ranks increased for some known risk factors: the rank of UPCR for predicting remission rose from 48 to 5, and the rank of eGFR for predicting progression rose from 52 to 1. Other important predictors were disease diagnosis, age, and immunosuppression use for remission, and disease diagnosis, race, and hypertension for progression.
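To make the rank comparison concrete, the sketch below turns importance scores into ranks, with rank 1 as the most important predictor. The abstract does not say how importance was computed, so ordering by absolute score is an assumption, and the scores are toy values invented for the example rather than study results.

```python
def rank_by_importance(importance):
    """Map each predictor name to its rank (1 = largest absolute score)."""
    ordered = sorted(importance, key=lambda name: abs(importance[name]), reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Toy scores for three hypothetical predictors (not taken from the study).
toy_scores = {"eGFR": -0.80, "age": 0.25, "hypertension": 0.40}
toy_ranks = rank_by_importance(toy_scores)
```

Using the absolute value means a strongly protective predictor ranks as high as a strongly harmful one of the same magnitude.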


For ML methods assuming linear associations, like ridge regression, pre-specifying covariate functional forms is important for predictive accuracy and detecting important predictors. Different ML methods may improve prediction for different outcomes. Higher ranking of known risk factors improves face validity in prediction models and may have positive implications for external validation performance.


  • Other NIH Support