Abstract: TH-PO386
Optimizing Machine Learning Methods for Clinical Outcome Prediction
Session Information
- CKD: Risk Scores and Translational Epidemiology
November 07, 2019 | Location: Exhibit Hall, Walter E. Washington Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: CKD (Non-Dialysis)
- 2101 CKD (Non-Dialysis): Epidemiology, Risk Factors, and Prevention
Authors
- Liu, Qian, Arbor Research Collaborative for Health, Ann Arbor, Michigan, United States
- Smith, Abigail R., Arbor Research Collaborative for Health, Ann Arbor, Michigan, United States
- Mariani, Laura H., University of Michigan, Ann Arbor, Michigan, United States
- Zee, Jarcy, Arbor Research Collaborative for Health, Ann Arbor, Michigan, United States
Background
Machine learning (ML) is useful for identifying novel biomarkers and predicting clinical outcomes, especially when predictors outnumber patients, but model-building procedures are underutilized. We compared two ML methods and assessed the impact of pre-specifying covariate functional forms on predictive accuracy and variable importance, using data from NEPTUNE, a prospective cohort study of patients with glomerular disease.
Methods
The sample was split into training (70%) and validation (30%) sets. Ridge regression and random forest models were developed in the training set to predict time to two clinical outcomes: disease progression (ESRD or ≥40% eGFR decline with last eGFR <60) and complete remission of proteinuria (UPCR <0.3). Each model was fit with and without categorizing continuous covariates to accommodate non-linear associations with outcomes. Predictors included 56 demographic/clinical characteristics, which were ranked by variable importance. Discrimination was estimated in the validation set using integrated area under the curve (iAUC).
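The two data-preparation steps described above, the 70/30 split and the categorization of a continuous covariate, can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the random seed, the example values, and the eGFR cutpoints are all assumptions made for the example, since the abstract does not report the categories actually used.

```python
import random


def train_validation_split(ids, train_fraction=0.7, seed=0):
    """Randomly assign IDs to a training set and a validation set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def categorize(value, cutpoints):
    """Return the index of the interval containing value.

    With sorted cutpoints c0 < c1 < ..., the intervals are
    (-inf, c0), [c0, c1), ..., [c_last, +inf).
    """
    return sum(value >= c for c in cutpoints)


def one_hot(category, n_categories):
    """Indicator vector that uses category 0 as the reference level."""
    return [1 if category == k else 0 for k in range(1, n_categories)]


# Hypothetical eGFR cutpoints; NOT taken from the abstract.
EGFR_CUTPOINTS = [30, 60, 90]

# Hypothetical patient identifiers, used only to show the 70/30 split.
patient_ids = list(range(100))
train_ids, valid_ids = train_validation_split(patient_ids)

# Made-up eGFR values, one per category, to show the encoding.
egfr_values = [22.0, 45.5, 60.0, 104.3]
egfr_features = [
    one_hot(categorize(v, EGFR_CUTPOINTS), len(EGFR_CUTPOINTS) + 1)
    for v in egfr_values
]

print(len(train_ids), len(valid_ids))
print(egfr_features)
```

Replacing the single continuous eGFR column with these indicator columns lets a linear model such as ridge regression estimate a separate effect for each category instead of one slope across the whole range.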
Results
Using pre-specified covariate functional forms in ridge regression increased iAUC from 0.68 to 0.74 for the progression outcome, but had little impact on iAUC for remission (0.79 vs. 0.78; Fig) and little impact on random forest for either outcome. iAUCs from random forest were higher than those from ridge for progression but not for remission. After pre-specifying functional forms in ridge regression, variable importance ranks increased for some known risk factors: the rank of UPCR for predicting remission rose from 48 to 5, and the rank of eGFR for predicting progression rose from 52 to 1. Other important predictors were disease diagnosis, age, and immunosuppression use for remission, and disease diagnosis, race, and hypertension for progression.
Conclusion
For ML methods assuming linear associations, like ridge regression, pre-specifying covariate functional forms is important for predictive accuracy and detecting important predictors. Different ML methods may improve prediction for different outcomes. Higher ranking of known risk factors improves face validity in prediction models and may have positive implications for external validation performance.
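Why a linear method can miss a predictor with a non-linear association can be seen in a toy Python sketch. It rests on several simplifying assumptions that are not from the study: the data are synthetic (a U-shaped relationship), the outcome is continuous rather than time-to-event, fit is summarized by in-sample R² rather than validation iAUC, and the penalty value and cutpoints are arbitrary. The small ridge solver is written out by hand so the example needs no external libraries.

```python
def categorize(value, cutpoints):
    """Index of the interval (defined by sorted cutpoints) containing value."""
    return sum(value >= c for c in cutpoints)


def indicators(category, n_categories):
    """Indicator columns with category 0 as the reference level."""
    return [1.0 if category == k else 0.0 for k in range(1, n_categories)]


def center_columns(rows):
    """Subtract each column's mean so no intercept term is needed."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def ridge_fit(X, y, penalty):
    """Solve (X'X + penalty * I) b = X'y by Gaussian elimination."""
    p = len(X[0])
    A = [[sum(r[j] * r[k] for r in X) + (penalty if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= factor * A[col][k]
            b[r] -= factor * b[col]
    coef = [0.0] * p
    for j in reversed(range(p)):  # back substitution
        tail = sum(A[j][k] * coef[k] for k in range(j + 1, p))
        coef[j] = (b[j] - tail) / A[j][j]
    return coef


def r_squared(X, y, coef):
    """In-sample R^2; assumes y has already been centered."""
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in X]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum(yi ** 2 for yi in y)
    return 1.0 - rss / tss


# Synthetic U-shaped association: the outcome is high at both extremes of x.
x = [i / 99 for i in range(100)]
y_raw = [(xi - 0.5) ** 2 for xi in x]
y_mean = sum(y_raw) / len(y_raw)
y = [v - y_mean for v in y_raw]

# Model 1: x entered as a single linear term.
X_linear = center_columns([[xi] for xi in x])
# Model 2: x categorized at arbitrary cutpoints, entered as indicators.
cutpoints = [0.25, 0.5, 0.75]
X_categorical = center_columns(
    [indicators(categorize(xi, cutpoints), len(cutpoints) + 1) for xi in x]
)

r2_linear = r_squared(X_linear, y, ridge_fit(X_linear, y, penalty=1.0))
r2_categorical = r_squared(
    X_categorical, y, ridge_fit(X_categorical, y, penalty=1.0)
)

# The linear term explains almost none of the variance; the categories
# recover most of the U shape.
print(round(r2_linear, 3), round(r2_categorical, 3))
```

In this toy setting the linear coefficient is close to zero, so a coefficient-based importance measure would rank the predictor low even though it fully determines the outcome. That is consistent in spirit with the rank changes reported above for eGFR and UPCR, though the sketch does not reproduce the study's models or data.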
Funding
- Other NIH Support