Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data: The Impact of Data Preparation Choices on Model Performance

van Os, Hendrikus J.A.; Kanning, Jos P.; Wermer, Marieke J.H.; Chavannes, Niels H.; Numans, Mattijs E.; Ruigrok, Ynte M.; van Zwet, Erik W.; Putter, Hein; Steyerberg, Ewout W.; Groenwold, Rolf H.H.

doi:https://doi.org/10.3389/fepid.2022.871630

Files

fepid-02-871630.pdf (554.37 KB)

Publication date

2022

Authors

van Os, Hendrikus J.A.

Kanning, Jos P.

Wermer, Marieke J.H.

Chavannes, Niels H.

Numans, Mattijs E.

Ruigrok, Ynte M.

van Zwet, Erik W.

Putter, Hein

Steyerberg, Ewout W.

Groenwold, Rolf H.H.

DOI

https://doi.org/10.3389/fepid.2022.871630

Document Type

Article

Collections

UMC Repository

License

CC BY

Abstract

Objective: To quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR).

Study Design and Setting: Cox proportional hazards models were developed for predicting the first-ever main adverse cardiovascular events using Dutch primary care EHR data. The reference model was based on a 1-year run-in period, cardiovascular events were defined based on both EHR diagnosis and medication codes, and missing values were multiply imputed. We compared data preparation choices based on (i) length of the run-in period (2- or 3-year run-in); (ii) outcome definition (EHR diagnosis codes or medication codes only); and (iii) methods addressing missing values (mean imputation or complete case analysis) by making variations on the derivation set and testing their impact in a validation set.

Results: We included 89,491 patients in whom 6,736 first-ever main adverse cardiovascular events occurred during a median follow-up of 8 years. Outcome definition based only on diagnosis codes led to a systematic underestimation of risk (calibration curve intercept: 0.84; 95% CI: 0.83–0.84), while complete case analysis led to overestimation (calibration curve intercept: −0.52; 95% CI: −0.53 to −0.51). Differences in the length of the run-in period showed no relevant impact on calibration and discrimination.

Conclusion: Data preparation choices regarding outcome definition or methods to address missing values can have a substantial impact on the calibration of predictions, hampering reliable clinical decision support. This study further illustrates the urgency of transparent reporting of modeling choices in an EHR data setting.
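
Illustrative note on the calibration intercept: the abstract reports calibration curve intercepts, where a positive value corresponds to underestimated risk and a negative value to overestimated risk. The sketch below shows that sign convention on a deliberately simplified case. It assumes a binary outcome at a fixed time horizon and ignores censoring, so it is not the procedure used in the study, which fitted Cox proportional hazards models. The function name `calibration_intercept` and the toy data are hypothetical and chosen only for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def calibration_intercept(predicted, observed, tol=1e-10, max_iter=100):
    """Estimate a in logit(P(y = 1)) = a + logit(predicted), slope fixed at 1.

    Assumption: binary outcomes (0/1) with no censoring. The intercept is
    found by Newton-Raphson on the intercept-only logistic likelihood with
    the predicted log-odds used as an offset.
    """
    offsets = [logit(p) for p in predicted]
    a = 0.0
    for _ in range(max_iter):
        fitted = [sigmoid(a + o) for o in offsets]
        score = sum(y - f for y, f in zip(observed, fitted))  # first derivative
        info = sum(f * (1.0 - f) for f in fitted)  # negative second derivative
        step = score / info
        a += step
        if abs(step) < tol:
            break
    return a


# Toy data: predicted risk of 10% for everyone, observed event rate of 20%.
under = calibration_intercept([0.1] * 10, [1, 1] + [0] * 8)

# Toy data: predicted risk of 20% for everyone, observed event rate of 10%.
over = calibration_intercept([0.2] * 10, [1] + [0] * 9)

print(f"underestimated risk -> intercept {under:.3f}")
print(f"overestimated risk  -> intercept {over:.3f}")
```

In this toy setting the intercept equals the difference between the observed and predicted log-odds, so it is positive when the model predicts too little risk and negative when it predicts too much, matching the direction of the two intercepts reported in the Results.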

Keywords

clinical prediction model, data preparation, electronic health records (EHRs), model performance, model transportability, prediction model, Epidemiology, Infectious Diseases, Public Health, Environmental and Occupational Health, Health (social science)

Citation

van Os, H J A, Kanning, J P, Wermer, M J H, Chavannes, N H, Numans, M E, Ruigrok, Y M, van Zwet, E W, Putter, H, Steyerberg, E W & Groenwold, R H H 2022, 'Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data: The Impact of Data Preparation Choices on Model Performance', Frontiers in Epidemiology, vol. 2, 871630. https://doi.org/10.3389/fepid.2022.871630

URI

https://dspace.library.uu.nl/handle/1874/458354
