Automatic identification of variables in epidemiological datasets using logic regression
Publication date
2017-04-13
Document Type
Article
Abstract
Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows software to present, for each target variable, a shortlist of candidate source variables from which a user picks the matching one, with only a low risk that the correct source variable is missing from the list.
Methods: For each variable in a set of target variables, a number of simple rules were manually created. Logic regression was then used to search for an optimal Boolean combination of these rules for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated.
Results: In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.
Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, so backup strategies may be required.
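To make the Methods and Results more concrete, the following Python sketch shows how one Boolean combination of simple rules can be scored by PPV and NPV on labelled variable names. It is illustrative only: the target variable (systolic blood pressure), the three rules, the combination and the labelled names are all hypothetical and are not taken from the study, and the search over combinations that logic regression performs is not implemented here.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], bool]

# Hypothetical simple rules for a hypothetical target variable
# (systolic blood pressure). These are NOT the rules used in the study.
RULES: Dict[str, Rule] = {
    "has_sys": lambda name: "sys" in name.lower(),
    "has_bp": lambda name: "bp" in name.lower(),
    "has_dia": lambda name: "dia" in name.lower(),
}


def combined(name: str) -> bool:
    """One candidate Boolean combination: (has_sys OR has_bp) AND NOT has_dia."""
    r = {key: rule(name) for key, rule in RULES.items()}
    return (r["has_sys"] or r["has_bp"]) and not r["has_dia"]


def ppv_npv(
    labelled: List[Tuple[str, bool]], predict: Callable[[str], bool]
) -> Tuple[float, float]:
    """Return (PPV, NPV) of `predict` on (variable name, is_match) pairs."""
    tp = fp = tn = fn = 0
    for name, truth in labelled:
        pred = predict(name)
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and not truth:
            tn += 1
        else:
            fn += 1
    # PPV = TP / (TP + FP); NPV = TN / (TN + FN); NaN if undefined.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Made-up source variable names with made-up ground-truth labels.
LABELLED: List[Tuple[str, bool]] = [
    ("SYSBP", True),
    ("sbp_mean", True),
    ("bp_sys_1", True),
    ("pressure_s", True),   # missed by every rule: false negative
    ("DIABP", False),
    ("dia_bp", False),
    ("system_id", False),   # matches "sys": false positive
    ("age", False),
]

if __name__ == "__main__":
    ppv, npv = ppv_npv(LABELLED, combined)
    print(f"PPV={ppv:.2f} NPV={npv:.2f}")
```

In a construction/validation design like the one described, a combination would be chosen on one subset and then scored with a function such as `ppv_npv` on the other, so that the reported values are not inflated by the search itself.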
Keywords
Data management, Epidemiology, Logic regression, Meta-analysis, Health Policy, Health Informatics, Journal Article
Citation
Lorenz, M W, Abdi, N A, Scheckenbach, F, Pflug, A, Bülbül, A, Catapano, A L, Agewall, S, Ezhov, M, Bots, M L, Kiechl, S, Orth, A, Norata, G D, Empana, J P, Lin, H J, McLachlan, S, Bokemark, L, Ronkainen, K, Amato, M, Schminke, U, Srinivasan, S R, Lind, L, Kato, A, Dimitriadis, C, Przewlocki, T, Okazaki, S, Stehouwer, C D A, Lazarevic, T, Willeit, P, Yanez, D N, Steinmetz, H, Sander, D, Poppert, H, Desvarieux, M, Ikram, M A, Bevc, S, Staub, D, Sirtori, C R, Iglseder, B, Engström, G, Tripepi, G, Beloqui, O, Lee, M S, Friera, A, Xie, W, Grigore, L, Plichart, M, Su, T C, Robertson, C, Schmidt, C, Tuomainen, T P, Veglia, F, Völzke, H, Nijpels, G, Jovanovic, A, Willeit, J, Sacco, R L, Franco, O H, Hojs, R, Uthoff, H, Hedblad, B, Park, H W, Suarez, C, Zhao, D, Catapano, A, Ducimetiere, P, Chien, K L, Price, J F, Bergström, G, Kauhanen, J, Tremoli, E, Dörr, M, Berenson, G, Papagianni, A, Kablak-Ziembicka, A, Kitagawa, K, Dekker, J M, Stolic, R, Polak, J F, Sitzer, M, Bickel, H, Rundek, T, Hofman, A, Ekart, R, Frauchiger, B, Castelnuovo, S, Rosvall, M, Zoccali, C, Landecho, M F, Bae, J H, Gabriel, R, Liu, J, Baldassarre, D & Kavousi, M 2017, 'Automatic identification of variables in epidemiological datasets using logic regression', BMC medical informatics and decision making [E], vol. 17, no. 1, 40. https://doi.org/10.1186/s12911-017-0429-1