A workflow for missing values imputation of untargeted metabolomics data

Faquih, Tariq; van Smeden, Maarten; Luo, Jiao; Le Cessie, Saskia; Kastenmüller, Gabi; Krumsiek, Jan; Noordam, Raymond; van Heemst, Diana; Rosendaal, Frits R.; Vlieg, Astrid van Hylckama; van Dijk, Ko Willems; Mook-Kanamori, Dennis O.

doi:https://doi.org/10.3390/metabo10120486

Files

metabolites-10-00486-v2.pdf (2.91 MB)

Publication date

2020-12

Authors

Faquih, Tariq

van Smeden, Maarten

Luo, Jiao

Le Cessie, Saskia

Kastenmüller, Gabi

Krumsiek, Jan

Noordam, Raymond

van Heemst, Diana

Rosendaal, Frits R.

Vlieg, Astrid van Hylckama

van Dijk, Ko Willems

Mook-Kanamori, Dennis O.

DOI

https://doi.org/10.3390/metabo10120486

Document Type

Article

Collections

UMC Repository

License

CC BY

Abstract

Metabolomics studies have seen a steady growth due to the development and implementation of affordable and high-quality metabolomics platforms. In large metabolite panels, measurement values are frequently missing and, if neglected or sub-optimally imputed, can cause biased study results. We provided a publicly available, user-friendly R script to streamline the imputation of missing endogenous, unannotated, and xenobiotic metabolites. We evaluated the multivariate imputation by chained equations (MICE) and k-nearest neighbors (kNN) analyses implemented in our script by simulations using measured metabolites data from the Netherlands Epidemiology of Obesity (NEO) study (n = 599). We simulated missing values in four unique metabolites from different pathways with different correlation structures in three sample sizes (599, 150, 50) with three missing percentages (15%, 30%, 60%), and using two missing mechanisms (completely at random and not at random). Based on the simulations, we found that for MICE, larger sample size was the primary factor decreasing bias and error. For kNN, the primary factor reducing bias and error was the metabolite correlation with its predictor metabolites. MICE provided consistently higher performance measures particularly for larger datasets (n > 50). In conclusion, we presented an imputation workflow in a publicly available R script to impute untargeted metabolomics data. Our simulations provided insight into the effects of sample size, percentage missing, and correlation structure on the accuracy of the two imputation methods.

Keywords

Imputation, K-nearest neighbors, Metabolon, Multiple imputation using chained equations, Simulation, Untargeted metabolomics, Workflow, Endocrinology, Diabetes and Metabolism, Biochemistry, Molecular Biology

Citation

Faquih, T, van Smeden, M, Luo, J, Le Cessie, S, Kastenmüller, G, Krumsiek, J, Noordam, R, van Heemst, D, Rosendaal, F R, Vlieg, A V H, van Dijk, K W & Mook-Kanamori, D O 2020, 'A workflow for missing values imputation of untargeted metabolomics data', Metabolites, vol. 10, no. 12, 486, pp. 1-23. https://doi.org/10.3390/metabo10120486

URI

https://dspace.library.uu.nl/handle/1874/457641
