Rate variation and recurrent sequence errors in pandemic-scale phylogenetics
Publication date
2026-03
Authors
De Maio, Nicola
Willemsen, Myrthe
Martin, Samuel
Guo, Zihao
Saha, Abhratanu
Hunt, Martin
Ly-Trong, Nhan
Minh, Bui Quang
Iqbal, Zamin
Goldman, Nick
Document Type
Article
License
CC BY
Abstract
Phylogenetic analyses of genome sequences from infectious pathogens reveal essential information regarding their evolution and transmission, as seen during the coronavirus disease 2019 pandemic. Recently developed pandemic-scale phylogenetic inference methods reduce the computational demand of phylogenetic reconstruction from genomic epidemiological datasets, allowing the analysis of millions of closely related genomes. However, widespread homoplasies, due to recurrent mutations and sequence errors, cause phylogenetic uncertainty and biases. We present algorithms and models to substantially improve the computational performance and accuracy of pandemic-scale phylogenetics. In particular, we account for, and identify, mutation rate variation and recurrent sequence errors. We reconstruct a reliable and public sequence alignment and phylogenetic tree of >2 million severe acute respiratory syndrome coronavirus 2 genomes encapsulating the evolutionary history and global spread of the virus up to February 2023.
Keywords
Biotechnology, Biochemistry, Molecular Biology, Cell Biology
Citation
De Maio, N, Willemsen, M, Martin, S, Guo, Z, Saha, A, Hunt, M, Ly-Trong, N, Minh, B Q, Iqbal, Z & Goldman, N 2026, 'Rate variation and recurrent sequence errors in pandemic-scale phylogenetics', Nature Methods, vol. 23, no. 3, pp. 565-573. https://doi.org/10.1038/s41592-025-02932-8