RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content

Publication date

2021-07-09

Authors

Coutinho, Felipe Hernandes (ISNI 0000000517853488)
Zaragoza-Solas, Asier
López-Pérez, Mario
Barylski, Jakub
Zielezinski, Andrzej
Dutilh, Bas E. (ISNI 0000000389464735)
Edwards, Robert
Rodriguez-Valera, Francisco

Document Type

Article
Open Access

License

CC BY

Abstract

Culture-independent approaches have recently shed light on the genomic diversity of viruses of prokaryotes. One fundamental question when trying to understand their ecological roles is: which host do they infect? To tackle this issue, we developed a machine-learning approach named Random Forest Assignment of Hosts (RaFAH), which uses scores against 43,644 protein clusters to assign hosts to complete or fragmented genomes of viruses of Archaea and Bacteria. RaFAH displayed performance comparable with that of other methods for virus-host prediction in three different benchmarks encompassing viruses from RefSeq, single amplified genomes, and metagenomes. RaFAH was applied to assembled metagenomic datasets of uncultured viruses from eight different biomes of medical, biotechnological, and environmental relevance. Our analyses led to the identification of 537 sequences of archaeal viruses representing unknown lineages, whose genomes encode novel auxiliary metabolic genes, shedding light on how these viruses interfere with the host molecular machinery. RaFAH is available at https://sourceforge.net/projects/rafah/.

Keywords

DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem; host prediction; machine learning; random forest; viral diversity; viral ecology; virome; virus; virus-host associations; General Decision Sciences

Citation

Coutinho, F H, Zaragoza-Solas, A, López-Pérez, M, Barylski, J, Zielezinski, A, Dutilh, B E, Edwards, R & Rodriguez-Valera, F 2021, 'RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content', Patterns (New York, N.Y.), vol. 2, no. 7, 100274, pp. 1-9. https://doi.org/10.1016/j.patter.2021.100274