Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization
Publication date
2021-11
Authors
Riccardo Bassani
Anders Søgaard
Document Type
Part of book
License
cc_by
Abstract
Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.
Citation
Riccardo Bassani & Anders Søgaard 2021, Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. in Proceedings of the Conference on Empirical Methods in Natural Language Processing: 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics, pp. 32-40. https://doi.org/10.18653/v1/2021.mrl-1.3