Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Publication date

2021-11

Authors

Riccardo Bassani
Anders Søgaard

Document Type

Part of book
Open Access

License

CC BY

Abstract

Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.

Citation

Riccardo Bassani & Anders Søgaard 2021, Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. in Proceedings of the Conference on Empirical Methods in Natural Language Processing: 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics, pp. 32-40. https://doi.org/10.18653/v1/2021.mrl-1.3