Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain

Publication date

2023

Authors

Herrewijnen, Elize (ORCID 0000-0002-2729-6599; ISNI 0000000523876731)
Craandijk, Dennis (ISNI 0000000492830166)

Editors

Lagioia, Francesca
Mumford, Jack
Odekerken, Daphne
Westermann, Hannes

Document Type

Part of book
Open Access

License

CC BY

Abstract

Creating meaningful text embeddings using BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are completely retrained on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance. Instead, small adaptations to existing state-of-the-art language models like BERT may suffice.

Keywords

Transformers, BERT, Language Models, Legal Text Classification, ECtHR dataset, Text Embeddings

Citation

Herrewijnen, E & Craandijk, D F W 2023, Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain. in F Lagioia, J Mumford, D Odekerken & H Westermann (eds), Proceedings of the 6th Workshop on Automated Semantic Analysis of Information in Legal Text co-located with the 19th International Conference on Artificial Intelligence and Law (ICAIL 2023), Braga, Portugal, 23rd September, 2023. vol. 3441, CEUR Workshop Proceedings, CEUR WS, pp. 13-18, 19th International Conference on Artificial Intelligence and Law, Braga, Portugal, 19/06/23. <https://ceur-ws.org/Vol-3441/paper2.pdf>, conference