Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain
Publication date
2023
Editors
Lagioia, Francesca
Mumford, Jack
Odekerken, Daphne
Westermann, Hannes
Document Type
Part of book
License
cc_by
Abstract
Creating meaningful text embeddings using BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are completely retrained on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance. Instead, small adaptations to existing state-of-the-art language models like BERT may suffice.
Keywords
Transformers, BERT, Language Models, Legal Text Classification, ECtHR dataset, Text Embeddings
Citation
Herrewijnen, E & Craandijk, D F W 2023, Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain. in F Lagioia, J Mumford, D Odekerken & H Westermann (eds), Proceedings of the 6th Workshop on Automated Semantic Analysis of Information in Legal Text co-located with the 19th International Conference on Artificial Intelligence and Law (ICAIL 2023), Braga, Portugal, 23rd September, 2023. vol. 3441, CEUR Workshop Proceedings, CEUR WS, pp. 13-18, 19th International Conference on Artificial Intelligence and Law, Braga, Portugal, 19/06/23. <https://ceur-ws.org/Vol-3441/paper2.pdf>