Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain

Herrewijnen, Elize; Craandijk, Dennis F. W.

Files

paper2.pdf (1.01 MB)

Publication date

2023

Authors

Herrewijnen, Elize

Craandijk, Dennis

Editors

Lagioia, Francesca

Mumford, Jack

Odekerken, Daphne

Westermann, Hannes

Document Type

Part of book

Collections

Utrecht University Repository

License

cc_by

Abstract

Creating meaningful text embeddings using BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that, for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are completely retrained on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance. Instead, small adaptations to existing state-of-the-art language models like BERT may suffice.
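
To make the idea of extending a BERT-based tokenizer concrete, the following is a minimal, self-contained sketch rather than the authors' implementation. It assumes a toy WordPiece-style vocabulary and a hypothetical domain term ("applicant"); the helper names `wordpiece_tokenize` and `extend_vocab` are invented for illustration. It shows how a word that is fragmented into several subword pieces becomes a single token once it is added to the vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (WordPiece style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def extend_vocab(vocab, new_tokens):
    """Return a copy of vocab with unseen tokens appended, plus the count added."""
    extended = dict(vocab)
    added = 0
    for token in new_tokens:
        if token not in extended:
            extended[token] = len(extended)  # next free id
            added += 1
    return extended, added


# Toy vocabulary (an assumption for illustration, not a real BERT vocabulary).
base_vocab = {
    tok: i
    for i, tok in enumerate(["[UNK]", "app", "##lic", "##ant", "##s", "court", "the"])
}

before = wordpiece_tokenize("applicant", base_vocab)

# Hypothetical domain terms; "court" is already present and is not re-added.
extended_vocab, n_added = extend_vocab(base_vocab, ["applicant", "court"])
after = wordpiece_tokenize("applicant", extended_vocab)

print(before, after, n_added)
```

In a real model, each newly added token would also need a new row in the embedding matrix, and that row starts out untrained. This is one reason tokenizer extension is typically paired with further pre-training on domain text.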

Keywords

Transformers, BERT, Language Models, Legal Text Classification, ECtHR dataset, Text Embeddings

Citation

Herrewijnen, E & Craandijk, D F W 2023, Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain. in F Lagioia, J Mumford, D Odekerken & H Westermann (eds), Proceedings of the 6th Workshop on Automated Semantic Analysis of Information in Legal Text co-located with the 19th International Conference on Artificial Intelligence and Law (ICAIL 2023), Braga, Portugal, 23rd September, 2023. vol. 3441, CEUR Workshop Proceedings, CEUR WS, pp. 13-18, 19th International Conference on Artificial Intelligence and Law, Braga, Portugal, 19/06/23. <https://ceur-ws.org/Vol-3441/paper2.pdf>

URI

http://hdl.handle.net/1874/431789
