The more the better? The effect of domain-specific dataset on entity extraction from Dutch criminal records

Publication date

2021

Authors

Norder, Amber
Sogancioglu, Gizem (ISNI 0000000493066008)
Kaya, Heysem (ORCID 0000-0001-7947-5508, ISNI 000000049289651X)

Document Type

Contribution to conference

Abstract

The Dutch police force generates very large numbers of documents, such as transcripts of interrogations, evidence findings, and statements of people involved, all of which need to be read and processed by analysts. Automating entity extraction from these documents would greatly help the police force. Neural network-based approaches using contextual word embeddings are considered the current state of the art for named entity recognition (NER) in Dutch. Domain-independent NER datasets as well as pre-trained NER models are available in the literature. However, earlier studies show that domain-independent models do not work well for domain-specific tasks. As annotation is highly costly, in this study we train a set of NER models based on BERTje embeddings, adding varying amounts of police data to the domain-independent set to observe the effect of the domain-specific dataset in training. We follow a training, validation, and test split to ensure a proper experimental protocol. We observe that the performance gain diminishes as the number of target-domain documents in the training set grows, and that performance on the validation set stabilizes around 250-300 documents. The NER system performs better on the held-out test set (85% macro-average F1 score over five entity categories) than on the validation set, showing the generalization power of the investigated framework.
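
The macro-average F1 score mentioned in the abstract is the unweighted mean of the per-category F1 scores, so every entity category counts equally regardless of how often it occurs. The following is a minimal sketch of that computation, not the authors' evaluation code: the function names, the category labels, and the counts are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: a category with no predictions and no
    # gold entities gets a score of 0.0 instead of raising an error.
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of the per-category F1 scores."""
    if not counts:
        raise ValueError("at least one category is required")
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts; the labels are placeholders and are
# not the five entity categories used in the paper.
toy_counts = {
    "PER": (8, 2, 2),
    "LOC": (5, 5, 5),
}

print(f"macro-average F1: {macro_f1(toy_counts):.2f}")
```

Because the mean is unweighted, a rare category with a low score lowers the macro average as much as a frequent one would, which makes it a common choice when entity categories are imbalanced.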

Keywords

Natural Language Processing, Named Entity Recognition, Coreference Resolution, Dutch NLP

Citation

Norder, A, Sogancioglu, G & Kaya, H 2021, 'The more the better? The effect of domain-specific dataset on entity extraction from Dutch criminal records', Paper presented at The 31st Meeting of Computational Linguistics in The Netherlands, Ghent, Belgium, 9/07/21 - 9/07/21.