The more the better? The effect of domain-specific dataset on entity extraction from Dutch criminal records

Norder, Amber; Sogancioglu, Gizem; Kaya, Heysem

Publication date

2021

Authors

Norder, Amber

Sogancioglu, Gizem

Kaya, Heysem

Document Type

Contribution to conference

Collections

Utrecht University Repository

Abstract

The Dutch police force generates a very large volume of documents, such as transcripts of interrogations, evidence findings, and statements of people involved, all of which need to be read and processed by analysts. Automating entity extraction from these documents would greatly help the police force. Neural network-based approaches using contextual word embeddings are considered the current state of the art for named entity recognition (NER) in Dutch. Domain-independent NER datasets and pre-trained NER models are available in the literature. However, earlier studies show that domain-independent models do not work well for domain-specific tasks. As annotation is highly costly, in this study we train a set of NER models based on BERTje embeddings, adding varying amounts of police data to the domain-independent training set to observe the effect of domain-specific data on training. We follow a training, validation, and test split to ensure a proper experimental protocol. We observe that the performance gain diminishes as the number of target-domain documents in the training set grows, and that performance on the validation set stabilizes at around 250-300 documents. The NER system performs better on the held-out test set (85% macro-average F1 score over five entity categories) than on the validation set, showing the generalization power of the investigated framework.
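
For readers unfamiliar with the metric reported in the abstract, the sketch below shows how a macro-average F1 score is typically computed: F1 is calculated per entity category and the per-category scores are then averaged with equal weight. It is not the authors' evaluation code. The category names (`CAT_A`, `CAT_B`) and the counts are hypothetical placeholders, since the abstract does not list the five categories or their counts.

```python
from typing import Dict, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for one category from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: a category with no predictions and no gold
    # entities scores 0.0 rather than raising a division error.
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-category F1 scores."""
    if not counts:
        raise ValueError("at least one category is required")
    scores = [f1_from_counts(tp, fp, fn) for tp, fp, fn in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical counts for two placeholder categories (not data from the paper).
example_counts = {
    "CAT_A": (8, 2, 2),  # tp, fp, fn
    "CAT_B": (5, 5, 5),
}

if __name__ == "__main__":
    print(f"macro-F1: {macro_f1(example_counts):.2f}")
```

Because every category carries equal weight, a rare category with a low F1 pulls the macro average down as much as a frequent one would, which is why the metric is often preferred when entity categories are imbalanced.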

Keywords

Natural Language Processing, Named Entity Recognition, Coreference Resolution, Dutch NLP

Citation

Norder, A, Sogancioglu, G & Kaya, H 2021, 'The more the better? The effect of domain-specific dataset on entity extraction from Dutch criminal records', Paper presented at The 31st Meeting of Computational Linguistics in The Netherlands, Ghent, Belgium, 9/07/21.

URI

https://dspace.library.uu.nl/handle/1874/431766
