Decoding 16th-Century Letters: From Topic Models to GPT-Based Keyword Mapping

Ströbel, Phillip; Aderhold, Stefan; Roller, Ramona

Files

2024.konvens-main.23.pdf (488.82 KB)

Publication date

2024-09-10

Authors

Ströbel, Phillip

Aderhold, Stefan

Roller, Ramona

Editors

Luz de Araujo, Pedro Henrique

Baumann, Andreas

Gromann, Dagmar

Krenn, Brigitte

Roth, Benjamin

Wiegand, Michael

Document Type

Part of book

Collections

Utrecht University Repository

License

cc_by

Abstract

Probabilistic topic models for categorising or exploring large text corpora are notoriously difficult to interpret. Making sense of them has thus justifiably been compared to “reading tea leaves.” Involving humans in labelling topics consisting of words is feasible but time-consuming, especially if one infers many topics from a text collection. Moreover, it is a cognitively demanding task, and domain knowledge might be required depending on the text corpus. We thus examine how using a Large Language Model (LLM) offers support in text classification. We compare how the LLM summarises topics produced by Latent Dirichlet Allocation, Non-negative Matrix Factorisation and BERTopic. We investigate which topic modelling technique provides the best representations by applying these models to a 16th-century correspondence corpus in Latin and Early New High German and inferring keywords from the topics in a low-resource setting. We experiment with including domain knowledge in the form of already existing keyword lists. Our main findings are that the LLM alone provides usable topics already. However, guiding the LLM towards what is expected benefits the interpretability. We further want to highlight that using nouns and proper nouns only makes for good topic representations.

Citation

Ströbel, P, Aderhold, S & Roller, R 2024, Decoding 16th-Century Letters: From Topic Models to GPT-Based Keyword Mapping. in P H Luz de Araujo, A Baumann, D Gromann, B Krenn, B Roth & M Wiegand (eds), 20th Conference on Natural Language Processing (KONVENS 2024). Association for Computational Linguistics (ACL), Vienna, pp. 209–221. < https://aclanthology.org/2024.konvens-main.23/ >

URI

https://dspace.library.uu.nl/handle/1874/482636
