Decoding 16th-Century Letters: From Topic Models to GPT-Based Keyword Mapping

Publication date

2024-09-10

Authors

Ströbel, Phillip
Aderhold, Stefan
Roller, Ramona (ORCID 0000-0003-0146-4264)

Editors

Luz de Araujo, Pedro Henrique
Baumann, Andreas
Gromann, Dagmar
Krenn, Brigitte
Roth, Benjamin
Wiegand, Michael

Document Type

Part of book

License

cc_by

Abstract

Probabilistic topic models for categorising or exploring large text corpora are notoriously difficult to interpret. Making sense of them has thus justifiably been compared to “reading tea leaves.” Involving humans in labelling topics consisting of words is feasible but time-consuming, especially if one infers many topics from a text collection. Moreover, it is a cognitively demanding task, and domain knowledge might be required depending on the text corpus. We thus examine how using a Large Language Model (LLM) offers support in text classification. We compare how the LLM summarises topics produced by Latent Dirichlet Allocation, Non-negative Matrix Factorisation and BERTopic. We investigate which topic modelling technique provides the best representations by applying these models to a 16th-century correspondence corpus in Latin and Early New High German and inferring keywords from the topics in a low-resource setting. We experiment with including domain knowledge in the form of already existing keyword lists. Our main findings are that the LLM alone provides usable topics already. However, guiding the LLM towards what is expected benefits the interpretability. We further want to highlight that using nouns and proper nouns only makes for good topic representations.
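As a rough illustration of the difference between letting the LLM label a topic on its own and guiding it with an existing keyword list, the following minimal sketch builds both kinds of prompt from a topic's top words. The function name `build_topic_prompt`, the prompt wording, the `max_keywords` parameter and the toy words are all assumptions made for this example; they are not taken from the paper, and no actual LLM call is made.

```python
from typing import Optional, Sequence


def build_topic_prompt(
    topic_words: Sequence[str],
    known_keywords: Optional[Sequence[str]] = None,
    max_keywords: int = 3,
) -> str:
    """Build a prompt asking an LLM to describe one topic by keywords.

    Hypothetical helper: the wording is illustrative, not the paper's prompt.
    """
    lines = [
        "The following words characterise one topic from a corpus of "
        "16th-century letters:",
        ", ".join(topic_words),
    ]
    if known_keywords:
        # Guided variant: restrict the answer to an existing keyword list.
        lines.append(
            f"Choose up to {max_keywords} keywords from this list that best "
            "describe the topic:"
        )
        lines.append(", ".join(known_keywords))
    else:
        # Unguided variant: the model proposes keywords freely.
        lines.append(
            f"Suggest up to {max_keywords} keywords that best describe the topic."
        )
    return "\n".join(lines)


# Toy, invented topic words and keyword list (not from the paper's data).
topic = ["pestis", "morbus", "medicus", "febris"]
unguided = build_topic_prompt(topic)
guided = build_topic_prompt(topic, known_keywords=["illness", "war", "printing"])
print(guided)
```

The topic words would in practice come from whichever topic model is used (for example the highest-weighted nouns and proper nouns of a topic), and the resulting string would be sent to the LLM of choice.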

Citation

Ströbel, P, Aderhold, S & Roller, R 2024, Decoding 16th-Century Letters: From Topic Models to GPT-Based Keyword Mapping. in P H Luz de Araujo, A Baumann, D Gromann, B Krenn, B Roth & M Wiegand (eds), 20th Conference on Natural Language Processing (KONVENS 2024). Association for Computational Linguistics (ACL), Vienna, pp. 209–221. <https://aclanthology.org/2024.konvens-main.23/>