Full Text or Abstract?: Examining Topic Coherence Scores Using Latent Dirichlet Allocation
Publication date
2017
Document Type
Part of book
License
Taverne
Abstract
This paper assesses topic coherence and human topic ranking of latent topics uncovered from scientific publications when applying the topic model latent Dirichlet allocation (LDA) to abstract and full-text data. The coherence of a topic, used as a proxy for topic quality, is based on the distributional hypothesis, which states that words with similar meaning tend to co-occur within a similar context. Although LDA has gained much attention from machine-learning researchers, most notably through its adaptations and extensions, little is known about the effects of different types of textual data on the generated topics. Our research is the first to explore these practical effects, and it shows that document frequency, document word length, and vocabulary size have mixed practical effects on the topic coherence and human topic ranking of LDA topics. We furthermore show that large document collections are less affected by incorrect or noisy terms appearing in the topic-word distributions, causing their topics to be more coherent and ranked higher. Differences between abstract and full-text data are more apparent within small document collections, where full-text data yields up to 90% high-quality topics, compared to 50% for abstract data.
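The coherence measure described above can be illustrated with the widely used UMass variant, which scores a topic's top words by how often they co-occur in the same documents. The sketch below is a toy illustration of that idea in pure Python, not the paper's exact evaluation pipeline; the function and variable names are our own.

```python
from math import log

def umass_coherence(topic_words, documents):
    """UMass topic coherence for one topic.

    topic_words: top words of the topic, ordered from most to least probable.
    documents:   iterable of sets of words, one set per document.

    Score = sum over ordered word pairs (w_i later, w_j earlier) of
    log((D(w_i, w_j) + 1) / D(w_j)), where D(.) counts documents
    containing all given words. Values closer to 0 indicate that the
    topic's words co-occur often, i.e. the topic is more coherent.
    """
    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for d in documents if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            # +1 smoothing avoids log(0) when the pair never co-occurs.
            score += log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

# Tiny example corpus: each document is a set of its word types.
docs = [{"model", "topic", "word"},
        {"topic", "word", "data"},
        {"model", "data"}]
print(umass_coherence(["topic", "word", "model"], docs))
```

In a real experiment one would compute this over the top-N words of each LDA topic against the training corpus; library implementations (e.g. coherence modules in topic-modeling toolkits) also support sliding-window variants of the co-occurrence counts.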
Keywords
Abstract, Full-text, Latent Dirichlet Allocation, Topic coherence, Human topic ranking, Taverne
Citation
Syed, S. & Spruit, M. 2017, 'Full Text or Abstract?: Examining Topic Coherence Scores Using Latent Dirichlet Allocation', in 4th IEEE International Conference on Data Science and Advanced Analytics, IEEE, Tokyo, pp. 165-174. https://doi.org/10.1109/DSAA.2017.61