Challenges in Reproducing Human Evaluation Results for Role-Oriented Dialogue Summarization

Publication date

2023-08-15

Authors

Ito, Takumi (ISNI 0000000523804922)
Fang, Qixiang (ORCID 0000-0003-2689-6653; ISNI 0000000493063739)
Mosteiro Romero, Pablo (ORCID 0000-0001-7231-2773; ISNI 0000000493075828)
Gatt, Albert (ORCID 0000-0001-6388-8244; ISNI 0000000048277966)
van Deemter, Kees (ISNI 0000000115590531)

Document Type

Part of book

License

CC BY

Abstract

There is a growing concern regarding the reproducibility of human evaluation studies in NLP. As part of the ReproHum campaign, we conducted a study to assess the reproducibility of a recent human evaluation study in NLP. Specifically, we attempted to reproduce a human evaluation of a novel approach to enhance Role-Oriented Dialogue Summarization by considering the influence of role interactions. Despite our best efforts to adhere to the reported setup, we were unable to reproduce the statistical results as presented in the original paper. While no contradictory evidence was found, our study raises questions about the validity of the reported statistical significance results, and/or the comprehensiveness with which the original study was reported. In this paper, we provide a comprehensive account of our reproduction study, detailing the methodologies employed, data collection, and analysis procedures. We discuss the implications of our findings for the broader issue of reproducibility in NLP research. Our findings serve as a cautionary reminder of the challenges in conducting reproducible human evaluations and prompt further discussions within the NLP community.

Citation

Ito, T, Fang, Q, Mosteiro Romero, P, Gatt, A & van Deemter, K 2023, Challenges in Reproducing Human Evaluation Results for Role-Oriented Dialogue Summarization. in The 3rd Workshop on Human Evaluation of NLP Systems (HumEval’23). Association for Computational Linguistics. <https://aclanthology.org/2023.humeval-1.9>