Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models

Steging, Cor; Renooij, Silja; Verheij, Bart

doi:https://doi.org/10.1145/3769126.3769230

Files

3769126.3769230.pdf (766.48 KB)

Publication date

2026-01-13

Authors

Steging, Cor

Renooij, Silja

Verheij, Bart

Editors

Maranhão, Juliano

DOI

https://doi.org/10.1145/3769126.3769230

Document Type

Part of book

Collections

Utrecht University Repository

License

CC BY

Abstract

Generative large language models, used as tools in the legal domain, have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, and hence they cannot be responsibly applied in the domains of law and evidence. This paper presents reasoning benchmarks that are dynamically varied, scalable in complexity, and formally unambiguous in their interpretation. We illustrate the approach using witness testimony, focusing on the underlying argument attack structure. We dynamically generate both linear and non-linear argument attack graphs of varying complexity and translate them into reasoning puzzles about witness testimony expressed in natural language. We show that state-of-the-art large language models often fail at these reasoning puzzles, even at low complexity. The models make obvious mistakes, and their inconsistent performance indicates that their reasoning capabilities are brittle. Furthermore, at higher complexity, even state-of-the-art models specifically designed for reasoning make mistakes. We show the viability of using a parameterized benchmark with varying complexity to evaluate the reasoning capabilities of generative language models, which contributes to a better understanding of the limitations of these models.
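The pipeline summarized in the abstract (generate an attack graph, derive its unambiguous answer, render it as a witness puzzle) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the puzzle wording, the single `n` complexity parameter, and the use of Dung-style grounded semantics to decide which testimonies are accepted.

```python
from typing import List, Set, Tuple

Attack = Tuple[str, str]  # (attacker, target); hypothetical representation


def linear_chain(n: int) -> Tuple[List[str], List[Attack]]:
    """Build witnesses W1..Wn in which W(i+1) attacks W(i)."""
    args = [f"W{i}" for i in range(1, n + 1)]
    attacks = [(args[i + 1], args[i]) for i in range(n - 1)]
    return args, attacks


def grounded_extension(args: List[str], attacks: List[Attack]) -> Set[str]:
    """Return the accepted arguments under grounded semantics (assumed here)."""
    attackers = {a: set() for a in args}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted: Set[str] = set()
    rejected: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for a in args:
            if a in accepted or a in rejected:
                continue
            if attackers[a] <= rejected:
                # Every attacker is already rejected (or there is none).
                accepted.add(a)
                changed = True
            elif attackers[a] & accepted:
                # At least one accepted argument attacks this one.
                rejected.add(a)
                changed = True
    return accepted


def to_puzzle(args: List[str], attacks: List[Attack]) -> str:
    """Render the graph as a natural-language puzzle (illustrative wording)."""
    lines = [f"Witness {src} says that witness {dst} is lying." for src, dst in attacks]
    lines.append(f"Can the testimony of witness {args[0]} be accepted?")
    return " ".join(lines)


if __name__ == "__main__":
    args, attacks = linear_chain(4)
    print(to_puzzle(args, attacks))
    print("W1 accepted:", "W1" in grounded_extension(args, attacks))
```

In this sketch, increasing `n` lengthens the chain and so raises the puzzle's complexity, while the expected answer stays mechanically checkable. A non-linear graph can be passed to `grounded_extension` as an arbitrary attack list in the same way.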

Keywords

LLMs, argumentation, benchmarks, generative AI, reasoning, Artificial Intelligence, Software, Law

Citation

Steging, C, Renooij, S & Verheij, B 2026, Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models. in J Maranhão (ed.), Proceedings of the Twentieth International Conference on Artificial Intelligence and Law. Association for Computing Machinery, pp. 455-459, International Conference on Artificial Intelligence and Law, Chicago, United States, 16/06/25. https://doi.org/10.1145/3769126.3769230

URI

https://dspace.library.uu.nl/handle/1874/483413
