Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

Eshuijs, Leon; Wang, Shihan; Fokkens, Antske

Files

2025.conll-1.8.pdf (2.39 MB)

Publication date

2025-07-01

Authors

Eshuijs, Leon

Wang, Shihan

Fokkens, Antske

Document Type

Part of book

Collections

Utrecht University Repository

License

cc_by

Abstract

Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
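
The idea of selectively deactivating attention heads can be illustrated with a toy example. The sketch below is not the authors' implementation and does not reproduce HTA. It assumes a single self-attention layer with random weights and assumes that "deactivating" a head means multiplying its output by zero (zero-ablation); the paper may use a different procedure. All names (`multi_head_attention`, `head_mask`, the weight shapes) are hypothetical and chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask=None):
    """Toy self-attention over x of shape (seq, d_model).

    w_q, w_k, w_v have shape (n_heads, d_model, d_head);
    w_o has shape (n_heads, d_head, d_model).
    head_mask is an (n_heads,) array of 0/1; a 0 zero-ablates that head
    (an assumed form of "deactivation", not taken from the paper).
    Returns the layer output and each head's separate contribution.
    """
    n_heads, _, d_head = w_q.shape
    if head_mask is None:
        head_mask = np.ones(n_heads)
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    per_head = np.einsum("hst,hte->hse", attn, v)
    # Project each head separately so its contribution can be masked.
    contrib = np.einsum("hse,hed->hsd", per_head, w_o)
    output = (contrib * head_mask[:, None, None]).sum(axis=0)
    return output, contrib

# Random toy weights: 2 heads, 5 tokens, model width 8.
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 8, 2, 4
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(n_heads, d_head, d_model))

full, contrib = multi_head_attention(x, w_q, w_k, w_v, w_o)
# Deactivate head 1 only; head 0 is left untouched.
ablated, _ = multi_head_attention(
    x, w_q, w_k, w_v, w_o, head_mask=np.array([1.0, 0.0])
)
```

Because the layer output is a sum of per-head contributions, the difference `full - ablated` equals the contribution of the deactivated head, so in this toy setting the effect of removing one head can be isolated without changing the others.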

Citation

Eshuijs, L, Wang, S & Fokkens, A 2025, Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification. In Proceedings of the 29th Conference on Computational Natural Language Learning. Association for Computational Linguistics (ACL), Vienna, Austria, pp. 105-125. <https://aclanthology.org/2025.conll-1.8/>

URI

https://dspace.library.uu.nl/handle/1874/483015
