How and where does CLIP process negation?
Publication date
2024-08
Document Type
Part of book
License
cc_by
Abstract
Various benchmarks have been proposed to test linguistic understanding in pre-trained vision & language (VL) models. Here we build on the existence task from the VALSE benchmark (Parcalabescu et al., 2022) which we use to test models’ understanding of negation, a particularly interesting issue for multimodal models. However, while such VL benchmarks are useful for measuring model performance, they do not reveal anything about the internal processes through which these models arrive at their outputs in such visio-linguistic tasks. We take inspiration from the growing literature on model interpretability to explain the behaviour of VL models on the understanding of negation. Specifically, we approach these questions through an in-depth analysis of the text encoder in CLIP (Radford et al., 2021), a highly influential VL model. We localise parts of the encoder that process negation and analyse the role of attention heads in this task. Our contributions are threefold. We demonstrate how methods from the language model interpretability literature (such as causal tracing) can be translated to multimodal models and tasks; we provide concrete insights into how CLIP processes negation on the VALSE existence task; and we highlight inherent limitations in the VALSE dataset as a benchmark for linguistic understanding.
Keywords
Language and Linguistics, Computer Science Applications, Software, Ophthalmology, Linguistics and Language
Citation
Quantmeyer, V, Mosteiro Romero, P & Gatt, A 2024, How and where does CLIP process negation? in ALVR 2024. Association for Computational Linguistics, pp. 59-72, Advances in Language and Vision Research (ALVR), Bangkok, Thailand, 16/08/24. <https://aclanthology.org/2024.alvr-1.5>, workshop