Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

Song, Yingjin; Paperno, Denis; Gatt, Albert

Files

2024.inlg-main.32.pdf (2.13 MB)

Publication date

2024

Authors

Song, Yingjin

Paperno, Denis

Gatt, Albert

Document Type

Contribution to conference

Collections

Utrecht University Repository

License

CC BY

Abstract

Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, only training a lightweight vision-language mapping network to connect modalities, while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.

Citation

Song, Y, Paperno, D & Gatt, A 2024, 'Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning', paper presented at the 17th International Natural Language Generation Conference, Tokyo, Japan, 23/09/24 - 27/09/24, pp. 384-401. <https://aclanthology.org/2024.inlg-main.32>

URI

https://dspace.library.uu.nl/handle/1874/463136
