Improved Sentence Alignment for Building a Parallel Subtitle Corpus : Building a Multilingual Parallel Subtitle Corpus

Tiedemann, Jörg

Files

bookpart.pdf (384.72 KB)

Publication date

2007-10

Authors

Tiedemann, Jörg

Document Type

Part of book or chapter of book

Collections

LOTOS

Abstract

This paper presents ongoing work on creating an extensive multilingual parallel corpus of movie subtitles. The corpus currently contains roughly 23,000 pairs of aligned subtitles covering about 2,700 movies in 29 languages. Subtitles consist mainly of transcribed speech, sometimes in a very condensed form. Insertions, deletions and paraphrases are very frequent, which makes subtitles a challenging data set to work with, especially for automatic sentence alignment. Standard alignment approaches rely on translation consistency in terms of length, term translations, or a combination of both. We show that these approaches are not applicable to subtitles and propose a new alignment approach, based on time overlaps, designed specifically for subtitles. In our experiments we obtain a significant improvement in alignment accuracy compared to standard length-based approaches.
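
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what alignment by time overlap could look like, not the method from the paper. It assumes each subtitle is a cue with start and end times in seconds. The names `Cue`, `overlap`, `align_by_overlap` and the `min_ratio` threshold are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle entry; times are in seconds (assumed representation)."""
    start: float
    end: float
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_overlap(src, tgt, min_ratio=0.5):
    """Link each source cue to the target cue it overlaps with the most.

    A link is kept only if the shared time covers at least `min_ratio`
    of the source cue's duration (an illustrative threshold, not a value
    taken from the paper). Returns a list of (source_index, target_index).
    """
    links = []
    for i, s in enumerate(src):
        best_j, best = None, 0.0
        for j, t in enumerate(tgt):
            shared = overlap(s, t)
            if shared > best:
                best_j, best = j, shared
        duration = s.end - s.start
        if best_j is not None and duration > 0 and best / duration >= min_ratio:
            links.append((i, best_j))
    return links
```

Unlike length-based alignment, this sketch never inspects the text itself: a source cue with no temporal counterpart in the target file simply stays unlinked, which is one plausible way to tolerate the insertions and deletions mentioned above.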

URI

https://dspace.library.uu.nl/handle/1874/296753
