Progressive Entity Resolution: A Design Space Exploration
Publication date
2025-03-11
Document Type
Preprint (working paper)
License
CC BY
Abstract
Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require a progressive approach that produces results in a pay-as-you-go fashion. Numerous algorithms have been proposed for Progressive ER in the literature. In this work, we propose a novel framework for Progressive Entity Resolution that organizes relevant techniques into four consecutive steps: (i) filtering, which reduces the search space to the most likely candidate matches, (ii) weighting, which associates every pair of candidate matches with a similarity score, (iii) scheduling, which prioritizes the execution of the candidate matches so that the real duplicates precede the non-matching pairs, and (iv) matching, which applies a complex matching function to the pairs in the order defined by the previous step. We associate each step with existing and novel techniques, illustrating that our framework generates a superset of the main existing works in the field. We select the most representative combinations resulting from our framework and fine-tune them over 10 established datasets for Record Linkage and 8 for Deduplication. Our results indicate that our taxonomy yields a wide range of high-performing progressive techniques in terms of both effectiveness and time efficiency.
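The four steps named in the abstract can be illustrated with a toy Deduplication pipeline. This is only a minimal sketch under stated assumptions, not the techniques evaluated in the paper: token blocking stands in for filtering, Jaccard similarity for weighting, a plain descending sort for scheduling, and a `difflib` string comparison for the expensive matching function. All function names, the sample records, the `budget` parameter, and the 0.8 threshold are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records (id -> description); not from the paper's datasets.
records = {
    1: "Apple iPhone 13 128GB",
    2: "apple iphone 13 128gb",
    3: "Samsung Galaxy S21",
    4: "Dell XPS 13 laptop",
}

def tokens(text):
    """Lowercased whitespace tokens of a record."""
    return set(text.lower().split())

def filter_candidates(records):
    """Step (i), filtering: keep only pairs that share at least one token."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for tok in tokens(text):
            blocks[tok].add(rid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates

def weight(records, pair):
    """Step (ii), weighting: Jaccard similarity of the two token sets."""
    a, b = (tokens(records[rid]) for rid in pair)
    return len(a & b) / len(a | b)

def schedule(records, candidates):
    """Step (iii), scheduling: highest-weighted pairs first, ties by id."""
    return sorted(candidates, key=lambda p: (-weight(records, p), p))

def expensive_match(records, pair, threshold=0.8):
    """Step (iv), matching: stand-in for a costly matching function."""
    a, b = (records[rid].lower() for rid in pair)
    return SequenceMatcher(None, a, b).ratio() >= threshold

def progressive_er(records, budget):
    """Emit matches in scheduled order, running at most `budget` comparisons."""
    ordered = schedule(records, filter_candidates(records))
    for pair in ordered[:budget]:
        if expensive_match(records, pair):
            yield pair

print(list(progressive_er(records, budget=1)))
```

Because the generator yields matches as soon as they are confirmed, a caller can stop consuming it whenever its time or compute budget runs out, which is the pay-as-you-go behaviour the abstract describes. With this toy data, the likely duplicate pair is scheduled first, so even a budget of a single comparison is enough to find it.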
Citation
Maciejewski, J, Nikoletos, K, Papadakis, G & Velegrakis, Y 2025 'Progressive Entity Resolution: A Design Space Exploration' arXiv. https://doi.org/10.48550/arXiv.2503.08298