A robust unsupervised method for outlier set detection

Publication date

2025-11-04

Authors

Sarfraz, Amal (ORCID 0000-0001-6554-4920)
Birnbaum, Abigail
Dolan, Flannery
Lamontagne, Jonathan
Mihaylova, Lyudmila
Rougé, Charles

Document Type

Article
Open Access

License

Taverne

Abstract

This paper proposes a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls “outlier sets”, while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI), employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets based on cluster weights below a hyperparameter threshold. Second, OSTI measures the Inter-cluster Mahalanobis distance between each candidate outlier set's centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show OSTI robustly and consistently detects outlier sets with an average F1 score of 0.92 and an average purity (the degree to which outlier sets identified correspond to those generated synthetically, i.e., our ground truth) of 98.58%. We also compare OSTI with state-of-the-art outlier detection methods to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
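
The two steps described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the Gaussian mixture has already been fitted elsewhere (for example with scikit-learn's `GaussianMixture`, whose `weights_` and `means_` attributes would supply the inputs), that the data are 2D as in the paper's synthetic experiments, and that the squared Mahalanobis distance is compared against a chi-square distribution with 2 degrees of freedom, whose survival function is exactly `exp(-x / 2)`. The function names, the `weight_threshold` and `alpha` defaults, and the toy data are all hypothetical.

```python
import math

def mean_and_cov_2d(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def squared_mahalanobis_2d(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def chi2_sf_2dof(x):
    """Survival function of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def flag_outlier_sets(points, weights, centroids,
                      weight_threshold=0.1, alpha=0.05):
    """Return indices of mixture components flagged as outlier sets.

    Step 1: keep components whose weight is below weight_threshold.
    Step 2: flag a candidate when the chi-square p-value of its centroid's
            squared Mahalanobis distance to the dataset mean is below alpha.
    Both default values are hypothetical, chosen only for this sketch.
    """
    mean, cov = mean_and_cov_2d(points)
    flagged = []
    for k, (w, c) in enumerate(zip(weights, centroids)):
        if w >= weight_threshold:
            continue  # large cluster: treated as inliers, never a candidate
        d2 = squared_mahalanobis_2d(c, mean, cov)
        if chi2_sf_2dof(d2) < alpha:
            flagged.append(k)
    return flagged

# Toy data: a dense inlier grid, a small far-away set, a small central set.
inliers = [(float(x), float(y)) for x in range(-5, 6) for y in range(-5, 6)]
far_set = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (19.0, 20.0), (20.0, 19.0)]
near_set = [(0.5, 0.5), (0.6, 0.5), (0.4, 0.5), (0.5, 0.6), (0.5, 0.4)]
points = inliers + far_set + near_set

# Stand-ins for what a fitted mixture model would report.
n = len(points)
weights = [len(inliers) / n, len(far_set) / n, len(near_set) / n]
centroids = [(0.0, 0.0), (20.0, 20.0), (0.5, 0.5)]

flagged = flag_outlier_sets(points, weights, centroids)
print(flagged)  # prints [1]
```

In this toy example both small components pass the weight filter, but only the far-away one is flagged: the small central component has a centroid close to the dataset mean, so its distance is unremarkable under the chi-square reference. Because the mean and covariance are estimated from data that include the candidate sets, the chi-square reference is only approximate here.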

Keywords

Gaussian mixture models, Inter-cluster Mahalanobis distance, Outlier set two-step identification (OSTI), Outlier sets, Taverne, Management Information Systems, Software, Information Systems and Management, Artificial Intelligence

Citation

Sarfraz, A, Birnbaum, A, Dolan, F, Lamontagne, J, Mihaylova, L & Rougé, C 2025, 'A robust unsupervised method for outlier set detection', Knowledge-Based Systems, vol. 329, 114274. https://doi.org/10.1016/j.knosys.2025.114274