DetCat: Detecting Categorical Outliers in Relational Datasets

Publication date

2024-10-21

Authors

Zylinski, Arthur
Qahtan, A.A.A. (ORCID 0000-0001-8254-1764; ISNI 0000000492915493)

Document Type

Part of book
Open Access

License

Taverne

Abstract

Poor data quality significantly affects data analytics tasks, leading to inaccurate decisions and poor predictions by machine learning models. Outliers are among the most common data glitches that degrade data quality. While outlier detection in numerical data has been studied extensively, few attempts have been made to detect categorical outliers. In this paper, we introduce DetCat, a tool that detects categorical outliers in relational datasets by exploiting the syntactic structure of the values. For a given attribute, DetCat identifies a set of patterns that represent the majority of the values and treats them as the dominating patterns. Data values that cannot be generated by the dominating patterns are declared outliers. The demo will show the effectiveness of our tool in detecting categorical outliers and discovering syntactic data patterns.
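
The idea in the abstract (generalize values into syntactic patterns, keep the patterns that cover most values, flag the rest) can be illustrated with a minimal Python sketch. This is not DetCat's implementation: the pattern encoding (`D` for digits, `U`/`L` for upper/lower-case letters, `+` for runs), the `coverage` threshold, and the function names `to_pattern` and `find_outliers` are all assumptions made for illustration, since the abstract does not specify the pattern language or how dominating patterns are selected.

```python
from collections import Counter
from itertools import groupby


def to_pattern(value: str) -> str:
    """Generalize a value into a coarse syntactic pattern (hypothetical encoding)."""
    def char_class(ch: str) -> str:
        if ch.isdigit():
            return "D"
        if ch.isalpha():
            return "U" if ch.isupper() else "L"
        return ch  # punctuation and whitespace are kept literally

    parts = []
    # Collapse runs of the same class, e.g. "2024" -> "D+", "oise" -> "L+"
    for symbol, run in groupby(char_class(ch) for ch in value):
        length = len(list(run))
        if symbol in ("D", "U", "L") and length > 1:
            parts.append(symbol + "+")
        else:
            parts.append(symbol * length)
    return "".join(parts)


def find_outliers(values, coverage=0.8):
    """Return (dominating patterns, outlier values) for one attribute.

    Assumption: patterns are added in order of frequency until they cover
    at least `coverage` of the values; everything else is an outlier.
    """
    patterns = [to_pattern(v) for v in values]
    dominating, covered = set(), 0
    for pattern, count in Counter(patterns).most_common():
        if covered >= coverage * len(values):
            break
        dominating.add(pattern)
        covered += count
    outliers = [v for v, p in zip(values, patterns) if p not in dominating]
    return dominating, outliers


dates = ["2024-10-21", "2023-01-05", "2022-12-31", "2021-07-04", "21/10/24"]
dominating, outliers = find_outliers(dates, coverage=0.75)
print(dominating)  # {'D+-D+-D+'}
print(outliers)    # ['21/10/24']
```

In this toy column, four of the five values share the pattern `D+-D+-D+`, so it alone reaches the assumed 75% coverage, and the slash-separated date is reported as the outlier because that pattern cannot generate it.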

Keywords

categorical values, outliers, similarity metrics, syntactic structure, Taverne, General Business, Management and Accounting, General Decision Sciences

Citation

Zylinski, A & Qahtan, A A 2024, DetCat: Detecting Categorical Outliers in Relational Datasets. in CIKM 2024 - Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery, pp. 5318-5322, 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024, Boise, United States, 21/10/24. https://doi.org/10.1145/3627673.3679212