DetCat: Detecting Categorical Outliers in Relational Datasets
Publication date
2024-10-21
Document Type
Part of book
License
taverne
Abstract
Poor data quality undermines many data analytics tasks, leading to inaccurate decisions and poor predictions by machine learning models. Outliers are among the most common data glitches that degrade data quality. While outlier detection in numerical data has been studied extensively, few attempts have been made to detect categorical outliers. In this paper, we introduce DetCat, a tool that detects categorical outliers in relational datasets by exploiting the syntactic structure of the values. For a given attribute, DetCat identifies the set of patterns that together represent the majority of the values; these are the dominating patterns. Data values that cannot be generated by any dominating pattern are declared outliers. The demo shows the effectiveness of the tool in detecting categorical outliers and discovering syntactic data patterns.
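The idea of dominating patterns can be illustrated with a minimal Python sketch. It is not the DetCat implementation: the pattern encoding (`D+` for digits, `U+` for uppercase letters, `L+` for lowercase letters, other characters kept literally), the function names `to_pattern` and `find_outliers`, and the `coverage` threshold of 0.8 are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import groupby


def to_pattern(value: str) -> str:
    """Map a value to a coarse syntactic pattern (hypothetical encoding)."""
    def char_class(ch: str) -> str:
        if ch.isdigit():
            return "D"
        if ch.isalpha():
            return "U" if ch.isupper() else "L"
        return ch  # keep punctuation and whitespace literally

    classes = [char_class(ch) for ch in value]
    # Collapse runs of the same character class: "2024" and "24" both give "D+".
    # Runs of a literal character are kept at their original length.
    return "".join(
        f"{cls}+" if cls in ("D", "U", "L") else cls * len(list(run))
        for cls, run in groupby(classes)
    )


def find_outliers(values, coverage=0.8):
    """Return (dominating patterns, values not matching any of them).

    Patterns are added in order of frequency until they cover at least
    `coverage` of the values (an assumed stopping rule).
    """
    patterns = [to_pattern(v) for v in values]
    counts = Counter(patterns)
    dominating, covered = set(), 0
    for pattern, count in counts.most_common():
        if covered >= coverage * len(values):
            break
        dominating.add(pattern)
        covered += count
    outliers = [v for v, p in zip(values, patterns) if p not in dominating]
    return dominating, outliers


if __name__ == "__main__":
    column = ["2024-10-21", "2023-01-05", "2022-12-31", "2021-11-11", "Oct 21, 2024"]
    print(find_outliers(column))
```

In this sketch, a column of mostly ISO-style dates yields a single dominating pattern, and a value written in a different date format is flagged because its pattern falls outside the dominating set.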
Keywords
categorical values, outliers, similarity metrics, syntactic structure, Taverne, General Business, Management and Accounting, General Decision Sciences
Citation
Zylinski, A & Qahtan, A A 2024, DetCat: Detecting Categorical Outliers in Relational Datasets. in CIKM 2024 - Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery, pp. 5318-5322, 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024, Boise, United States, 21/10/24. https://doi.org/10.1145/3627673.3679212, conference