DOI: 10.3724/SP.J.1249.2019.01033

Journal of Shenzhen University Science and Engineering (深圳大学学报理工版), 2019, 36(1): 33-42

An unsupervised outlier detection algorithm for categorical matrix-object data

Outlier detection is an important branch of data mining that aims to find the objects in a data set that differ significantly from most other objects. In this paper, we define the cohesion degree of a matrix-object itself and its coupling degree with other matrix-objects, use them to define the outlier factor of a matrix-object, and propose an outlier detection algorithm for categorical matrix-object data based on this factor. Experimental results on three real data sets (Market basket, Microsoft web, and MovieLens) show that the proposed algorithm detects outliers in matrix-object data sets effectively when compared with the common-neighbor-based (CNB), local outlier factor (LOF), and information entropy-based (IE-based) algorithms.
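To make the idea of scoring a matrix-object by its cohesion and coupling concrete, the following is a minimal sketch and not the paper's method. The abstract does not give the formulas, so everything here is an assumption: a matrix-object is modeled as a list of categorical records, similarity is simple attribute matching, and the hypothetical functions `cohesion`, `coupling`, and `outlier_scores` combine the two degrees with an arbitrary equal weighting.

```python
from itertools import combinations


def record_similarity(r1, r2):
    """Simple matching: fraction of attributes on which two records agree."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohesion(obj):
    """Assumed cohesion: mean pairwise similarity of records inside one object."""
    pairs = list(combinations(obj, 2))
    if not pairs:
        return 1.0  # a single record is treated as fully cohesive
    return sum(record_similarity(a, b) for a, b in pairs) / len(pairs)


def coupling(obj, others):
    """Assumed coupling: mean similarity between this object's records
    and the records of every other object."""
    sims = [record_similarity(a, b)
            for other in others for a in obj for b in other]
    return sum(sims) / len(sims) if sims else 0.0


def outlier_scores(objects):
    """Hypothetical outlier factor: high when both cohesion and coupling are low.
    The equal weighting is an illustrative choice, not taken from the paper."""
    scores = []
    for i, obj in enumerate(objects):
        others = objects[:i] + objects[i + 1:]
        scores.append(1.0 - 0.5 * (cohesion(obj) + coupling(obj, others)))
    return scores


# Three toy matrix-objects, each a list of (attribute1, attribute2) records.
objects = [
    [("a", "x"), ("a", "x")],
    [("a", "x"), ("a", "y")],
    [("b", "z"), ("c", "w")],
]
scores = outlier_scores(objects)
```

In this toy data the third object has internally dissimilar records and shares no attribute values with the other two, so it receives the highest score under these assumed definitions.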

Keywords: artificial intelligence, outlier detection, categorical matrix-object data, coupling degree, cohesion degree, data mining

Release date: 2019-01-28 09:56:34
