DOI: 10.3724/SP.J.1249.2019.01024

Journal of Shenzhen University Science and Engineering (深圳大学学报理工版) 2019/36:1 PP.24-32

Stratified sampling based ensemble classification for imbalanced data

Imbalanced data sets are ubiquitous in real-world applications. Under-sampling-based classification of imbalanced data faces two key issues: how to fully mine the information carried by the minority class when the majority class dominates the data set, and how to build the learner without excessive loss of majority-class information. A simple and effective strategy is to under-sample the majority class, but existing methods often lose information or destroy the intrinsic structure of the original data set. In this paper, we propose a new ensemble classification method for imbalanced data based on stratified sampling of the majority class (EC-SS). To sufficiently mine the hidden structure of the majority class, an adaptive self-tuning clustering strategy first separates the majority-class samples into different strata, and stratified sampling is then used to under-sample the majority class. This strategy generates the data components for subsequent ensemble learning while preserving the structure of the original data set, which is its main advantage. Experiments on the real benchmark data sets Musk1, Ecoli3, Glass2, and Yeast6 show that the proposed EC-SS outperforms the baselines of ensemble classification based on random sampling (EC-RS), adaptive sampling with optimal cost for class-imbalance learning (AdaS), kernel-based adaptive synthetic data generation (KernelADASYN), and the cost-sensitive large margin distribution machine (CS-LDM).

Key words: artificial intelligence, imbalance classification, stratified sampling, ensemble learning, clustering, data mining
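The pipeline summarized in the abstract — cluster the majority class into strata, under-sample each stratum proportionally to preserve the original structure, then train an ensemble on the balanced components — can be sketched as follows. This is a minimal illustration, not the paper's implementation: plain k-means stands in for the self-tuning spectral clustering of [21], the toy Gaussian data, the 1-nearest-neighbour base learner, and all parameter choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced set: two majority clusters (class 0) and a small minority (class 1).
X_maj = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
                   rng.normal([3, 3], 0.5, size=(100, 2))])
X_min = rng.normal([1.5, 1.5], 0.3, size=(20, 2))

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means, a stand-in for the paper's self-tuning spectral clustering."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

def stratified_undersample(X, strata, n_target, seed=0):
    """Draw about n_target samples proportionally from each stratum, so the
    under-sampled majority class keeps the cluster structure of the original."""
    r = np.random.default_rng(seed)
    picks = []
    for j in np.unique(strata):
        idx = np.flatnonzero(strata == j)
        n_j = max(1, round(n_target * len(idx) / len(X)))
        picks.append(r.choice(idx, size=min(n_j, len(idx)), replace=False))
    return X[np.concatenate(picks)]

strata = kmeans(X_maj, k=2)
x_query = np.array([1.4, 1.6])           # a point near the minority region
preds = []
for s in range(5):                       # five balanced data components
    Xm = stratified_undersample(X_maj, strata, len(X_min), seed=s)
    X_tr = np.vstack([Xm, X_min])
    y_tr = np.concatenate([np.zeros(len(Xm), int), np.ones(len(X_min), int)])
    # 1-nearest-neighbour base learner trained on the balanced component.
    d = ((X_tr - x_query) ** 2).sum(-1)
    preds.append(int(y_tr[np.argmin(d)]))
vote = int(np.mean(preds) >= 0.5)        # majority vote over the ensemble
```

Because each component draws from every stratum in proportion to its size, no majority cluster is discarded wholesale, which is the structural-preservation property the abstract emphasizes over plain random under-sampling (EC-RS).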

ReleaseDate:2019-01-28 09:56:34

[1] GUO Haixiang, LI Yijing, SHANG J, et al. Learning from class-imbalanced data:review of methods and applications[J]. Expert Systems with Applications, 2017, 73:220-239.

[2] PROVOST F. Machine learning from imbalanced data sets 101[C]//AAAI'2000 Workshop on Imbalanced Data Sets. Austin, USA:AAAI Publications, 2000:1-3.

[3] SATYASREE K P N V, MURTHY J. An exhaustive literature review on class imbalance problem[J]. International Journal of Emerging Trends and Technology in Computer Science, 2013, 2(3):109-118.

[4] HAN Hui, WANG Wenyuan, MAO Binghuan. Borderline-SMOTE:a new over-sampling method in imbalanced data sets learning[C]//Proceedings of the International Conference on Advances in Intelligent Computing. Berlin:Springer-Verlag, 2005:878-887.

[5] HE Haibo, BAI Yang, GARCIA E A, et al. ADASYN:adaptive synthetic sampling approach for imbalanced learning[C]//IEEE International Joint Conference on Neural Networks. Hong Kong, China:IEEE, 2008:1322-1328.

[6] BARUA S, ISLAM M M, YAO Xin, et al. MWMOTE:majority weighted minority oversampling technique for imbalanced data set learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(2):405-425.

[7] TANG Bo, HE Haibo. KernelADASYN:kernel based adaptive synthetic data generation for imbalanced learning[C]//IEEE Congress on Evolutionary Computation. Sendai, Japan:IEEE, 2015:664-671.

[8] CHUJAI P, CHOMBOON K, CHAIYAKHAN K. A cluster based classification of imbalanced data with overlapping regions between classes[C/OL]//Proceedings of the International Multi Conference of Engineers and Computer Scientists.[2017-10-02].

[9] VEROPOULOS K, CAMPBELL C, CRISTIANINI N. Controlling the sensitivity of support vector machines[C]//Proceedings of the International Joint Conference on Artificial Intelligence. Stockholm:[s.n.], 1999:55-60.

[10] CHEN Chao, LIAW A, BREIMAN L. Using random forest to learn imbalanced data:technical report 666[R]. Berkeley, USA:University of California, Berkeley, 2004.

[11] ZHOU Zhihua, LIU Xuying. Training cost-sensitive neural networks with methods addressing the class imbalance problem[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(1):63-77.

[12] CHEN Fangyong, ZHANG Jing, WEN Cuihong. Cost-sensitive large margin distribution machine for classification of imbalanced data[J]. Pattern Recognition Letters, 2016, 80:107-112.

[13] REN Yazhou, ZHAO Peng, SHENG Yongpan, et al. Robust softmax regression for multi-class classification with self-paced learning[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia:[s.n.], 2017:2641-2647.

[14] SUN Zhongbin, SONG Qinbao, ZHU Xiaoyan, et al. A novel ensemble method for classifying imbalanced data[J]. Pattern Recognition, 2015, 48(5):1623-1637.

[15] GALAR M, FERNÁNDEZ A, BARRENECHEA E, et al. EUSBoost:enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling[J]. Pattern Recognition, 2013, 46(12):3460-3471.

[16] PENG Yuxin. Adaptive sampling with optimal cost for class-imbalance learning[C]//Proceedings of the 29th Conference on Artificial Intelligence. Austin, USA:AAAI Publications, 2015:2921-2927.

[17] JO T, JAPKOWICZ N. Class imbalances versus small disjuncts[J]. ACM SIGKDD Explorations Newsletter, 2004, 6(1):40-49.

[18] CIESLAK D, CHAWLA N, STRIEGEL A. Combating imbalance in network intrusion datasets[C]//IEEE International Conference on Granular Computing. Atlanta, USA:IEEE, 2006:732-737.

[19] BARUA S, ISLAM M, YAO Xin, et al. MWMOTE:majority weighted minority oversampling technique for imbalanced data set learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(2):405-425.

[20] SOWAH R, AGEBURE M, MILLS G, et al. New cluster undersampling technique for class imbalance learning[J]. International Journal of Machine Learning and Computing, 2016, 6(3):205-214.

[21] ZELNIK-MANOR L, PERONA P. Self-tuning spectral clustering[C]//Advances in Neural Information Processing Systems. Cambridge, USA:MIT Press, 2004:1601-1608.

[22] JING Liping, TIAN Kuang, HUANG J Z. Stratified feature sampling method for ensemble clustering of high dimensional data[J]. Pattern Recognition, 2015, 48(11):3688-3702.

[23] ZHANG Teng, ZHOU Zhihua. Large margin distribution machine[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA:ACM, 2014:313-322.

[24] DEMŠAR J. Statistical comparisons of classifiers over multiple data sets[J]. Journal of Machine Learning Research, 2006, 7(1):1-30.