DOI: 10.3724/SP.J.1249.2019.01004

Journal of Shenzhen University Science and Engineering (深圳大学学报理工版) 2019/36:1 PP.4-17

A review on clustering algorithms for large-scale data sets

Clustering is an important research branch of machine learning. Over the past decades, many well-known clustering algorithms have been designed for small-scale and medium-scale data sets. Although these algorithms achieve good clustering performance, they are usually inefficient on large-scale data sets because of their high computational complexity and weak capability to handle high-dimensional data. In the age of big data, collecting and storing data has become easier and more convenient, and real applications that generate large-scale data sets urgently need clustering techniques that can keep up. Clustering for large-scale data sets has therefore become an important research direction in machine learning. In this paper, we review and analyze current clustering algorithms for large-scale data sets in two groups: sequential clustering algorithms based on instance selection, incremental learning, feature subsets, and feature transformation; and parallel clustering algorithms based on the MapReduce, Spark, and Storm computational frameworks. Unlike existing literature reviews, we focus on the computability of large-scale data sets. We also offer some new thoughts on the design and application of clustering algorithms for large-scale data sets, including design strategies based on data parallelization, automation of the training process, and some observations on clustering large-scale data in social networks.

Key words: artificial intelligence, large-scale data, clustering, sequential computing, parallel computing, data mining, review

ReleaseDate:2019-01-28 09:56:33
