DOI: 10.3724/SP.J.1249.2019.01004

Journal of Shenzhen University Science and Engineering (深圳大学学报理工版) 2019/36:1 PP.4-17

A review on clustering algorithms for large-scale data sets

Clustering is an important research branch of machine learning. Over the past decades, many well-known clustering algorithms have been designed for small-scale and medium-scale data sets. Although these algorithms achieve good clustering performance, they are usually inefficient on large-scale data sets because of their high computational complexity and their weak capability for handling high-dimensional data. In the age of big data, collecting and storing data has become easier and more convenient, and real applications that generate large-scale data sets urgently need clustering techniques that can cope with them. Clustering for large-scale data sets has therefore become an important research direction in machine learning. This paper reviews and analyzes current clustering algorithms for large-scale data sets in two groups: sequential clustering algorithms based on instance selection, incremental learning, feature subsets, and feature transformation; and parallel clustering algorithms based on the MapReduce, Spark, and Storm computational frameworks. Unlike existing literature reviews, we focus on the computability of large-scale data sets. We also offer some new thoughts on the design and application of clustering algorithms for large-scale data sets, including design strategies based on data parallelization, automation of the training process, and some observations on clustering large-scale data in social networks.

Key words: artificial intelligence, large-scale data, clustering, sequential computing, parallel computing, data mining, review

Release date: 2019-01-28 09:56:33

