Date of Submission
Academic Programs and Concentrations
Project Advisor 1
The K-means clustering algorithm works on a data set of n data points in d-dimensional space R^d. It determines a set of K centroids in R^d, and clustering is accomplished by assigning each point in the data set to its closest centroid. However, the K-means algorithm has a few drawbacks. First, it requires the value of K, the number of centroids, to be specified in advance. Second, the algorithm begins by randomly selecting the centroid locations. Third, the accuracy of the output clusters depends on the type of clustering present in the data. In this paper we propose a distance-based definition for the clusterability of a data set using the edges in a Delaunay triangulation. We propose an algorithm to pre-process the data for K-means; the output of the algorithm contains a range for K and initial centroid information. Our results show that a pre-processed K-means requires fewer iterations to reach completion. Also, by using cluster evaluation techniques such as the F-measure, Purity, and Entropy, we show that a pre-processed K-means consistently produces more accurate clusters. We also propose a new clustering algorithm which uses the Delaunay triangulation to obtain clusters, and we show that this algorithm produces very accurate clusters on a large number of data sets, even on data sets where K-means fails.
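To make the role of the initial centroids concrete, the following is a minimal sketch of the standard K-means iteration (assign each point to its closest centroid, then move each centroid to the mean of its points). It is not the pre-processing algorithm proposed in this project. The function names assign, update, and kmeans and the max_iter parameter are illustrative assumptions, and the initial centroids are simply passed in by the caller, standing in for whatever a pre-processing step would supply.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its closest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points.

    A centroid with no assigned points is left where it is (an assumption
    of this sketch; other conventions exist).
    """
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        else:
            new_centroids.append(tuple(old))
    return new_centroids


def kmeans(points, init_centroids, max_iter=100):
    """Run the assign/update loop from the given initial centroids.

    Returns (labels, centroids, iterations), where iterations counts the
    update steps performed before the assignment stopped changing.
    """
    centroids = [tuple(c) for c in init_centroids]
    labels = assign(points, centroids)
    for iteration in range(1, max_iter + 1):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            return labels, centroids, iteration
        labels = new_labels
    return labels, centroids, max_iter


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    # One starting centroid inside each visible group.
    print(kmeans(data, [(0, 0), (10, 10)]))
    # Both starting centroids inside the same group.
    print(kmeans(data, [(0, 0), (0, 1)]))
```

Running the sketch with different starting centroids on the same data returns the iteration count alongside the clusters, which is the quantity the abstract compares between plain and pre-processed K-means.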
Open Access Agreement
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
Khan, Mohd Ahnaf Habib, "Pre-processing for K-means Clustering Algorithm" (2015). Senior Projects Spring 2015. 260.