Date of Submission

Spring 2015

Academic Programs and Concentrations

Mathematics

Project Advisor 1

Mary Krembs

Abstract/Artist's Statement

The K-means clustering algorithm operates on a data set of n points in d-dimensional space R^d. It determines a set of K centroids in R^d, and clustering is accomplished by assigning each point in the data set to its closest centroid. However, the K-means algorithm has a few drawbacks. First, it requires the value of K, the number of centroids, to be specified in advance. Second, the algorithm begins by randomly selecting the centroid locations. Third, the accuracy of the output clusters depends on the type of clustering present in the data. In this paper we propose a distance-based definition for the clusterability of a data set using the edges in a Delaunay triangulation. We propose an algorithm to pre-process K-means; the output of the algorithm contains a range for K and initial centroid information. Our results show that a pre-processed K-means requires fewer iterations to reach completion. Also, by using cluster evaluation techniques such as the F-measure, Purity, and Entropy, we show that a pre-processed K-means consistently produces more accurate clusters. We also propose a new clustering algorithm that uses Delaunay triangulation to obtain clusters, and we show that it produces very accurate clusters in a large number of data sets, even in data sets where K-means fails.
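For readers unfamiliar with the baseline, the standard K-means iteration described in the abstract can be sketched as follows. This is a minimal illustration only, not the pre-processing algorithm proposed in the project, whose details the abstract does not give. The function names (`assign_points`, `update_centroids`, `k_means`) are hypothetical, and the initial centroids are assumed to be supplied by the caller, whether chosen at random or by some pre-processing step.

```python
import math

def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its assigned points.

    A centroid with no assigned points is left where it is (an assumption
    made here for simplicity).
    """
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(tuple(old))
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(p[i] for p in members) / len(members) for i in range(dim))
        )
    return new_centroids

def k_means(points, initial_centroids, max_iter=100):
    """Iterate until assignments stop changing.

    Returns (centroids, labels, number of iterations used).
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign_points(points, centroids)
    for iteration in range(1, max_iter + 1):
        centroids = update_centroids(points, labels, centroids)
        new_labels = assign_points(points, centroids)
        if new_labels == labels:
            return centroids, labels, iteration
        labels = new_labels
    return centroids, labels, max_iter

# Toy example: two well-separated pairs of points in R^2.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels, iterations = k_means(points, [(0, 0), (10, 10)])
```

The returned iteration count depends on the initial centroids, which is the quantity the abstract reports as being reduced by pre-processing.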

Open Access Agreement

On-Campus only

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
