I am clustering binary data (feature values are 0 or 1) over 20k points
with 200k columns. I use canopy clustering to find the initial clusters
and then run k-means with Manhattan distance for 10 iterations.
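To make the setup concrete, here is a minimal NumPy sketch of Lloyd-style k-means under L1 distance (not the actual Mahout job I run; the function name and toy data are just for illustration). One relevant detail: on 0/1 vectors, Manhattan distance is exactly the Hamming distance, and the L1-optimal centroid update is the coordinate-wise median rather than the mean.

```python
import numpy as np

def kmeans_l1(X, centroids, n_iters=10):
    """Lloyd-style k-means with Manhattan (L1) distance.

    For 0/1 feature vectors, L1 distance equals the Hamming distance
    (number of differing bits). The centroid that minimizes total L1
    distance is the coordinate-wise median; on binary data this is a
    per-bit majority vote (an exact tie gives 0.5).
    """
    for _ in range(n_iters):
        # L1 distance from every point to every centroid (n_points x k).
        dists = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(centroids.shape[0]):
            members = X[labels == j]
            if len(members):
                centroids[j] = np.median(members, axis=0)
    return labels, centroids
```
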
After clustering, I found many clusters with just one point and a few
very large clusters. I plotted the similarity matrix of the clusters
(using not the centroid but the OR of the bits of the points in each
cluster; since most clusters have only one point, this is the same as
the centroid). It shows a kind of pattern in the cluster similarities.
http://enl.usc.edu/~moshref/cluster_100.jpg
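For reference, this is roughly how I build the matrix (a simplified sketch; I am using Jaccard similarity here as one plausible measure between bit signatures, and the function names are mine):

```python
import numpy as np

def cluster_signatures(X, labels):
    """OR of the bit vectors of each cluster's members.

    For a singleton cluster the signature equals the point itself,
    so signature and centroid coincide there.
    """
    return {j: X[labels == j].any(axis=0).astype(int)
            for j in np.unique(labels)}

def similarity_matrix(sigs):
    """Pairwise Jaccard similarity between cluster signatures."""
    keys = sorted(sigs)
    k = len(keys)
    S = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            inter = (sigs[keys[i]] & sigs[keys[j]]).sum()
            union = (sigs[keys[i]] | sigs[keys[j]]).sum()
            # Two all-zero signatures are treated as identical.
            S[i, j] = inter / union if union else 1.0
    return S
```
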
I also ran the clustering with fewer clusters (by increasing the canopy
T2 threshold), and the same pattern appears.
http://enl.usc.edu/~moshref/cluster_200.jpg
- Am I doing something wrong?
- I want to find uniformly sized clusters. Is k-means enough for that?
Is a hierarchical method better suited to this goal? Why?
My definition of the size of a cluster is the number of 1 bits when we
OR all the bits of the members of that cluster. However, the number of
points in each cluster may also work.
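In code, the two size definitions I mean are simply (a small sketch with hypothetical helper names):

```python
import numpy as np

def cluster_size_bits(X, labels, j):
    """Size of cluster j: number of 1 bits after OR-ing its members."""
    return int(X[labels == j].any(axis=0).sum())

def cluster_size_points(labels, j):
    """Alternative size: number of points assigned to cluster j."""
    return int((labels == j).sum())
```
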
Thanks in advance
--
Masoud Moshref Javadi
Computer Engineering PhD Student
Ming Hsieh Department of Electrical Engineering
University of Southern California