This often happens when the coordinates are poorly scaled. Typically one or more features have a very skewed distribution, which makes the data look like one big mass plus scattered outliers. K-means then tends to give you a few very large clusters and many tiny ones, which is the sort of problem you describe.
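Here is a minimal sketch of the effect on synthetic data, not on your data set. Everything in it is an assumption made for illustration: the helper `kmeans_1d`, the three cluster means, the spread of 0.5, the seed, and the choice of one dimension with mean-based updates (in one dimension Manhattan and Euclidean distance give the same assignments). It draws three well-separated normal clusters, applies exp to get a badly scaled copy, and compares the cluster sizes k-means finds in each frame.

```python
import math
import random


def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's k-means on 1-D data; returns the size of each cluster."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: the middle point of each of k equal-count slices.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: abs(v - centers[j])) for v in values
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each center to the mean of its members; keep it if empty.
        for j in range(k):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return [assignment.count(j) for j in range(k)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# "Good" scaling: three normal clusters of 100 points each (illustrative means).
good = [rng.gauss(mu, 0.5) for mu in (-5.0, 0.0, 5.0) for _ in range(100)]

# "Bad" scaling: the same points after an exp transformation.
bad = [math.exp(x) for x in good]

good_sizes = sorted(kmeans_1d(good, 3))
bad_sizes = sorted(kmeans_1d(bad, 3))

print("good scaling cluster sizes:", good_sizes)
print("bad scaling cluster sizes: ", bad_sizes)
```

In the original scaling the three groups should come back at roughly equal size. After the exp transformation the two lower groups are squeezed together relative to the stretched upper one, so k-means tends to merge them into one large cluster and spend its remaining centers splitting the upper tail.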
For instance, see https://dl.dropbox.com/u/36863361/plot.png for an example. This data is actually the result of an exp transformation of a normal distribution. The original distribution is shown here:

https://dl.dropbox.com/u/36863361/plot2.png

As you can see, the story you would tell about these plots would be very different. Likewise, the story that k-means would tell is also very different.

That first example has only a single cluster. The next pair of plots has three obvious clusters in the original scaling, yet in the badly scaled frame the distribution looks about the same as before.

Bad scaling: https://dl.dropbox.com/u/36863361/plot3.png

Good scaling: https://dl.dropbox.com/u/36863361/plot4.png

On Fri, Jul 13, 2012 at 10:34 AM, Masoud Moshref Javadi <[email protected]> wrote:

> I am clustering binary data (feature values are 0 or 1) over 20k points
> with 200k columns. I use canopy to find initial clusters and then do kmeans
> using Manhattan distance in 10 iterations.
> After clustering I found that there are many clusters with just one point
> and a few very large clusters. I draw the similarity matrix of clusters
> (not centroid but OR of bits for points in each cluster but as most of
> clusters have only 1 point this is the same as centroid). It shows that
> there is a kind of pattern in similarity of matrices.
> http://enl.usc.edu/~moshref/cluster_100.jpg
>
> I also run clustering with fewer clusters (by increasing the canopy t2
> threshold) and the same pattern occurs.
> http://enl.usc.edu/~moshref/cluster_200.jpg
>
> - Am I doing something wrong?
> - I want to find uniform size clusters, is kmeans enough for that? is
> hierarchical method good for this goal? why?
> The definition of size of cluster is the number of 1 bits when we OR all
> bits for members of a cluster. However, the number of points in each
> cluster may also work
>
> Thanks in advance
>
> --
> Masoud Moshref Javadi
> Computer Engineering PhD Student
> Ming Hsieh Department of Electrical Engineering
> University of Southern California
