Re: irregular kmeans clusters on binary data

Masoud Moshref Javadi Fri, 13 Jul 2012 12:10:14 -0700

First of all, thank you for your response and the pictures.

That's true. Some features are 1 in many points and some are not. That's the nature of my problem. But I did not scale the features.

Should I scale them? Or maybe use a dimensionality reduction algorithm?



On 7/13/2012 11:58 AM, Ted Dunning wrote:

This often happens when the coordinates are poorly scaled.  Often some
feature or other has a very skewed distribution which makes the data look
like a big mass plus scattered outliers.  This leads to your sort of
problem.

For instance, see https://dl.dropbox.com/u/36863361/plot.png for an
example.  This data is actually the result of an exp transformation of a
normal distribution.  The original distribution is shown here
https://dl.dropbox.com/u/36863361/plot2.png

As you can see, the story you would tell about these plots would be very
different.  Likewise, the story that k-means would tell is also very
different.
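A minimal sketch of the effect described above, using only the Python standard library. The sample size, the seed, and the skewness helper are illustrative assumptions and are not taken from the plots: a normal sample is roughly symmetric, its exp transform is strongly right-skewed (a big mass plus scattered large values), and taking the log recovers the original frame.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

random.seed(0)  # fixed seed so the sketch is repeatable
original = [random.gauss(0.0, 1.0) for _ in range(10_000)]
badly_scaled = [math.exp(v) for v in original]   # exp of a normal sample
recovered = [math.log(v) for v in badly_scaled]  # log undoes the exp

skew_original = skewness(original)    # close to 0: symmetric
skew_bad = skewness(badly_scaled)     # large and positive: long right tail
print(f"skewness, original frame:     {skew_original:.2f}")
print(f"skewness, badly scaled frame: {skew_bad:.2f}")
```

Whether a log (or any other) transform is appropriate depends on the data; for strictly 0/1 features a different kind of rescaling would be needed.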

That first example has only a single cluster. The next pair of plots has
three obvious clusters in the original scaling, yet in the badly scaled
frame the distribution looks about the same as before.

Bad scaling: https://dl.dropbox.com/u/36863361/plot3.png
Good scaling: https://dl.dropbox.com/u/36863361/plot4.png
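The same point can be sketched with a toy one-dimensional k-means. This is plain Python, not Mahout code, and the data, the min/median/max seeding, and the empty-cluster rule are all assumptions made for illustration. Three well-separated groups are clustered once in the original frame and once after an exp transform.

```python
import math

def kmeans_1d(points, iterations=20):
    """Plain Lloyd's k-means with k=3 on 1-D data, seeded at min, median, max."""
    ordered = sorted(points)
    centers = [ordered[0], ordered[len(ordered) // 2], ordered[-1]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(3), key=lambda c: abs(p - centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(3):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = sum(members) / len(members)
    return labels

# Three groups of five points around -3, 0 and 3 (illustrative data).
good = [m + d for m in (-3.0, 0.0, 3.0) for d in (-0.4, -0.2, 0.0, 0.2, 0.4)]
bad = [math.exp(x) for x in good]  # same points in the badly scaled frame

labels_good = kmeans_1d(good)
labels_bad = kmeans_1d(bad)
print("good scaling:", labels_good)
print("bad scaling: ", labels_bad)
```

In the original frame each group gets its own cluster. In the exp frame the two lower groups are squeezed together near zero and end up sharing one cluster, while one center is left with no points.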

On Fri, Jul 13, 2012 at 10:34 AM, Masoud Moshref Javadi <[email protected]> wrote:

I am clustering binary data (feature values are 0 or 1) over 20k points
with 200k columns. I use canopy to find initial clusters and then run kmeans
using Manhattan distance for 10 iterations.
After clustering I found that there are many clusters with just one point
and a few very large clusters. I drew the similarity matrix of the clusters
(not of the centroids but of the OR of bits over the points in each cluster;
since most clusters have only 1 point, this is the same as the centroid). It
shows that there is a kind of pattern in the similarity matrix.
http://enl.usc.edu/~moshref/cluster_100.jpg
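For reference, on 0/1 vectors the Manhattan distance is just the number of positions where two points differ (the Hamming distance). The small sketch below illustrates that; the vectors and function names are made up for the example and are not part of the setup described above.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

p = [1, 0, 1, 1, 0, 0]
q = [1, 1, 0, 1, 0, 1]

# For binary vectors the two distances coincide.
d_manhattan = manhattan(p, q)
d_hamming = hamming(p, q)
print(d_manhattan, d_hamming)
```

One consequence is that a feature which is 1 in many points contributes to the distance as often as any other feature, so nothing in the distance itself compensates for how common a feature is.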

I also ran the clustering with fewer clusters (by increasing the canopy t2
threshold) and the same pattern occurs.
http://enl.usc.edu/~moshref/cluster_200.jpg

- Am I doing something wrong?
- I want to find clusters of uniform size. Is kmeans enough for that? Is a
hierarchical method good for this goal? Why?
By the size of a cluster I mean the number of 1 bits when we OR all
bits of the members of the cluster. However, the number of points in each
cluster may also work.
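The two notions of cluster size mentioned above can be sketched as follows. The cluster contents and the helper name `or_size` are hypothetical and chosen only to show that the two measures can disagree.

```python
def or_size(members):
    """Count the columns that are 1 in at least one member (OR the bits, then count)."""
    return sum(1 for column in zip(*members) if any(column))

# Hypothetical clusters of 5-bit points.
clusters = {
    "a": [[1, 0, 0, 1, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 1]],
    "b": [[0, 0, 1, 0, 1]],
}

sizes = {name: (or_size(members), len(members)) for name, members in clusters.items()}
for name, (bits, points) in sizes.items():
    print(f"cluster {name}: OR size = {bits}, points = {points}")
```

For a single-point cluster the OR size is simply the number of 1 bits in that point, which is why the OR summary matches the centroid when most clusters hold only one point.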


Thanks in advance

--
Masoud Moshref Javadi
Computer Engineering PhD Student
Ming Hsieh Department of Electrical Engineering
University of Southern California


