I need to compute the statistics you mentioned, and it will take some time.
However, my data is about geometric objects. Suppose there is an
n-dimensional space containing many overlapping axis-parallel boxes.
We view the space through many axis-parallel windows. Each window
is one of our points, and if a window can see a box, the corresponding
feature is 1. Now I want to cluster those windows based on the similarity
of the boxes they see!
There are many large boxes that are visible from many windows, and there
are small boxes that can be seen from only one window. My goal is to find
clusters in which roughly the same set of boxes is visible through all of
the windows in the cluster.
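To make the setup concrete, here is a minimal 2-D sketch of the construction described above. The specific boxes, windows, and the `overlaps` helper are illustrative, not taken from the actual data: each axis-parallel rectangle is `(xmin, xmax, ymin, ymax)`, and a window "sees" a box when their rectangles intersect, giving one binary feature per box.

```python
import numpy as np

# Illustrative axis-parallel boxes and windows in 2-D,
# each given as (xmin, xmax, ymin, ymax).
boxes = [
    (0, 10, 0, 10),   # a large box, visible from many windows
    (1, 2, 1, 2),     # a small box, visible from only one window
    (8, 12, 8, 12),
]
windows = [
    (0, 3, 0, 3),
    (9, 11, 9, 11),
    (4, 6, 4, 6),
]

def overlaps(a, b):
    # Two axis-parallel rectangles intersect iff they overlap on every axis.
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]

# Binary feature matrix: rows = windows (points), columns = boxes (features).
F = np.array([[int(overlaps(w, b)) for b in boxes] for w in windows])
print(F)
```

Clustering the rows of `F` is then exactly the problem in question; note that the large box contributes a column of mostly 1s while the small box contributes a nearly all-zero column, which is the imbalance discussed below.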
You talked about scaling. By scaling, do you mean that I should not use
binary data and should instead use, for example, values between 0 and 1,
and try to maximize the variance of the data in each dimension?
On 7/13/2012 12:13 PM, Ted Dunning wrote:
On Fri, Jul 13, 2012 at 12:09 PM, Masoud Moshref Javadi <[email protected]> wrote:
First of all, thank you for your response with pictures.
That's true. Some features are 1 in many points and some are not; that's
the nature of my problem. But I did not scale the features.
Should I do scaling? Maybe use a dimensionality-reduction algorithm?
Can you say more about your data? Can you provide the output of something
like the summary function from R?
Dimensionality reduction will be a disaster if you have badly scaled data.
Dimensionality reduction preserves L_2 distances. If those distances are
already messed up, then it will preserve the mess. You don't want that.
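A small sketch of the point being made here, using Gaussian random projection as one representative dimensionality-reduction method (the specific method and all parameters are my own choice for illustration): pairwise L_2 distances survive the projection almost unchanged, so whatever structure, good or bad, those distances encode is carried along.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))             # 100 points in 200 dimensions

k = 50                                      # reduced dimension
R = rng.normal(size=(200, k)) / np.sqrt(k)  # scaled Gaussian random projection
Y = X @ R                                   # projected data

def pairwise_dists(A):
    # All pairwise Euclidean distances (upper triangle, no diagonal).
    G = np.sum(A ** 2, axis=1)
    D2 = G[:, None] + G[None, :] - 2 * (A @ A.T)
    iu = np.triu_indices(len(A), k=1)
    return np.sqrt(np.maximum(D2[iu], 0.0))

d_orig = pairwise_dists(X)
d_proj = pairwise_dists(Y)
rel_err = np.abs(d_proj - d_orig) / d_orig
print(rel_err.mean(), rel_err.max())        # distances are nearly preserved
```

The relative distortion stays small for every pair, which is exactly why fixing the metric must come before reducing dimensionality: the reduction will faithfully reproduce whatever distances it is given.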
Get the metric right first. Then let's talk.
Also, if you have 1-of-n features, you should almost certainly encode them
as n binary values, not as a single variable with n values.
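A minimal sketch of that encoding (the variable name and values are made up for illustration): a single categorical variable with n possible values becomes n binary, one-hot columns, so that no artificial ordering or spacing is imposed on the categories.

```python
import numpy as np

# Hypothetical categorical variable with n = 3 distinct values.
colors = ["red", "green", "red", "blue"]
levels = sorted(set(colors))                # ['blue', 'green', 'red']

# One binary column per level; exactly one 1 per row.
onehot = np.array([[int(c == lvl) for lvl in levels] for c in colors])
print(onehot)
```

Encoded this way, the L_2 distance between any two distinct categories is the same, which is what you want before clustering or dimensionality reduction.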
--
Masoud Moshref Javadi
Computer Engineering PhD Student
Ming Hsieh Department of Electrical Engineering
University of Southern California