I need to compute the statistics you mentioned, and it will take some time.
However, my data is about geometric objects. Suppose there is an
n-dimensional space containing many overlapping axis-parallel boxes.
We view the space through many axis-parallel windows. Each window
is one of our points, and if a window can see a box, the corresponding
feature is 1. Now I want to cluster those windows based on the similarity
of the boxes they see!
There are many large boxes that are visible from many windows, and there
are small boxes that can be seen from only one window. My goal is to find
clusters in which roughly the same set of boxes is visible through all of
the windows in the cluster.
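To make the setup concrete, here is a minimal 2-D sketch of the construction described above. The specific boxes, windows, and the `overlaps` helper are illustrative, not taken from the actual data: each axis-parallel rectangle is `(xmin, xmax, ymin, ymax)`, and a window "sees" a box when their rectangles intersect, giving one binary feature per box.

```python
import numpy as np

# Illustrative axis-parallel boxes and windows in 2-D,
# each given as (xmin, xmax, ymin, ymax).
boxes = [
    (0, 10, 0, 10),   # a large box, visible from many windows
    (1, 2, 1, 2),     # a small box, visible from only one window
    (8, 12, 8, 12),
]
windows = [
    (0, 3, 0, 3),
    (9, 11, 9, 11),
    (4, 6, 4, 6),
]

def overlaps(a, b):
    # Two axis-parallel rectangles intersect iff they overlap on every axis.
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]

# Binary feature matrix: rows = windows (points), columns = boxes (features).
F = np.array([[int(overlaps(w, b)) for b in boxes] for w in windows])
print(F)
```

Clustering the rows of `F` is then exactly the problem in question; note that the large box contributes a column of mostly 1s while the small box contributes a nearly all-zero column, which is the imbalance discussed below.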
You talked about scaling. By scaling, do you mean that I should not use
binary data and should instead use, for example, values between 0 and 1,
and try to maximize the variance of the data in each dimension?
On 7/13/2012 12:13 PM, Ted Dunning wrote:
On Fri, Jul 13, 2012 at 12:09 PM, Masoud Moshref Javadi <[email protected]> wrote:
First of all, thank you for your response with pictures.
That's true. Some features are 1 in many points and some are not; that's
the nature of my problem. But I did not scale the features.
Should I do scaling? Maybe use a dimensionality-reduction algorithm?
Can you say more about your data? Can you provide the output of something
like the summary function from R?
Dimensionality reduction will be a disaster if you have badly scaled data.
Dimensionality reduction preserves L_2 distances. If those distances are
already messed up, then it will preserve the mess. You don't want that.
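A small sketch of the point being made here, using Gaussian random projection as one representative dimensionality-reduction method (the specific method and all parameters are my own choice for illustration): pairwise L_2 distances survive the projection almost unchanged, so whatever structure, good or bad, those distances encode is carried along.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))             # 100 points in 200 dimensions

k = 50                                      # reduced dimension
R = rng.normal(size=(200, k)) / np.sqrt(k)  # scaled Gaussian random projection
Y = X @ R                                   # projected data

def pairwise_dists(A):
    # All pairwise Euclidean distances (upper triangle, no diagonal).
    G = np.sum(A ** 2, axis=1)
    D2 = G[:, None] + G[None, :] - 2 * (A @ A.T)
    iu = np.triu_indices(len(A), k=1)
    return np.sqrt(np.maximum(D2[iu], 0.0))

d_orig = pairwise_dists(X)
d_proj = pairwise_dists(Y)
rel_err = np.abs(d_proj - d_orig) / d_orig
print(rel_err.mean(), rel_err.max())        # distances are nearly preserved
```

The relative distortion stays small for every pair, which is exactly why fixing the metric must come before reducing dimensionality: the reduction will faithfully reproduce whatever distances it is given.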
Get the metric right first. Then let's talk.
Also, if you have 1-of-n features, you should almost certainly encode them
as n binary values, not as a single variable with n values.
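A minimal sketch of that encoding (the variable name and values are made up for illustration): a single categorical variable with n possible values becomes n binary, one-hot columns, so that no artificial ordering or spacing is imposed on the categories.

```python
import numpy as np

# Hypothetical categorical variable with n = 3 distinct values.
colors = ["red", "green", "red", "blue"]
levels = sorted(set(colors))                # ['blue', 'green', 'red']

# One binary column per level; exactly one 1 per row.
onehot = np.array([[int(c == lvl) for lvl in levels] for c in colors])
print(onehot)
```

Encoded this way, the L_2 distance between any two distinct categories is the same, which is what you want before clustering or dimensionality reduction.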
--
Masoud Moshref Javadi
Computer Engineering PhD Student
Ming Hsieh Department of Electrical Engineering
University of Southern California