This may indicate that your data are too sparse to get useful clustering. Smoothing using SVD or a second-order distance might help.
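To make the suggestion concrete, here is a minimal sketch of one way to read "second-order distance": compare two examples by their similarity profiles (how similar each is to every other example) rather than by their raw coordinates. That reading, the toy data, and the helper names `cosine_distance` and `similarity_profiles` are assumptions for illustration only; this is plain Python, not Mahout code. It also shows why cosine distance comes out as 1 so often on sparse data: two vectors that share no nonzero coordinate are orthogonal.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def similarity_profiles(rows):
    """For each row, build the vector of its cosine similarities to all rows."""
    return [[1.0 - cosine_distance(r, other) for other in rows] for r in rows]


# Hypothetical toy data: rows are examples, columns are sparse features.
rows = [
    [1, 1, 0, 0, 0, 0],  # shares a feature with row 1 only
    [0, 1, 1, 0, 0, 0],  # bridges rows 0 and 2
    [0, 0, 1, 0, 0, 0],  # shares a feature with row 1 only
    [0, 0, 0, 1, 1, 0],  # unrelated to rows 0-2
]

profiles = similarity_profiles(rows)

# Rows 0 and 2 have no feature in common, so their raw distance is maximal.
first_order = cosine_distance(rows[0], rows[2])

# Both are similar to row 1, so their similarity profiles overlap.
second_order = cosine_distance(profiles[0], profiles[2])

# Row 3 shares nothing with row 0, directly or through a neighbour.
unrelated = cosine_distance(profiles[0], profiles[3])

print(first_order, second_order, unrelated)
```

On this toy input, rows 0 and 2 sit at the maximum first-order distance but move closer under the second-order measure, while the unrelated row 3 stays at distance 1. The all-pairs profile computation is quadratic in the number of examples, so at real scale it would need sampling or a nearest-neighbour approximation.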
Sent from my iPhone

On Oct 19, 2011, at 8:47, "Bae, Jae Hyeon" <[email protected]> wrote:

> I am sorry, I confused distance and similarity. The distance between
> pairs is mostly 1 with CosineDistanceMeasure.
>
> 2011/10/19 Ted Dunning <[email protected]>
>
>> Distance between pairs is mostly zero? This indicates a real problem. If
>> the pairs that you mean are pairs of examples it isn't so bad, but pairs of
>> canopies should have nonzero distance.
>>
>> Or did you mean pairs of coordinates?
>>
>> Sent from my iPhone
>>
>> On Oct 19, 2011, at 8:36, "Bae, Jae Hyeon" <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I am trying to cluster very sparse data. Canopy clustering generates
>>> so many canopies that it hits the GC overhead limit. I can change the
>>> parameters of canopy clustering, but distances between most pairs are 0,
>>> so changing parameters does not have much effect. Even if I increase the
>>> -Xmx size, the large number of canopies will drive the single reducer of
>>> canopy clustering to the GC overhead limit.
>>>
>>> Could you suggest any better idea for this situation? I can try K-means
>>> clustering with a large K, and Locality Sensitive Hashing could be a
>>> good candidate, but I am not sure the Likelike implementation is robust
>>> and flexible enough to use.
>>>
>>> Thank you
>>>
>>> Best, Jae
>>
