Sorry, I mixed up distance and similarity. With CosineDistanceMeasure, the distance between most pairs is 1, not 0.
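
To illustrate why very sparse data behaves this way, here is a minimal sketch in plain Python. It is not Mahout's code. It assumes cosine distance is defined as 1 minus cosine similarity, that vectors are stored as {index: value} dicts, and that a zero vector gets distance 1 (a convention chosen only for this sketch). The names cosine_distance and count_canopies are hypothetical, and count_canopies is a simplified single pass that uses only a T2-style threshold.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for sparse vectors given as {index: value} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(value * b[index] for index, value in a.items() if index in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption for this sketch: a zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def count_canopies(points, t2):
    """Simplified canopy pass: each center removes the points closer than t2."""
    remaining = list(points)
    canopies = 0
    while remaining:
        center = remaining.pop(0)
        canopies += 1
        # Points at distance >= t2 stay in the list and can become centers later.
        remaining = [p for p in remaining if cosine_distance(center, p) >= t2]
    return canopies


# Hypothetical toy data: the first and last vectors point the same way,
# the other two share no non-zero index with anything else.
docs = [
    {0: 1.0, 1: 2.0},
    {2: 3.0},
    {3: 1.0, 4: 1.0},
    {0: 2.0, 1: 4.0},
]

if __name__ == "__main__":
    print(cosine_distance(docs[0], docs[1]))  # no shared index, so distance is 1.0
    print(count_canopies(docs, t2=0.5))
```

Two vectors with no non-zero index in common have a dot product of 0, so their distance is exactly 1. When most pairs look like that, any threshold below 1 leaves nearly every point as its own canopy, which would match the large number of canopies I am seeing.
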
2011/10/19 Ted Dunning <[email protected]>

> Distance between pairs is mostly zero? This indicates a real problem. If
> the pairs that you mean are pairs of examples it isn't so bad, but pairs of
> canopies should have nonzero distance.
>
> Or did you mean pairs of coordinates?
>
> Sent from my iPhone
>
> On Oct 19, 2011, at 8:36, "Bae, Jae Hyeon" <[email protected]> wrote:
>
> > Hi
> >
> > I am trying to cluster very sparse data. Canopy clustering generates so
> > many canopies that it hits the GC overhead limit. I can change the
> > parameters of canopy clustering, but the distances between most pairs
> > are 0, so changing the parameters has little effect. Even if I increase
> > the -Xmx size, the large number of canopies will drive the single reducer
> > of canopy clustering to the GC overhead limit.
> >
> > Could you suggest a better approach for this situation? I can try K-means
> > clustering with a large K. Locality Sensitive Hashing could also be a
> > good candidate, but I am not sure the Likelike implementation is robust
> > and flexible enough to use.
> >
> > Thank you
> >
> > Best, Jae
