What about factorizing the matrix with SVD to get dense vectors?

2011/10/19 Bae, Jae Hyeon <[email protected]>
> I am sorry, I confused distance and similarity. The distance between
> pairs is mostly 1 with CosineDistanceMeasure.
>
> 2011/10/19 Ted Dunning <[email protected]>
>
> > Distance between pairs is mostly zero? This indicates a real problem. If
> > the pairs that you mean are pairs of examples it isn't so bad, but pairs
> > of canopies should have nonzero distance.
> >
> > Or did you mean pairs of coordinates?
> >
> > Sent from my iPhone
> >
> > On Oct 19, 2011, at 8:36, "Bae, Jae Hyeon" <[email protected]> wrote:
> >
> > > Hi
> > >
> > > I am trying to cluster very sparse data. Canopy clustering generates
> > > so many canopies that it hits the GC overhead limit. I can change the
> > > parameters of canopy clustering, but distances between most pairs are
> > > 0, so changing the parameters has little effect. Even if I increase
> > > the -Xmx size, the large number of canopies will drive the single
> > > reducer of canopy clustering to the GC overhead limit.
> > >
> > > Could you suggest a better idea for this situation? I can try K-means
> > > clustering with a large K, and Locality Sensitive Hashing could be a
> > > good candidate, but I am not sure the Likelike implementation is
> > > robust and flexible enough to use.
> > >
> > > Thank you
> > >
> > > Best, Jae
> > >
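
To make the SVD suggestion concrete, here is a minimal sketch in Python with NumPy. It is not Mahout code, and the toy matrix, the helper names (cosine_distance, dense_embedding) and the choice k=2 are all assumptions made for illustration. It shows two sparse rows with no shared feature, whose cosine distance is therefore 1, and the same rows after projection onto the top singular directions.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cos(a, b); treat an all-zero vector as maximally distant."""
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - float(np.dot(a, b)) / (norm_a * norm_b)


def dense_embedding(X, k):
    """Project the rows of X onto the top-k singular directions (U_k * S_k)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


# Hypothetical toy data: 4 examples over 7 binary features.
# Rows 0 and 1 share no feature, but row 2 contains the features of both.
X = np.array([
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
], dtype=float)

Z = dense_embedding(X, k=2)  # k=2 is an arbitrary choice for this toy case

print("sparse d(0,1):", cosine_distance(X[0], X[1]))
print("dense  d(0,1):", cosine_distance(Z[0], Z[1]))
print("dense  d(0,3):", cosine_distance(Z[0], Z[3]))
```

On this toy matrix the sparse distance between rows 0 and 1 is exactly 1 because their dot product is 0. In the 2-dimensional embedding the two rows end up pointing the same way, because row 2 links their features, while row 3 stays at distance close to 1. A full dense SVD like this only works for small matrices; real sparse data would need a truncated or stochastic solver.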
