Hi

I am trying to cluster very sparse data. Canopy clustering generates so
many canopies that the job hits the GC overhead limit. I can change the
canopy clustering parameters, but because the distance between most pairs
is 0, changing them has little effect. Even if I increase -Xmx, the large
number of canopies still drives the single reducer of the canopy clustering
job to the GC overhead limit.

Could you suggest a better approach for this situation? I could try K-means
clustering with a large K. Locality Sensitive Hashing may also be a good
candidate, but I am not sure whether the Likelike implementation is robust
and flexible enough to use.
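
To make the LSH idea concrete, here is a minimal single-machine sketch of
what I have in mind. It is not Likelike's or Mahout's API. It assumes each
record is a set of non-negative integer feature ids smaller than the prime
used for hashing, and the names (minhash_signature, lsh_buckets) and the
parameters (num_hashes, bands) are hypothetical choices for illustration.

```python
import random

PRIME = 2_147_483_647  # assumption: all feature ids are smaller than this prime


def minhash_signature(items, seeds, prime=PRIME):
    """Return one min-hash value per (a, b) pair for a non-empty set of ids."""
    return [min((a * x + b) % prime for x in items) for a, b in seeds]


def lsh_buckets(vectors, num_hashes=20, bands=5, seed=42):
    """Group record keys whose MinHash signatures agree on at least one band.

    vectors maps a record key to the set of its non-zero feature ids.
    """
    rng = random.Random(seed)
    seeds = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
             for _ in range(num_hashes)]
    rows = num_hashes // bands  # hash values per band
    buckets = {}
    for key, items in vectors.items():
        if not items:
            continue  # an empty record has no min-hash; skip it
        sig = minhash_signature(items, seeds)
        for band in range(bands):
            band_key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(band_key, []).append(key)
    return buckets


if __name__ == "__main__":
    demo = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {100, 200}}
    for band_key, keys in lsh_buckets(demo).items():
        print(band_key[0], keys)
```

Records that share a bucket would become candidates for the same cluster,
so the number of groups depends on num_hashes and bands rather than on
distance thresholds.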

Thank you

Best, Jae
