Hi,

I am trying to cluster very sparse data. Canopy clustering generates so many canopies that the job fails with a "GC overhead limit exceeded" error. I can change the canopy parameters, but because the similarity between most pairs of points is 0, changing them has little effect. Even if I increase the -Xmx heap size, the large number of canopies still drives the single reducer of the canopy clustering job to the GC overhead limit.
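
To illustrate what I think is happening, here is a minimal Python sketch. It is not Mahout's code: `canopy_centers`, the dict-based sparse vectors, and the toy data are made up for illustration, and I am assuming cosine distance, under which two points that share no features have similarity 0 and distance 1. It only performs the T2 pass that decides when a new canopy is created.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {index: value} dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def canopy_centers(points, t2):
    """Return the points that start a new canopy (T2 pass only).

    A point starts a new canopy when it is farther than t2 from every
    existing canopy center; otherwise it is absorbed by an existing one.
    """
    centers = []
    for p in points:
        if all(cosine_distance(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


# Toy data (hypothetical): 200 points, each with a single feature of its own,
# so every pair of points has cosine distance 1.
points = [{i: 1.0} for i in range(200)]

for t2 in (0.5, 0.99, 1.0):
    print(t2, len(canopy_centers(points, t2)))
```

In this toy case every point becomes its own canopy for any T2 below 1, and everything collapses into a single canopy once T2 reaches 1, which matches what I see: there is no useful setting in between.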
Could you suggest a better approach for this situation? I could try K-means clustering with a large K. Locality Sensitive Hashing might also be a good candidate, but I am not sure whether the Likelike implementation is robust and flexible enough to use.

Thank you.

Best,
Jae
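
P.S. To make the LSH idea concrete, this is roughly what I have in mind: a minimal MinHash sketch with banding, in plain Python. It is not the Likelike API; `make_hash_params`, `minhash_signature`, `lsh_buckets`, and the band sizes are hypothetical names and values chosen for illustration, and I am assuming each point can be treated as a non-empty set of non-negative integer feature indices smaller than `PRIME`.

```python
import random
from collections import defaultdict

# Mersenne prime 2^31 - 1; assumed to be larger than any feature index.
PRIME = 2_147_483_647


def make_hash_params(num_hashes, seed=42):
    """Random (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, params):
    """MinHash signature of a non-empty set of integer feature indices."""
    return tuple(min((a * f + b) % PRIME for f in features)
                 for a, b in params)


def lsh_buckets(items, num_bands=4, rows_per_band=2, seed=42):
    """Map (band, band signature) to the ids of the items that hash there.

    Two items become candidates for the same cluster when they share
    at least one bucket.
    """
    params = make_hash_params(num_bands * rows_per_band, seed)
    buckets = defaultdict(set)
    for item_id, features in items.items():
        sig = minhash_signature(features, params)
        for band in range(num_bands):
            start = band * rows_per_band
            buckets[(band, sig[start:start + rows_per_band])].add(item_id)
    return buckets


# Toy data (hypothetical): "a" and "b" are identical, "c" shares nothing.
items = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {100, 200}}
buckets = lsh_buckets(items)
for key, members in sorted(buckets.items()):
    print(key, sorted(members))
```

The appeal for my data is that points with no features in common never share a bucket, so only points that overlap are ever compared, instead of keeping every canopy in one reducer.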
