Need to reduce execution time of RowSimilarityJob

yamo93 Tue, 18 Sep 2012 05:50:09 -0700

Hi,

I have 30,000 items and the computation takes more than 2 hours on a pseudo-cluster, which is too long in my case.

I am considering some ways to reduce the execution time of RowSimilarityJob, and I wonder whether any of you have implemented them (and how), or explored other ways.

1. Tune the JVM.
2. Develop an in-memory implementation (i.e. without Hadoop).
3. Reduce the size of the matrix (by removing items that have no words in common, for example).
4. Run on a real Hadoop cluster with several nodes (does anyone have an idea of the number of nodes needed to make it worthwhile?).
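
To make options 2 and 3 more concrete, here is a minimal in-memory sketch. It is not Mahout code and does not reproduce what RowSimilarityJob does internally. It assumes each item is a sparse vector of word weights that fits in memory, and it uses cosine similarity as the measure. The function name `row_similarities` and the `threshold` parameter are hypothetical, chosen for illustration only. An inverted index ensures that only pairs of items sharing at least one word are ever compared.

```python
from collections import defaultdict
from math import sqrt


def row_similarities(rows, threshold=0.0):
    """Cosine similarity for every pair of rows sharing at least one word.

    rows: dict mapping item id -> dict mapping word -> weight (sparse vector).
    Returns a dict mapping (id_a, id_b), with id_a < id_b, to the similarity.
    Hypothetical helper for illustration; not part of Mahout.
    """
    # Inverted index: word -> list of (item id, weight) with non-zero weight.
    index = defaultdict(list)
    for item, vec in rows.items():
        for word, weight in vec.items():
            if weight != 0:
                index[word].append((item, weight))

    # Accumulate dot products only for pairs that co-occur under some word.
    # Pairs with no word in common never appear here.
    dots = defaultdict(float)
    for postings in index.values():
        for i in range(len(postings)):
            a, wa = postings[i]
            for j in range(i + 1, len(postings)):
                b, wb = postings[j]
                key = (a, b) if a < b else (b, a)
                dots[key] += wa * wb

    # Euclidean norm of each row, needed to turn dot products into cosines.
    norms = {
        item: sqrt(sum(w * w for w in vec.values()))
        for item, vec in rows.items()
    }

    sims = {}
    for (a, b), dot in dots.items():
        sim = dot / (norms[a] * norms[b])
        if sim >= threshold:
            sims[(a, b)] = sim
    return sims
```

The cost of this approach is driven by the length of the posting lists: a word that appears in most items still produces close to a quadratic number of pairs, so dropping very frequent words beforehand (an assumption, not something tested here) would probably matter more than the index itself.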


Thanks for your help,
Yann.
