I've been implementing the RowSimilarityJob on our 40-node cluster and have run into some serious performance issues.
I'm trying to run the job on a corpus of just over 2 million documents using bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper and SimilarityReducer), I run out of space on HDFS because the job is generating over 5 terabytes of output data. Has anybody else run into similar issues? What other info can I provide that would be helpful?

Thanks,
Burke
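For context on the scale: the co-occurrence phase emits one record per pair of documents that share a term, so the output grows roughly with the sum over terms of df*(df-1)/2, where df is a term's document frequency. Here's a rough back-of-the-envelope sketch (the function name and the 100,000-document figure are illustrative assumptions, not measurements from this job):

```python
def cooccurrence_pairs(doc_frequencies):
    """Estimate the number of (doc, doc) pairs emitted by the
    co-occurrence phase: one pair per pair of documents sharing
    a term, i.e. sum over terms of df * (df - 1) / 2."""
    return sum(df * (df - 1) // 2 for df in doc_frequencies)

# A single common bigram appearing in a hypothetical 100,000
# documents already yields ~5 billion pairs on its own:
print(cooccurrence_pairs([100_000]))  # 4999950000
```

With 2M documents, even a handful of high-frequency bi-grams can account for terabytes of intermediate pairs, which may be why the step blows past available HDFS space.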
