I've been implementing the RowSimilarityJob on our 40-node cluster and have
run into some serious performance issues.

I am trying to run the job on a corpus of just over 2 million documents using
bi-grams.  When I get to the pairwise similarity step (CooccurrencesMapper
and SimilarityReducer), I run out of space on HDFS because the job
generates over 5 terabytes of output data.
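
For a rough sense of scale, here is a back-of-the-envelope sketch in Python.
It rests on an assumption I have not verified against the Mahout source: that
the pairwise step emits one record for every pair of rows sharing a column, so
a column with n non-zero entries contributes about n*(n-1)/2 records.  The
function names, the bytes_per_pair value and the frequencies below are all
made up for illustration; they are not measurements from our corpus.

```python
def estimate_pairs(doc_freqs):
    """Total row pairs emitted, assuming each column (bi-gram) that occurs
    in n documents yields n*(n-1)/2 pairs (assumption, see above)."""
    return sum(n * (n - 1) // 2 for n in doc_freqs)


def estimate_bytes(doc_freqs, bytes_per_pair=16):
    """Rough intermediate output size; bytes_per_pair is a hypothetical
    per-record size, not a value taken from Mahout."""
    return estimate_pairs(doc_freqs) * bytes_per_pair


if __name__ == "__main__":
    # Hypothetical document frequencies for three bi-grams.
    freqs = [10, 1_000, 100_000]
    print(estimate_pairs(freqs), "pairs")
    print(estimate_bytes(freqs) / 1e9, "GB (approx.)")
```

If that assumption holds, a handful of very frequent bi-grams would account
for nearly all of the output, since the count grows quadratically with n.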

Has anybody else run into similar issues?  What other info can I provide
that would be helpful?

Thanks,
Burke
