I've been implementing the RowSimilarityJob on our 40-node cluster and have run into some serious performance issues.
I'm trying to run the job on a corpus of just over 2 million documents using bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper and SimilarityReducer), I run out of space on HDFS because the job is generating over 5 terabytes of output data. Has anybody else run into similar issues? What other info can I provide that would be helpful?

Thanks,
Burke
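For context on the scale: the co-occurrence phase emits one record per pair of documents that share a term, so the output grows roughly with the sum over terms of df*(df-1)/2, where df is a term's document frequency. Here's a rough back-of-the-envelope sketch (the function name and the 100,000-document figure are illustrative assumptions, not measurements from this job):

```python
def cooccurrence_pairs(doc_frequencies):
    """Estimate the number of (doc, doc) pairs emitted by the
    co-occurrence phase: one pair per pair of documents sharing
    a term, i.e. sum over terms of df * (df - 1) / 2."""
    return sum(df * (df - 1) // 2 for df in doc_frequencies)

# A single common bigram appearing in a hypothetical 100,000
# documents already yields ~5 billion pairs on its own:
print(cooccurrence_pairs([100_000]))  # 4999950000
```

With 2M documents, even a handful of high-frequency bi-grams can account for terabytes of intermediate pairs, which may be why the step blows past available HDFS space.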
