Which Mahout version are you running? Please work off of 0.9; there was a performance issue in RSJ (RowSimilarityJob) that was fixed in that release.
On Fri, Sep 26, 2014 at 4:23 PM, Burke Webster <[email protected]> wrote:
> I've been implementing the RowSimilarityJob on our 40-node cluster and have
> run into some serious performance issues.
>
> Trying to run the job on a corpus of just over 2 million documents using
> bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper
> and SimilarityReducer) I am running out of space on hdfs because the job is
> generating over 5 terabytes of output data.
>
> Has anybody else run into similar issues? What other info can I provide
> that would be helpful?
>
> Thanks,
> Burke
