Which Mahout version are you using? Please work off of 0.9; there was a
performance issue in RSJ (RowSimilarityJob) that was fixed in 0.9.

On Fri, Sep 26, 2014 at 4:23 PM, Burke Webster <[email protected]>
wrote:

> I've been implementing the RowSimilarityJob on our 40-node cluster and have
> run into some serious performance issues.
>
> I'm trying to run the job on a corpus of just over 2 million documents using
> bi-grams.  When I get to the pairwise similarity step (CooccurrencesMapper
> and SimilarityReducer) I run out of space on HDFS because the job is
> generating over 5 terabytes of output data.
>
> Has anybody else run into similar issues?  What other info can I provide
> that would be helpful?
>
> Thanks,
> Burke
>
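
For a rough sense of why the pairwise step can produce so much output, here
is a minimal sketch. It assumes the step emits one record for every pair of
rows (documents) that share a column (bi-gram), so a bi-gram found in n
documents contributes n*(n-1)/2 pairs. The function names and the toy corpus
are hypothetical and are not part of Mahout; this is only an estimate of the
growth, not a diagnosis of the job above.

```python
from collections import Counter


def pairs_for_column(n):
    """Row pairs produced by one column that is nonzero in n rows."""
    return n * (n - 1) // 2


def cooccurrence_pairs(docs):
    """Total row pairs across all columns for a corpus.

    docs is an iterable of token lists (hypothetical input format).
    Each distinct token is treated as one column of the matrix.
    """
    doc_freq = Counter()
    for tokens in docs:
        # Count each token once per document (document frequency).
        doc_freq.update(set(tokens))
    return sum(pairs_for_column(n) for n in doc_freq.values())


if __name__ == "__main__":
    # Toy corpus: "a" occurs in 3 docs (3 pairs), "b" in 2 docs (1 pair).
    toy = [["a", "b"], ["a", "b"], ["a"]]
    print(cooccurrence_pairs(toy))

    # A single bi-gram present in 100,000 documents already yields
    # roughly 5 billion pairs under this assumption.
    print(pairs_for_column(100_000))
```

Under this assumption the output is dominated by the most frequent bi-grams,
so checking the document-frequency distribution of the corpus is one way to
see whether a handful of very common columns account for most of the volume.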
