Re: Performance of RowSimilarityJob

Ted Dunning Fri, 26 Sep 2014 15:56:41 -0700

Can you say how many words you are seeing?

How many unique bigrams?


As Suneel asked, which version of Mahout?
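
As a rough illustration of why these counts matter, here is a minimal Python sketch (not Mahout code) that reports the word count, the number of unique bigrams, and an estimate of how many document pairs a pairwise step would have to emit. It assumes whitespace tokenization and assumes one emitted record per pair of documents sharing a bigram, so a bigram that appears in df documents contributes df * (df - 1) / 2 pairs. The names `bigrams` and `corpus_stats` are hypothetical helpers introduced only for this example.

```python
from collections import Counter
from itertools import islice


def bigrams(tokens):
    """Yield adjacent token pairs from a list of tokens."""
    return zip(tokens, islice(tokens, 1, None))


def corpus_stats(docs):
    """Return (total_words, unique_bigrams, estimated_pairs) for an iterable of strings.

    estimated_pairs assumes one record per pair of documents that share a
    bigram, which is an assumption about the pairwise step, not a measurement.
    """
    total_words = 0
    doc_freq = Counter()  # bigram -> number of documents containing it
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization (assumption)
        total_words += len(tokens)
        doc_freq.update(set(bigrams(tokens)))  # count each bigram once per document
    # A bigram present in df documents yields df * (df - 1) / 2 document pairs.
    estimated_pairs = sum(df * (df - 1) // 2 for df in doc_freq.values())
    return total_words, len(doc_freq), estimated_pairs
```

Running this on a sample of the corpus should show whether a handful of very common bigrams dominate the pair count, since the estimate grows quadratically with document frequency.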



On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <[email protected]>
wrote:

> I've been implementing the RowSimilarityJob on our 40-node cluster and have
> run into some serious performance issues.
>
> I'm trying to run the job on a corpus of just over 2 million documents using
> bi-grams.  When I get to the pairwise similarity step (CooccurrencesMapper
> and SimilarityReducer), I run out of space on HDFS because the job
> generates over 5 terabytes of output data.
>
> Has anybody else run into similar issues?  What other info can I provide
> that would be helpful?
>
> Thanks,
> Burke
>
