Can you say how many words you are seeing? How many unique bigrams?
As Suneel asked, which version of Mahout?

On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <[email protected]> wrote:

> I've been implementing the RowSimilarityJob on our 40-node cluster and have
> run into some serious performance issues.
>
> Trying to run the job on a corpus of just over 2 million documents using
> bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper
> and SimilarityReducer) I am running out of space on HDFS because the job is
> generating over 5 terabytes of output data.
>
> Has anybody else run into similar issues? What other info can I provide
> that would be helpful?
>
> Thanks,
> Burke
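
If it helps with gathering the word and bigram counts asked about in this thread, here is a rough Python sketch that could be run on a sample of the corpus. It is not part of Mahout: the function name `bigram_stats` and the whitespace tokenization are assumptions made for illustration. The pair count uses a simple cost model, also an assumption, in which a bigram that appears in n documents links n*(n-1)/2 document pairs; the real job may emit more or less depending on its options.

```python
from collections import Counter


def bigram_stats(docs):
    """Return (total_tokens, unique_bigrams, estimated_pairs) for an iterable of strings.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split, which will differ from the real analyzer.
    """
    total_tokens = 0
    doc_freq = Counter()  # bigram -> number of documents that contain it
    for text in docs:
        tokens = text.lower().split()
        total_tokens += len(tokens)
        # Count each bigram at most once per document.
        doc_freq.update(set(zip(tokens, tokens[1:])))
    # Assumed cost model: a bigram seen in n documents links n*(n-1)/2 document pairs.
    estimated_pairs = sum(n * (n - 1) // 2 for n in doc_freq.values())
    return total_tokens, len(doc_freq), estimated_pairs


if __name__ == "__main__":
    sample = ["a b c", "a b d", "a b"]
    print(bigram_stats(sample))
```

Under this model a handful of very frequent bigrams dominates the total, so the distribution of document frequencies is probably more informative than the raw count of unique bigrams.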
