Yeah... that is pretty ancient.

On Fri, Sep 26, 2014 at 4:02 PM, Burke Webster <[email protected]> wrote:
> We are currently using 0.7, so that could be the issue. Last I looked, I
> believe we had around 22 million unique bi-grams in the dictionary.
>
> I can look into the newer code and see if that fixes our problems.
>
> On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <[email protected]> wrote:
>
> > Can you say how many words you are seeing?
> >
> > How many unique bigrams?
> >
> > As Suneel asked, which version of Mahout?
> >
> > On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <[email protected]> wrote:
> >
> > > I've been implementing the RowSimilarityJob on our 40-node cluster and
> > > have run into some serious performance issues.
> > >
> > > Trying to run the job on a corpus of just over 2 million documents
> > > using bi-grams. When I get to the pairwise similarity step
> > > (CooccurrencesMapper and SimilarityReducer) I am running out of space
> > > on HDFS because the job is generating over 5 terabytes of output data.
> > >
> > > Has anybody else run into similar issues? What other info can I
> > > provide that would be helpful?
> > >
> > > Thanks,
> > > Burke
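
For a rough sense of why a pairwise co-occurrence step like the one described in the quoted message can produce so much intermediate data, here is a back-of-the-envelope sketch. It assumes (this is a simplification, not a statement of what any particular Mahout version does) that every column with n non-zero entries emits one record per unordered pair of entries, so n*(n-1)/2 records. The function names, the 16-bytes-per-record figure, and the column sizes are all hypothetical toy values, not numbers from the actual corpus.

```python
def pairs_emitted(column_sizes):
    """Total unordered pairs, assuming a column with n non-zero
    entries emits n * (n - 1) / 2 co-occurrence records."""
    return sum(n * (n - 1) // 2 for n in column_sizes)


def estimated_bytes(column_sizes, bytes_per_pair=16):
    """Rough output size; bytes_per_pair is a made-up illustrative figure."""
    return pairs_emitted(column_sizes) * bytes_per_pair


def cap_columns(column_sizes, max_per_column):
    """Model down-sampling by capping how many entries a column may contribute."""
    return [min(n, max_per_column) for n in column_sizes]


if __name__ == "__main__":
    # Hypothetical toy data: three columns with 10, 100 and 1000 non-zero entries.
    sizes = [10, 100, 1000]
    print(pairs_emitted(sizes))                    # 504495
    print(estimated_bytes(sizes))                  # 8071920
    print(pairs_emitted(cap_columns(sizes, 100)))  # 9945
```

In this toy case the single 1000-entry column accounts for 499,500 of the 504,495 pairs, which is why a few very dense columns can dominate the output, and why capping the number of entries per column shrinks it so sharply.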
