Thanks for the feedback everybody. I'll give 0.9 a run. Thanks!

Sent from my iPhone
> On Sep 26, 2014, at 5:10 PM, Suneel Marthi <[email protected]> wrote:
>
> I had seen the issue you are reporting when running CooccurrencesMapper on
> a 2M document corpus on an 80 node cluster. The job would be stuck in
> CooccurrencesMapper forever.
>
> This has been fixed in 0.9 (I have not had a chance to try it out on the
> size and cluster I had before), so it would be good if you could try
> running with 0.9.
>
> P.S. 0.7 is not supported anymore and Mahout's come a long way since 0.7,
> so please upgrade to 0.9.
>
> On Fri, Sep 26, 2014 at 7:02 PM, Burke Webster <[email protected]>
> wrote:
>
>> We are currently using 0.7, so that could be the issue. Last I looked, I
>> believe we had around 22 million unique bi-grams in the dictionary.
>>
>> I can look into the newer code and see if that fixes our problems.
>>
>> On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>>> Can you say how many words you are seeing?
>>>
>>> How many unique bigrams?
>>>
>>> As Suneel asked, which version of Mahout?
>>>
>>> On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <[email protected]>
>>> wrote:
>>>
>>>> I've been implementing the RowSimilarityJob on our 40-node cluster and
>>>> have run into some serious performance issues.
>>>>
>>>> I am trying to run the job on a corpus of just over 2 million documents
>>>> using bi-grams. When I get to the pairwise similarity step
>>>> (CooccurrencesMapper and SimilarityReducer), I am running out of space
>>>> on HDFS because the job is generating over 5 terabytes of output data.
>>>>
>>>> Has anybody else run into similar issues? What other info can I provide
>>>> that would be helpful?
>>>>
>>>> Thanks,
>>>> Burke
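[Editor's note: the 5 TB figure reported above is roughly what a back-of-the-envelope estimate predicts, since the pairwise step emits on the order of t*(t-1)/2 term-pair records per document with t distinct terms. The sketch below uses assumed, illustrative numbers (500 distinct bi-grams per document, ~24 bytes per record); these are not figures from the thread.]

```python
# Rough estimate of the intermediate data the pairwise similarity step can
# emit. For a document with t distinct terms, the co-occurrence mapper emits
# on the order of t*(t-1)/2 term-pair records. The per-record byte size and
# terms-per-document values below are assumptions for illustration only.

def estimated_pairwise_bytes(num_docs, terms_per_doc, bytes_per_record=24):
    """Estimate total mapper output for the pairwise similarity step."""
    pairs_per_doc = terms_per_doc * (terms_per_doc - 1) // 2
    return num_docs * pairs_per_doc * bytes_per_record

# 2M documents with an assumed ~500 distinct bi-grams each:
size_bytes = estimated_pairwise_bytes(2_000_000, 500)
print(f"~{size_bytes / 1e12:.1f} TB")  # roughly 6.0 TB, same order as reported
```

Because the volume grows quadratically in distinct terms per document, pruning is usually more effective than adding disk: RowSimilarityJob exposes a `--threshold` option to drop low-similarity pairs and `--maxSimilaritiesPerRow` to cap output per row, and 0.9's down-sampling fixes referenced above address the same blow-up.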
