Re: Performance of RowSimilarityJob

Ted Dunning Fri, 26 Sep 2014 16:09:40 -0700

Yeah... that is pretty ancient.

On Fri, Sep 26, 2014 at 4:02 PM, Burke Webster <[email protected]>
wrote:


> We are currently using 0.7 so that could be the issue.  Last I looked I
> believe we had around 22 million unique bi-grams in the dictionary.
>
> I can look into the newer code and see if that fixes our problems.
>
> On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <[email protected]>
> wrote:
>
> > Can you say how many words you are seeing?
> >
> > How many unique bigrams?
> >
> > As Suneel asked, which version of Mahout?
> >
> >
> >
> > On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <[email protected]>
> > wrote:
> >
> > > I've been implementing the RowSimilarityJob on our 40-node cluster and
> > > have run into some serious performance issues.
> > >
> > > I'm trying to run the job on a corpus of just over 2 million documents
> > > using bi-grams.  When I get to the pairwise similarity step
> > > (CooccurrencesMapper and SimilarityReducer) I am running out of space
> > > on HDFS because the job is generating over 5 terabytes of output data.
> > >
> > > Has anybody else run into similar issues?  What other info can I
> > > provide that would be helpful?
> > >
> > > Thanks,
> > > Burke
> > >
> >
>
