Thanks for the feedback everybody. I'll give 0.9 a run. Thanks!

Sent from my iPhone
> On Sep 26, 2014, at 5:10 PM, Suneel Marthi <[email protected]> wrote:
>
> I had seen the issue you are reporting when running CooccurrencesMapper on
> a 2M document corpus on an 80 node cluster. The job would be stuck in
> CooccurrencesMapper forever.
>
> This has been fixed in 0.9 (I have not had a chance to try it out on the
> size and cluster I had before), so it would be good if you could try
> running with 0.9.
>
> P.S. 0.7 is not supported anymore and Mahout's come a long way since 0.7,
> so please upgrade to 0.9.
>
> On Fri, Sep 26, 2014 at 7:02 PM, Burke Webster <[email protected]>
> wrote:
>
>> We are currently using 0.7, so that could be the issue. Last I looked, I
>> believe we had around 22 million unique bi-grams in the dictionary.
>>
>> I can look into the newer code and see if that fixes our problems.
>>
>> On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>>> Can you say how many words you are seeing?
>>>
>>> How many unique bigrams?
>>>
>>> As Suneel asked, which version of Mahout?
>>>
>>> On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <[email protected]>
>>> wrote:
>>>
>>>> I've been implementing the RowSimilarityJob on our 40-node cluster and
>>>> have run into some serious performance issues.
>>>>
>>>> I am trying to run the job on a corpus of just over 2 million documents
>>>> using bi-grams. When I get to the pairwise similarity step
>>>> (CooccurrencesMapper and SimilarityReducer), I am running out of space
>>>> on HDFS because the job is generating over 5 terabytes of output data.
>>>>
>>>> Has anybody else run into similar issues? What other info can I provide
>>>> that would be helpful?
>>>>
>>>> Thanks,
>>>> Burke
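[Editor's note: the 5 TB figure reported above is roughly what a back-of-the-envelope estimate predicts, since the pairwise step emits on the order of t*(t-1)/2 term-pair records per document with t distinct terms. The sketch below uses assumed, illustrative numbers (500 distinct bi-grams per document, ~24 bytes per record); these are not figures from the thread.]

```python
# Rough estimate of the intermediate data the pairwise similarity step can
# emit. For a document with t distinct terms, the co-occurrence mapper emits
# on the order of t*(t-1)/2 term-pair records. The per-record byte size and
# terms-per-document values below are assumptions for illustration only.

def estimated_pairwise_bytes(num_docs, terms_per_doc, bytes_per_record=24):
    """Estimate total mapper output for the pairwise similarity step."""
    pairs_per_doc = terms_per_doc * (terms_per_doc - 1) // 2
    return num_docs * pairs_per_doc * bytes_per_record

# 2M documents with an assumed ~500 distinct bi-grams each:
size_bytes = estimated_pairwise_bytes(2_000_000, 500)
print(f"~{size_bytes / 1e12:.1f} TB")  # roughly 6.0 TB, same order as reported
```

Because the volume grows quadratically in distinct terms per document, pruning is usually more effective than adding disk: RowSimilarityJob exposes a `--threshold` option to drop low-similarity pairs and `--maxSimilaritiesPerRow` to cap output per row, and 0.9's down-sampling fixes referenced above address the same blow-up.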
