It's important to note that the performance of RowSimilarityJob depends heavily on the sparsity of the input data, because in general comparing all pairs of rows is a quadratic problem that does not scale.
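To make the quadratic part concrete: a row with k non-zero entries contributes k*(k-1)/2 cooccurring pairs (roughly what the COOCCURRENCES counter reflects), so a few dense rows can dominate the total work. A quick back-of-the-envelope sketch with made-up row densities:

  # made-up row densities; pairs per row grow quadratically with density
  for k in 50 500 5000; do
    echo "row with $k non-zeros -> $(( k * (k - 1) / 2 )) pairs"
  done

That is why pruning highly frequent terms before running the job (the question raised below) matters so much: it caps the density of the rows.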
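On the timeout tweaking discussed in the thread below: if a longer timeout really is needed for one job, it does not have to be a cluster-wide change. In Hadoop 1.x the relevant property is mapred.task.timeout (in milliseconds), and since the Mahout drivers run through ToolRunner it can usually be passed as a generic -D option. This is only a sketch; the paths are placeholders and the remaining rowsimilarity options are omitted:

  # hypothetical invocation; 6000000 ms = 100 minutes
  # generic -D options must precede the job's own arguments
  mahout rowsimilarity \
    -Dmapred.task.timeout=6000000 \
    -i b2/matrix \
    -o b2/similarity

If the -D route does not take effect in your setup, the same property can be set in mapred-site.xml. That said, as noted below, a task that stops reporting progress for hours usually points at a problem other than the timeout value itself.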
2012/7/12 Sebastian Schelter <[email protected]>:
> Sorry, I missed that it's more than one machine. Could you provide
> the values for the counters from RowSimilarityJob (ROWS,
> COOCCURRENCES, PRUNED_COOCCURRENCES)?
>
> Best,
> Sebastian
>
> 2012/7/12 Pat Ferrel <[email protected]>:
>> Thanks, actually there are two machines. I am testing before spending
>> on AWS. It's failing the test in this case.
>>
>> BTW I ran the same setup with 150,000 docs and 250,000 terms with a
>> much lower timeout (30000000) and all worked fine. I was using 0.6 at
>> the time and am not sure whether 0.8 has ever completed a
>> rowsimilarity of any size. Small runs work fine on my laptop.
>>
>> I smell some kind of problem other than simple performance. In any
>> case, in a perfect world isn't the code supposed to check in often
>> enough that the cluster config doesn't need to be tweaked for a
>> specific job?
>>
>> It may be some problem of mine, of course. I see no obvious Hadoop or
>> Mahout errors, but there are many places to look.
>>
>> With a 100 minute timeout I am currently at the pause between map and
>> reduce. If it fails, would you like any specific logs?
>>
>>
>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>
>>> To be honest, I don't think it makes a lot of sense to test a Hadoop
>>> job on a single machine. It's pretty obvious that you will get
>>> terrible performance.
>>>
>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>
>>>> BTW the timeout is 1800, but the task runs over 9 hours in total
>>>> before each failure. This causes the job to take 27 hrs (after three
>>>> tries) to fail completely. Oh, bother...
>>>>
>>>> The timeout seems to happen during the last map, when the mappers
>>>> reach 100% but are still running. Maybe some kind of cleanup is
>>>> happening? The first reducer is still "pending". The reducer never
>>>> gets a chance to start.
>>>>
>>>> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
>>>> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
>>>> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
>>>> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
>>>> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
>>>> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
>>>> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
>>>> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
>>>> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
>>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>>
>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs,
>>>> 76,000 terms.
>>>>
>>>> Any other info you need, please ask.
>>>>
>>>> I'm about to try cranking the timeout up to a couple of hours, but I
>>>> suspect there is something else going on here.
>>>>
>>>>
>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>
>>>>> I have a custom Lucene stemming analyzer that filters out stop
>>>>> words, and I use the following seq2sparse invocation. The -x 40 is
>>>>> the only other thing that affects tossing frequent terms and, as I
>>>>> understand it, tosses any term that appears in over 40% of the docs.
>>>>>
>>>>> mahout seq2sparse \
>>>>>   -i b2/seqfiles/ \
>>>>>   -o b2/vectors/ \
>>>>>   -ow \
>>>>>   -chunk 2000 \
>>>>>   -x 40 \
>>>>>   -seq \
>>>>>   -n 2 \
>>>>>   -nv \
>>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>
>>>>>
>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>
>>>>>> Hi Pat,
>>>>>>
>>>>>> have you removed highly frequent terms before launching the
>>>>>> rowsimilarity job?
>>>>>>
>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>
>>>>>>> I've been trying to get a rowsimilarity job to complete. It
>>>>>>> continues to time out on a
>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've upped
>>>>>>> the timeout to 30 minutes now. There are no errors in the logs
>>>>>>> that I can see, and no other task I've tried is acting like this.
>>>>>>> Is this expected? Shouldn't the task check in more often?
>>>>>>>
>>>>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores, so it is
>>>>>>> a bit slow anyway; still, I shouldn't have to turn up the timeout
>>>>>>> so high, should I?
