To be honest, I don't think it makes a lot of sense to test a Hadoop job on a single machine. It's pretty obvious that you will get terrible performance.
2012/7/12 Pat Ferrel <[email protected]>:
> BTW the timeout is 1800 but the task in total runs over 9 hours before
> each failure. This causes the job to take (after three tries) 27 hrs to
> completely fail. Oh, bother...
>
> The timeout seems to hit during the last map, when the mappers have
> reached 100% but are still running. Maybe some kind of cleanup is
> happening? The first reducer is still "pending"; it never gets a chance
> to start.
>
> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
> ---after a very long wait (9 hrs or so) insert fail here---
>
> 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs,
> 76,000 terms.
>
> Any other info you need, please ask.
>
> I'm about to try cranking the timeout up to a couple of hours, but I
> suspect there is something else going on here.
>
>
> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>
>> I have a custom Lucene stemming analyzer that filters out stop words
>> and use the following seq2sparse. The -x 40 is the only other option
>> that affects tossing frequent terms; as I understand it, it tosses any
>> term that appears in over 40% of the docs.
>>
>> mahout seq2sparse \
>>   -i b2/seqfiles/ \
>>   -o b2/vectors/ \
>>   -ow \
>>   -chunk 2000 \
>>   -x 40 \
>>   -seq \
>>   -n 2 \
>>   -nv \
>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>
>>
>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>
>>> Hi Pat,
>>>
>>> have you removed highly frequent terms before launching the
>>> rowsimilarity job?
>>>
>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>
>>>> I've been trying to get a rowsimilarity job to complete. It keeps
>>>> timing out on a RowSimilarityJob-CooccurrencesMapper-Reducer task,
>>>> so I've upped the timeout to 30 minutes now. There are no errors in
>>>> the logs that I can see, and no other task I've tried acts like
>>>> this. Is this expected? Shouldn't the task check in more often?
>>>>
>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores, so it is a
>>>> bit slow anyway; still, I shouldn't have to turn the timeout up so
>>>> high, should I?
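For what it's worth, the timeout being raised here is Hadoop's mapred.task.timeout, which is specified in milliseconds (the Hadoop 1.x default is 600000, i.e. 10 minutes). Rather than editing it per run, it can be raised cluster-wide in mapred-site.xml; a sketch, assuming the pre-YARN property name:

```xml
<!-- mapred-site.xml: raise the task timeout to 2 hours (value is in
     milliseconds). A value of 0 disables the timeout entirely, which
     is handy while debugging but risky in production, since a truly
     hung task would then never be killed. -->
<property>
  <name>mapred.task.timeout</name>
  <value>7200000</value>
</property>
```

It should also work per job via the generic option -Dmapred.task.timeout=7200000 on the mahout command line, since Mahout's drivers run through Hadoop's ToolRunner and pass generic options through.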

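On the "shouldn't the task check in more often?" question: a Hadoop task attempt only counts as alive while it reports progress, so a mapper that computes for a long stretch without emitting any output will trip mapred.task.timeout even though it is working fine. In Java mapper code the usual fix is to call context.progress() (or bump a counter) inside the long-running loop. As a self-contained illustration of the same keep-alive idea, here is a hypothetical Hadoop Streaming mapper, which reports progress by writing reporter:status: lines to stderr (the function name and batch size are made up for the example):

```python
import sys

def map_stream(records, out=sys.stdout, err=sys.stderr, every=10000):
    """Echo records unchanged, emitting a keep-alive every `every` records.

    Hadoop Streaming treats a stderr line of the form "reporter:status:..."
    as a progress report, which resets the mapred.task.timeout clock for
    this task attempt.
    """
    for n, line in enumerate(records):
        # ... expensive per-record work would go here ...
        if n % every == 0:
            err.write("reporter:status:processed %d records\n" % n)
        out.write(line)
```

In a real streaming job the script's entry point would just call map_stream(sys.stdin). Whether Mahout's CooccurrencesMapper already reports progress often enough is exactly the open question in this thread.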