Re: RowSimilarity error

Pat Ferrel Wed, 11 Jul 2012 15:51:17 -0700

BTW the timeout is 1800 but the task in total runs over 9 hours before each failure. This causes the job to take (after three tries) 27 hrs to completely fail. Oh, bother...

The timeout seems to happen during the last map, when the mappers have reached 100% but are still running. Maybe some kind of cleanup is happening? The first reducer is still "pending". The reducer never gets a chance to start.


12/07/11 11:09:45 INFO mapred.JobClient:  map 92% reduce 0%
12/07/11 11:11:06 INFO mapred.JobClient:  map 93% reduce 0%
12/07/11 11:12:51 INFO mapred.JobClient:  map 94% reduce 0%
12/07/11 11:15:22 INFO mapred.JobClient:  map 95% reduce 0%
12/07/11 11:18:43 INFO mapred.JobClient:  map 96% reduce 0%
12/07/11 11:24:32 INFO mapred.JobClient:  map 97% reduce 0%
12/07/11 11:27:40 INFO mapred.JobClient:  map 98% reduce 0%
12/07/11 11:30:53 INFO mapred.JobClient:  map 99% reduce 0%
12/07/11 11:36:35 INFO mapred.JobClient:  map 100% reduce 0%
---after a very long wait (9hrs or so) insert fail here--->

8 core, 2 machine cluster with 8G RAM per machine; 32,000 docs, 76,000 terms

Any other info you need please ask.

I'm about to try cranking the timeout up to a couple of hours, but I suspect there is something else going on here.
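
For anyone trying the same thing, here is a minimal sketch of turning a timeout in hours into a command-line option. It assumes a Hadoop 1.x style setup where the task timeout is the `mapred.task.timeout` property and is given in milliseconds; the helper names `timeout_millis` and `timeout_option` are made up for illustration and are not part of Hadoop or Mahout.

```python
def timeout_millis(hours: float) -> int:
    """Convert a timeout in hours to milliseconds."""
    return int(hours * 60 * 60 * 1000)


def timeout_option(hours: float, prop: str = "mapred.task.timeout") -> str:
    """Build a -D style option string for the given timeout.

    The property name is an assumption (Hadoop 1.x naming); check the
    configuration reference for the Hadoop version actually in use.
    """
    return f"-D{prop}={timeout_millis(hours)}"


if __name__ == "__main__":
    print(timeout_option(0.5))  # the 30 minute timeout mentioned in the thread
    print(timeout_option(2))    # "a couple of hours"
```

Whether the job accepts the option in this form depends on how it is launched, so treat the output as a starting point rather than a verified command line.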


On 7/11/12 10:35 AM, Pat Ferrel wrote:

I have a custom Lucene stemming analyzer that filters out stop words, and I use the following seq2sparse command. The -x 40 is the only other thing that affects tossing frequent terms and, as I understand things, it tosses any term that appears in over 40% of the docs.
mahout seq2sparse \
    -i b2/seqfiles/ \
    -o b2/vectors/ \
    -ow \
    -chunk 2000 \
    -x 40 \
    -seq \
    -n 2 \
    -nv \
    -a com.finderbots.analyzers.LuceneStemmingAnalyzer
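
A rough sketch of why the frequent-term cutoff matters for a similarity job. It assumes the cooccurrence step has to consider every pair of documents that share a term, so a term found in d documents contributes d*(d-1)/2 pairs; that is an assumption about how the job behaves, not something confirmed in this thread. The function name `cooccurring_pairs` is hypothetical. The 32,000 docs and the 40% cutoff are the numbers quoted in this thread.

```python
def cooccurring_pairs(doc_freq: int) -> int:
    """Unordered document pairs that share a term present in doc_freq docs."""
    return doc_freq * (doc_freq - 1) // 2


num_docs = 32_000        # corpus size quoted in the thread
max_df_percent = 40      # the -x 40 cutoff
max_doc_freq = num_docs * max_df_percent // 100  # most docs a surviving term can appear in

if __name__ == "__main__":
    for df in (10, 100, 1_000, max_doc_freq):
        print(f"term in {df:>6} docs -> {cooccurring_pairs(df):>12,} pairs")
```

Under that assumption, a single term sitting right at the 40% cutoff accounts for tens of millions of pairs on its own, so a much lower cutoff may be worth trying.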


On 7/11/12 9:18 AM, Sebastian Schelter wrote:
Hi Pat,

have you removed highly frequent terms before launching rowsimilarityjob?

On 11.07.2012 18:14, Pat Ferrel wrote:

I've been trying to get a rowsimilarity job to complete. It continues to
time out on a RowSimilarityJob-CooccurrencesMapper-Reducer task so I've
upped the timeout to 30 minutes now. There are no errors in the logs
that I can see and no other task I've tried is acting like this. Is this
expected? Shouldn't the task check in more often?

It's doing 34,000 docs with 40 sim docs each on 8 cores so it is a bit
slow anyway, still I shouldn't have to turn up the timeout so high
should I?
