Sorry, I misread that — I missed that it's more than one machine. Could you provide the values of the counters from RowSimilarityJob (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES)?
Best,
Sebastian

2012/7/12 Pat Ferrel <[email protected]>:
> Thanks, actually there are two machines. I am testing before spending on
> AWS. It's failing the test in this case.
>
> BTW I ran the same setup with 150,000 docs and 250,000 terms with a much
> lower timeout (30000000) and it all worked fine. I was using 0.6 at the
> time, and I'm not sure whether 0.8 has ever completed a rowsimilarity job
> of any size. Small runs work fine on my laptop.
>
> I suspect some problem other than simple performance. In any case, in a
> perfect world isn't the code supposed to check in often enough that the
> cluster config doesn't need to be tweaked for a specific job?
>
> It may be some problem of mine, of course. I see no obvious Hadoop or
> Mahout errors, but there are many places to look.
>
> With a 100-minute timeout I am currently at the pause between map and
> reduce. If it fails, would you like any specific logs?
>
>
> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>
>> To be honest, I don't think it makes a lot of sense to test a Hadoop
>> job on a single machine. It's pretty obvious that you will get
>> terrible performance.
>>
>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>
>>> BTW the timeout is 1800 but the task runs over 9 hours in total before
>>> each failure. This causes the job to take 27 hours (after three tries)
>>> to fail completely. Oh, bother...
>>>
>>> The timeout seems to hit during the last map, when the mappers reach
>>> 100% but are still running. Maybe some kind of cleanup is happening?
>>> The first reducer is still "pending"; the reducer never gets a chance
>>> to start.
>>>
>>> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
>>> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
>>> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
>>> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
>>> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
>>> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
>>> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
>>> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
>>> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>
>>> 8-core, 2-machine cluster with 8 GB RAM per machine; 32,000 docs,
>>> 76,000 terms.
>>>
>>> Any other info you need, please ask.
>>>
>>> I'm about to try cranking the timeout up to a couple of hours, but I
>>> suspect there is something else going on here.
>>>
>>>
>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>
>>>> I have a custom Lucene stemming analyzer that filters out stop words,
>>>> and I use the following seq2sparse invocation. The -x 40 is the only
>>>> other option that affects tossing frequent terms; as I understand it,
>>>> it tosses any term that appears in over 40% of the docs.
>>>>
>>>> mahout seq2sparse \
>>>>   -i b2/seqfiles/ \
>>>>   -o b2/vectors/ \
>>>>   -ow \
>>>>   -chunk 2000 \
>>>>   -x 40 \
>>>>   -seq \
>>>>   -n 2 \
>>>>   -nv \
>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>
>>>>
>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>
>>>>> Hi Pat,
>>>>>
>>>>> have you removed highly frequent terms before launching the
>>>>> rowsimilarity job?
>>>>>
>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>
>>>>>> I've been trying to get a rowsimilarity job to complete. It keeps
>>>>>> timing out in a RowSimilarityJob-CooccurrencesMapper-Reducer task,
>>>>>> so I've upped the timeout to 30 minutes now. There are no errors in
>>>>>> the logs that I can see, and no other task I've tried is acting
>>>>>> like this.
>>>>>> Is this expected? Shouldn't the task check in more often?
>>>>>>
>>>>>> It's doing 34,000 docs with 40 similar docs each on 8 cores, so it
>>>>>> is a bit slow anyway. Still, I shouldn't have to turn the timeout
>>>>>> up so high, should I?
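For reference, the task timeout being adjusted throughout this thread is Hadoop's `mapred.task.timeout`, which is specified in milliseconds. A quick sketch of how the minute values mentioned above convert, plus a hypothetical per-job override (the paths are placeholders, and passing `-D` through the `mahout` driver assumes the driver forwards generic options to the job config, as Mahout's `AbstractJob`-based drivers generally do):

```shell
# mapred.task.timeout is set in milliseconds, so the thread's values
# convert like this:
thirty_min_ms=$(( 30 * 60 * 1000 ))    # the 30-minute setting
hundred_min_ms=$(( 100 * 60 * 1000 ))  # the later 100-minute attempt
echo "$thirty_min_ms $hundred_min_ms"  # prints "1800000 6000000"

# Hypothetical per-job override (input/output paths are placeholders):
#   mahout rowsimilarity -Dmapred.task.timeout=$hundred_min_ms -i <in> -o <out>
```

Raising the timeout only masks a task that reports no progress; the underlying question in the thread — whether the task should call in more often — still stands.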

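Given the `-x 40` setting in the seq2sparse invocation above and the corpus size quoted in the thread, the pruning cutoff works out as below. This is a sketch under the thread's own reading of `-x` (seq2sparse's maximum document-frequency percentage: terms appearing in more than that fraction of documents are tossed); check `mahout seq2sparse --help` for the exact semantics in your version:

```shell
# With -x 40 and a 32,000-document corpus, any term appearing in more
# than 40% of the documents is pruned:
docs=32000
max_df_percent=40
cutoff=$(( docs * max_df_percent / 100 ))
echo "$cutoff"   # prints "12800"
```

So a term surviving this filter appears in at most 12,800 of the 32,000 documents — which still leaves room for very dense rows, the usual cause of a slow CooccurrencesMapper, and is why the PRUNED_COOCCURRENCES counter asked for at the top of the thread is informative.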