It's important to note that the performance of RowSimilarityJob depends heavily on the sparsity of the input data, because in general comparing all pairs of rows is a quadratic problem that does not scale.
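To make the quadratic part concrete: a row with k non-zero entries contributes k*(k-1)/2 cooccurring pairs (roughly what the COOCCURRENCES counter reflects), so a few dense rows can dominate the total work. A quick back-of-the-envelope sketch with made-up row densities:

  # made-up row densities; pairs per row grow quadratically with density
  for k in 50 500 5000; do
    echo "row with $k non-zeros -> $(( k * (k - 1) / 2 )) pairs"
  done

That is why pruning highly frequent terms before running the job (the question raised below) matters so much: it caps the density of the rows.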
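On the timeout tweaking discussed in the thread below: if a longer timeout really is needed for one job, it does not have to be a cluster-wide change. In Hadoop 1.x the relevant property is mapred.task.timeout (in milliseconds), and since the Mahout drivers run through ToolRunner it can usually be passed as a generic -D option. This is only a sketch; the paths are placeholders and the remaining rowsimilarity options are omitted:

  # hypothetical invocation; 6000000 ms = 100 minutes
  # generic -D options must precede the job's own arguments
  mahout rowsimilarity \
    -Dmapred.task.timeout=6000000 \
    -i b2/matrix \
    -o b2/similarity

If the -D route does not take effect in your setup, the same property can be set in mapred-site.xml. That said, as noted below, a task that stops reporting progress for hours usually points at a problem other than the timeout value itself.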
2012/7/12 Sebastian Schelter <[email protected]>:
> Sorry, I missed that it's more than one machine. Could you provide
> the values for the counters from RowSimilarityJob (ROWS,
> COOCCURRENCES, PRUNED_COOCCURRENCES)?
>
> Best,
> Sebastian
>
> 2012/7/12 Pat Ferrel <[email protected]>:
>> Thanks, actually there are two machines. I am testing before spending
>> on AWS. It's failing the test in this case.
>>
>> BTW I ran the same setup with 150,000 docs and 250,000 terms with a
>> much lower timeout (30000000) and all worked fine. I was using 0.6 at
>> the time and am not sure whether 0.8 has ever completed a
>> rowsimilarity of any size. Small runs work fine on my laptop.
>>
>> I smell some kind of problem other than simple performance. In any
>> case, in a perfect world isn't the code supposed to check in often
>> enough that the cluster config doesn't need to be tweaked for a
>> specific job?
>>
>> It may be some problem of mine, of course. I see no obvious Hadoop or
>> Mahout errors, but there are many places to look.
>>
>> With a 100 minute timeout I am currently at the pause between map and
>> reduce. If it fails, would you like any specific logs?
>>
>>
>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>
>>> To be honest, I don't think it makes a lot of sense to test a Hadoop
>>> job on a single machine. It's pretty obvious that you will get
>>> terrible performance.
>>>
>>> 2012/7/12 Pat Ferrel <[email protected]>:
>>>>
>>>> BTW the timeout is 1800, but the task runs over 9 hours in total
>>>> before each failure. This causes the job to take 27 hrs (after three
>>>> tries) to fail completely. Oh, bother...
>>>>
>>>> The timeout seems to happen during the last map, when the mappers
>>>> reach 100% but are still running. Maybe some kind of cleanup is
>>>> happening? The first reducer is still "pending". The reducer never
>>>> gets a chance to start.
>>>>
>>>> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
>>>> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
>>>> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
>>>> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
>>>> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
>>>> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
>>>> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
>>>> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
>>>> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
>>>> ---after a very long wait (9 hrs or so) insert fail here--->
>>>>
>>>> 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs,
>>>> 76,000 terms.
>>>>
>>>> Any other info you need, please ask.
>>>>
>>>> I'm about to try cranking the timeout up to a couple of hours, but I
>>>> suspect there is something else going on here.
>>>>
>>>>
>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>
>>>>> I have a custom Lucene stemming analyzer that filters out stop
>>>>> words, and I use the following seq2sparse invocation. The -x 40 is
>>>>> the only other thing that affects tossing frequent terms and, as I
>>>>> understand it, tosses any term that appears in over 40% of the docs.
>>>>>
>>>>> mahout seq2sparse \
>>>>>   -i b2/seqfiles/ \
>>>>>   -o b2/vectors/ \
>>>>>   -ow \
>>>>>   -chunk 2000 \
>>>>>   -x 40 \
>>>>>   -seq \
>>>>>   -n 2 \
>>>>>   -nv \
>>>>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>
>>>>>
>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>
>>>>>> Hi Pat,
>>>>>>
>>>>>> have you removed highly frequent terms before launching the
>>>>>> rowsimilarity job?
>>>>>>
>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>
>>>>>>> I've been trying to get a rowsimilarity job to complete. It
>>>>>>> continues to time out on a
>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've upped
>>>>>>> the timeout to 30 minutes now. There are no errors in the logs
>>>>>>> that I can see, and no other task I've tried is acting like this.
>>>>>>> Is this expected? Shouldn't the task check in more often?
>>>>>>>
>>>>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores, so it is
>>>>>>> a bit slow anyway; still, I shouldn't have to turn up the timeout
>>>>>>> so high, should I?
