To be honest, I don't think it makes a lot of sense to test a Hadoop job on a single machine. It's pretty obvious that you will get terrible performance.
2012/7/12 Pat Ferrel <[email protected]>:
> BTW the timeout is 1800 but the task in total runs over 9 hours before
> each failure. This causes the job to take (after three tries) 27 hrs to
> completely fail. Oh, bother...
>
> The timeout seems to hit during the last map, when the mappers have
> reached 100% but are still running. Maybe some kind of cleanup is
> happening? The first reducer is still "pending"; it never gets a chance
> to start.
>
> 12/07/11 11:09:45 INFO mapred.JobClient: map 92% reduce 0%
> 12/07/11 11:11:06 INFO mapred.JobClient: map 93% reduce 0%
> 12/07/11 11:12:51 INFO mapred.JobClient: map 94% reduce 0%
> 12/07/11 11:15:22 INFO mapred.JobClient: map 95% reduce 0%
> 12/07/11 11:18:43 INFO mapred.JobClient: map 96% reduce 0%
> 12/07/11 11:24:32 INFO mapred.JobClient: map 97% reduce 0%
> 12/07/11 11:27:40 INFO mapred.JobClient: map 98% reduce 0%
> 12/07/11 11:30:53 INFO mapred.JobClient: map 99% reduce 0%
> 12/07/11 11:36:35 INFO mapred.JobClient: map 100% reduce 0%
> ---after a very long wait (9 hrs or so) insert fail here---
>
> 8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs,
> 76,000 terms.
>
> Any other info you need, please ask.
>
> I'm about to try cranking the timeout up to a couple of hours, but I
> suspect there is something else going on here.
>
>
> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>
>> I have a custom Lucene stemming analyzer that filters out stop words
>> and use the following seq2sparse. The -x 40 is the only other option
>> that affects tossing frequent terms; as I understand it, it tosses any
>> term that appears in over 40% of the docs.
>>
>> mahout seq2sparse \
>>   -i b2/seqfiles/ \
>>   -o b2/vectors/ \
>>   -ow \
>>   -chunk 2000 \
>>   -x 40 \
>>   -seq \
>>   -n 2 \
>>   -nv \
>>   -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>
>>
>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>
>>> Hi Pat,
>>>
>>> have you removed highly frequent terms before launching the
>>> rowsimilarity job?
>>>
>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>
>>>> I've been trying to get a rowsimilarity job to complete. It keeps
>>>> timing out on a RowSimilarityJob-CooccurrencesMapper-Reducer task,
>>>> so I've upped the timeout to 30 minutes now. There are no errors in
>>>> the logs that I can see, and no other task I've tried acts like
>>>> this. Is this expected? Shouldn't the task check in more often?
>>>>
>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores, so it is a
>>>> bit slow anyway; still, I shouldn't have to turn the timeout up so
>>>> high, should I?
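For what it's worth, the timeout being raised here is Hadoop's mapred.task.timeout, which is specified in milliseconds (the Hadoop 1.x default is 600000, i.e. 10 minutes). Rather than editing it per run, it can be raised cluster-wide in mapred-site.xml; a sketch, assuming the pre-YARN property name:

```xml
<!-- mapred-site.xml: raise the task timeout to 2 hours (value is in
     milliseconds). A value of 0 disables the timeout entirely, which
     is handy while debugging but risky in production, since a truly
     hung task would then never be killed. -->
<property>
  <name>mapred.task.timeout</name>
  <value>7200000</value>
</property>
```

It should also work per job via the generic option -Dmapred.task.timeout=7200000 on the mahout command line, since Mahout's drivers run through Hadoop's ToolRunner and pass generic options through.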

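On the "shouldn't the task check in more often?" question: a Hadoop task attempt only counts as alive while it reports progress, so a mapper that computes for a long stretch without emitting any output will trip mapred.task.timeout even though it is working fine. In Java mapper code the usual fix is to call context.progress() (or bump a counter) inside the long-running loop. As a self-contained illustration of the same keep-alive idea, here is a hypothetical Hadoop Streaming mapper, which reports progress by writing reporter:status: lines to stderr (the function name and batch size are made up for the example):

```python
import sys

def map_stream(records, out=sys.stdout, err=sys.stderr, every=10000):
    """Echo records unchanged, emitting a keep-alive every `every` records.

    Hadoop Streaming treats a stderr line of the form "reporter:status:..."
    as a progress report, which resets the mapred.task.timeout clock for
    this task attempt.
    """
    for n, line in enumerate(records):
        # ... expensive per-record work would go here ...
        if n % every == 0:
            err.write("reporter:status:processed %d records\n" % n)
        out.write(line)
```

In a real streaming job the script's entry point would just call map_stream(sys.stdin). Whether Mahout's CooccurrencesMapper already reports progress often enough is exactly the open question in this thread.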