I should add that I sometimes see a URL filter exception written to the reduce log. I don't understand why this is the case; all ParseSegment.reduce() does is collect key/value pairs.
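For reference, here is a minimal sketch of the kind of identity-style, collect-only reduce I mean (assuming the classic org.apache.hadoop.mapred API that Nutch 1.x uses; this is an illustration, not the verbatim Nutch source):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Identity-style reduce: no computation at all, just emit the first
    // value seen for each key, i.e. pure key/value collection.
    public class IdentityCollectReducer extends MapReduceBase
        implements Reducer<Text, Writable, Text, Writable> {

      @Override
      public void reduce(Text key, Iterator<Writable> values,
          OutputCollector<Text, Writable> output, Reporter reporter)
          throws IOException {
        output.collect(key, values.next());
      }
    }

If that is accurate, a URL filter exception in the reduce log would have to come from something else running inside the reduce task, e.g. the output format writing the segment (I believe Nutch's ParseOutputFormat filters and normalizes outlinks on write, but I may be wrong about that).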
I should also point out that most reducers finish in reasonable time; it is always one task stalling the job to excessive length (as a workaround I sketch speculative execution after the quoted message below, but I would like to understand the cause). The cluster is homogeneous, and that is not an assumption (I know the fallacies of distributed computing ;) ). The server stalling the process is identical to all the others, and the replication factor is only 2 for all files except the crawl db. Please enlighten me.

> Hi,
>
> Any idea why the reducer of the parse job is as slow as a snail taking a
> detour? There is no processing in the reducer; all it does is copy the
> keys and values.
>
> The reduce step (meaning the last 33% of the reducer) is even slower than
> the whole parsing done in the mapper! It is even slower than the whole
> fetch job, while it is the fetcher that produces the most output (high
> I/O).
>
> A running cycle has the fetcher writing 70 GiB and 10 GiB to HDFS (total
> amounts) while the reducer has 7 times less data to write and no
> processing! Yet it takes about 3 times longer to complete. Stunning
> figures!
>
> This excessive run time became apparent only when I significantly
> increased the number of URLs to generate (topN). When topN was lower,
> the difference between the run times of the fetch and parse jobs was a
> lot smaller; usually it was the fetcher being slow because of merging
> the spills.
>
> Any thoughts?
>
> Thanks
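For completeness, this is the workaround I mean by speculative execution: the framework re-launches a straggling reduce attempt on another node and keeps whichever finishes first. A minimal sketch (the property name assumes Hadoop 0.20/1.x, which Nutch 1.x runs on; check your version):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeReduceConfig {
      // Enable reduce-side speculative execution: a slow reduce attempt
      // is duplicated on another node, the first to finish wins, and the
      // other attempt is killed.
      public static JobConf withSpeculativeReduce(JobConf conf) {
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
        return conf;
      }
    }

The same flag can of course go in mapred-site.xml instead. It only masks a straggler, though; it does not explain why one node, identical to all the others, stalls in the first place.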

