Hi Markus,

You are right in thinking that the reduce step does not do much in itself. The likely source of your problem is not the reduce step but the URL filtering / normalizing done in ParseOutputFormat: we get outlinks as a result of the parse, and when writing the output to HDFS we need to filter / normalise them.
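One way to keep oversized URLs out of that step, sketched here on the assumption that the RegexURLFilter plugin is enabled in your Nutch config and that ~2000 characters is a sensible cut-off for your crawl, is a reject rule in conf/regex-urlfilter.txt:

```
# Hypothetical rule: drop any URL of 2000+ characters
# (the threshold is an assumption; tune it for your crawl).
# Place it before the catch-all accept rule (+.) so it matches first.
-^.{2000,}
```

RegexURLFilter evaluates rules top-down and the first matching rule decides, so the rule must precede the final `+.` line.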
I have seen problems on large crawls where ridiculously large URLs threw the normalisation into disarray, with exactly the symptoms you describe. You can add a trace in the log before normalising to see what the URLs look like, and add a custom normaliser which prevents large URLs from being processed. As usual, jstack is your friend and will confirm that this is where the problem is.

HTH

Julien

On 30 August 2011 23:39, Markus Jelsma <[email protected]> wrote:
> I should add that I sometimes see a URL filter exception written to the
> reduce log. I don't understand why this is the case; all the
> ParseSegment.reduce() code does is collect key/value data.
>
> I also should point out that most reducers finish in reasonable time and
> it's always one task stalling the job to excessive lengths. The cluster is
> homogeneous; this is not an assumption (I know the fallacies of
> distributed computing ;) ). The server stalling the process is identical
> to all the others, and the replication factor is only 2 for all files
> except the crawl db.
>
> Please enlighten me.
>
> > Hi,
> >
> > Any idea why the reducer of the parse job is as slow as a snail taking
> > a detour? There is no processing in the reducer; all it does is copy
> > the keys and values.
> >
> > The reduce step (meaning the last 33% of the reducer) is even slower
> > than the whole parsing done in the mapper! It is even slower than the
> > whole fetch job, while it is the fetcher that produces the most output
> > (high I/O).
> >
> > A running cycle has the fetcher writing 70GiB and 10GiB to HDFS (total
> > amount), while the reducer has 7 times less data to write and no
> > processing! Yet it takes about 3 times longer to complete; stunning
> > figures!
> >
> > This excessive run time became apparent only when I significantly
> > increased the number of URLs to generate (topN). When the topN was
> > lower, the difference between the run times of the fetch and parse jobs
> > was a lot smaller; usually it was the fetcher being slow because of
> > merging the spills.
> >
> > Any thoughts?
> >
> > Thanks

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

