Thanks for shedding some light. I was already looking for filters/normalizers in the reduce step itself but couldn't find them; I forgot to think about the job's output format. Makes sense indeed.
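Roughly what I have in mind for that custom normaliser is something like the sketch below. It's untested and written against the Nutch 1.x URLNormalizer interface (normalize(String url, String scope)); the class name and the urlnormalizer.maxlength.limit property are just placeholders I made up for the example.

package org.example.nutch.normalizer;

import java.net.MalformedURLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLNormalizer;

/**
 * Sketch of a length guard: refuses URLs above a configurable length so the
 * regex-based normalizers never have to chew on them.
 */
public class MaxLengthURLNormalizer implements URLNormalizer {

  /** Placeholder property name; pick whatever fits your config conventions. */
  private static final String MAX_LENGTH_KEY = "urlnormalizer.maxlength.limit";

  private Configuration conf;
  private int maxLength = 2048;

  public String normalize(String urlString, String scope)
      throws MalformedURLException {
    if (urlString != null && urlString.length() > maxLength) {
      // Returning null should cause the chain to discard the URL; if your
      // Nutch version handles null differently, throwing
      // MalformedURLException here would be the alternative.
      return null;
    }
    return urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    this.maxLength = conf.getInt(MAX_LENGTH_KEY, 2048);
  }

  public Configuration getConf() {
    return conf;
  }
}

It would still need to be listed in plugin.includes and, I assume, put first in urlnormalizer.order so the oversized URLs get dropped before any other normalizer runs.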
Cheers

On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
> Hi Markus,
>
> You are right in thinking that the reduce step does not do much in itself.
> It is not so much the reduce step which is likely to be the source of your
> problem but the URL filtering / normalizing within ParseOutputFormat.
> Basically we get outlinks as a result of the parse and when writing the
> output to HDFS we need to filter / normalise them.
>
> I have seen problems on large crawls with ridiculously large URLs which put
> the normalisation in disarray with the symptoms you described. You can add
> a trace in the log before normalising to see what the URLs look like and
> add a custom normaliser which prevents large URLs from being processed.
>
> As usual jstack is your friend and will confirm that this is where the
> problem is.
>
> HTH
>
> Julien
>
> On 30 August 2011 23:39, Markus Jelsma <[email protected]> wrote:
> > I should add that I sometimes see a URL filter exception written to the
> > reduce log. I don't understand why this is the case; all the
> > ParseSegment.reduce() code does is collect key/value data.
> >
> > I should also point out that most reducers finish in reasonable time and
> > it's always one task stalling the job to excessive lengths. The cluster
> > is homogeneous, this is not an assumption (I know the fallacies of
> > distributed computing ;) ). A server stalling the process is identical
> > to all others and replication factor is only 2 for all files except the
> > crawl db.
> >
> > Please enlighten me.
> >
> > > Hi,
> > >
> > > Any idea why the reducer of the parse job is as slow as a snail taking
> > > a detour? There is no processing in the reducer; all it does is copy
> > > the keys and values.
> > >
> > > The reduce step (meaning the last 33% of the reducer) is even slower
> > > than the whole parsing done in the mapper! It is even slower than the
> > > whole fetch job while it is the fetcher that produces the most output
> > > (high I/O).
> > >
> > > A running cycle has the fetcher writing 70GiB and 10GiB to HDFS (total
> > > amount) while the reducer has 7 times less data to write and no
> > > processing! Yet it takes about 3 times longer to complete, stunning
> > > figures!
> > >
> > > This excessive run time became apparent only when I significantly
> > > increased the number of URLs to generate (topN). When the topN was
> > > lower the difference between run times of the fetch and parse jobs
> > > was a lot smaller; usually it was the fetcher being slow because of
> > > merging the spills.
> > >
> > > Any thoughts?
> > >
> > > Thanks

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

