Thanks for shedding some light. I was already looking for filters/normalizers 
in the reduce step itself but couldn't find them. I forgot to think about the 
job's output format. Makes sense indeed.

Cheers

On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
> Hi Markus,
> 
> You are right in thinking that the reduce step does not do much in itself.
> It is not so much the reduce step that is likely to be the source of your
> problem as the URL filtering / normalising within ParseOutputFormat.
> Basically we get outlinks as a result of the parse, and when writing the
> output to HDFS we need to filter / normalise them.
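> 
> For every outlink that boils down to roughly this (a paraphrased sketch,
> not the verbatim source):
> 
>   // inside ParseOutputFormat's record writer, once per outlink
>   try {
>     toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
>     if (toUrl != null) {
>       toUrl = filters.filter(toUrl);
>     }
>   } catch (Exception e) {
>     toUrl = null;
>   }
>   if (toUrl == null) {
>     continue; // outlink dropped
>   }
> 
> Every single outlink passes through the full normaliser and filter
> chains before it is written.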
> 
> I have seen problems on large crawls with ridiculously large URLs which put
> the normalisation in disarray, with the symptoms you described. You can add
> a trace in the log before normalising to see what the URLs look like, and
> add a custom normaliser which prevents large URLs from being processed.
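> 
> A minimal sketch of such a length-capping normaliser, assuming the
> Nutch 1.x URLNormalizer plugin interface; the class name, the 512-char
> cap and the property name are illustrative choices, not part of Nutch:
> 
>   import java.net.MalformedURLException;
> 
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.nutch.net.URLNormalizer;
> 
>   public class MaxLengthURLNormalizer implements URLNormalizer {
> 
>     private Configuration conf;
>     private int maxLength = 512; // illustrative default cap
> 
>     public String normalize(String urlString, String scope)
>         throws MalformedURLException {
>       // Returning null drops the URL before the expensive regex
>       // normalisers ever see it.
>       if (urlString != null && urlString.length() > maxLength) {
>         return null;
>       }
>       return urlString;
>     }
> 
>     public void setConf(Configuration conf) {
>       this.conf = conf;
>       // "urlnormalizer.max.length" is a made-up property name
>       maxLength = conf.getInt("urlnormalizer.max.length", maxLength);
>     }
> 
>     public Configuration getConf() {
>       return conf;
>     }
>   }
> 
> Wrap it in a plugin and put it first in urlnormalizer.order so the
> chain short-circuits before the regex normaliser runs; if I remember
> right, a null from normalize() makes the chain drop the URL.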
> 
> As usual, jstack is your friend and will confirm that this is where
> the problem is.
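> 
> For example, find the PID of the stuck reduce task's child JVM on the
> node and run:
> 
>   jstack <pid>
> 
> If an oversized URL is the culprit you will typically see the reduce
> thread deep in java.util.regex frames under the regex normaliser.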
> 
> HTH
> 
> Julien
> 
> On 30 August 2011 23:39, Markus Jelsma <[email protected]> wrote:
> > I should add that I sometimes see a URL filter exception written to the
> > reduce log. I don't understand why this is the case; all the
> > ParseSegment.reduce() code does is collect key/value data.
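> > 
> > For reference, the whole reduce body is essentially just this (a
> > sketch from memory, the exact signature may differ per version):
> > 
> >   public void reduce(Text key, Iterator<Writable> values,
> >       OutputCollector<Text, Writable> output, Reporter reporter)
> >       throws IOException {
> >     output.collect(key, values.next()); // pure pass-through, no work
> >   }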
> > 
> > I should also point out that most reducers finish in reasonable time;
> > it's always one task stalling the job to excessive lengths. The cluster
> > is homogeneous, and this is not an assumption (I know the fallacies of
> > distributed computing ;) ). A server stalling the process is identical
> > to all the others, and the replication factor is only 2 for all files
> > except the crawl db.
> > 
> > Please enlighten me.
> > 
> > > Hi,
> > > 
> > > Any idea why the reducer of the parse job is as slow as a snail
> > > taking a detour? There is no processing in the reducer; all it does
> > > is copy the keys and values.
> > > 
> > > The reduce step (meaning the last 33% of the reducer) is even slower
> > > than all the parsing done in the mapper! It is even slower than the
> > > whole fetch job, even though it is the fetcher that produces the most
> > > output (high I/O).
> > > 
> > > In a running cycle the fetcher writes 70GiB and 10GiB to HDFS (total
> > > amount), while the reducer has a seventh of that data to write and no
> > > processing to do! Yet it takes about 3 times longer to complete.
> > > Stunning figures!
> > > 
> > > This excessive run time became apparent only when I significantly
> > > increased the number of URLs to generate (topN). When the topN was
> > > lower, the difference between the run times of the fetch and parse
> > > jobs was a lot smaller; usually it was the fetcher being slow because
> > > of merging the spills.
> > > 
> > > Any thoughts?
> > > 
> > > Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
