I should add that I sometimes see a URL filter exception written to the reduce log. I don't understand why this is the case; all ParseSegment.reduce() does is collect key/value pairs.
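For reference, here is a minimal sketch of the kind of identity-style, collect-only reduce I mean (assuming the classic org.apache.hadoop.mapred API that Nutch 1.x uses; this is an illustration, not the verbatim Nutch source):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Identity-style reduce: no computation at all, just emit the first
    // value seen for each key, i.e. pure key/value collection.
    public class IdentityCollectReducer extends MapReduceBase
        implements Reducer<Text, Writable, Text, Writable> {

      @Override
      public void reduce(Text key, Iterator<Writable> values,
          OutputCollector<Text, Writable> output, Reporter reporter)
          throws IOException {
        output.collect(key, values.next());
      }
    }

If that is accurate, a URL filter exception in the reduce log would have to come from something else running inside the reduce task, e.g. the output format writing the segment (I believe Nutch's ParseOutputFormat filters and normalizes outlinks on write, but I may be wrong about that).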
I should also point out that most reducers finish in reasonable time; it is always one task stalling the job to excessive length (as a workaround I sketch speculative execution after the quoted message below, but I would like to understand the cause). The cluster is homogeneous, and that is not an assumption (I know the fallacies of distributed computing ;) ). The server stalling the process is identical to all the others, and the replication factor is only 2 for all files except the crawl db. Please enlighten me.

> Hi,
>
> Any idea why the reducer of the parse job is as slow as a snail taking a
> detour? There is no processing in the reducer; all it does is copy the
> keys and values.
>
> The reduce step (meaning the last 33% of the reducer) is even slower than
> the whole parsing done in the mapper! It is even slower than the whole
> fetch job, while it is the fetcher that produces the most output (high
> I/O).
>
> A running cycle has the fetcher writing 70 GiB and 10 GiB to HDFS (total
> amounts) while the reducer has 7 times less data to write and no
> processing! Yet it takes about 3 times longer to complete. Stunning
> figures!
>
> This excessive run time became apparent only when I significantly
> increased the number of URLs to generate (topN). When topN was lower,
> the difference between the run times of the fetch and parse jobs was a
> lot smaller; usually it was the fetcher being slow because of merging
> the spills.
>
> Any thoughts?
>
> Thanks
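For completeness, this is the workaround I mean by speculative execution: the framework re-launches a straggling reduce attempt on another node and keeps whichever finishes first. A minimal sketch (the property name assumes Hadoop 0.20/1.x, which Nutch 1.x runs on; check your version):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeReduceConfig {
      // Enable reduce-side speculative execution: a slow reduce attempt
      // is duplicated on another node, the first to finish wins, and the
      // other attempt is killed.
      public static JobConf withSpeculativeReduce(JobConf conf) {
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
        return conf;
      }
    }

The same flag can of course go in mapred-site.xml instead. It only masks a straggler, though; it does not explain why one node, identical to all the others, stalls in the first place.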

