Hi,

Any idea why the reducer of the parse job is as slow as a snail taking a 
detour? There is no processing in the reducer; all it does is copy the keys 
and values.

The reduce step (meaning the last 33% of the reducer) is even slower than the 
whole parsing done in the mapper! It is even slower than the whole fetch job, 
even though it is the fetcher that produces the most output (high I/O).

A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total amount) 
while the reducer has 7 times less data to write and no processing! Yet it 
takes about 3 times longer to complete. Stunning figures!

This excessive run time became apparent only when I significantly increased 
the number of URLs to generate (topN). When the topN was lower, the difference 
between the run times of the fetch and parse jobs was a lot smaller; usually 
it was the fetcher being slow because of merging the spills.

Any thoughts? 

Thanks
