Hi, Any idea why the reducer of the parse job is as slow as a snail taking a detour? There is no processing in the reducer; all it does is copy the keys and values.
The reduce step (meaning the last 33% of the reducer) is even slower than the whole parsing done in the mapper! It is even slower than the whole fetch job, even though it is the fetcher that produces the most output (high I/O). A running cycle has a fetcher writing 70GiB and 10GiB to HDFS (total amount), while the reducer has 7 times less data to write and no processing! Yet it takes about 3 times longer to complete, stunning figures! This excessive run time became apparent only when I significantly increased the number of URLs to generate (topN). When topN was lower, the difference between the run times of the fetch and parse jobs was a lot smaller; usually it was the fetcher being slow because of merging the spills. Any thoughts? Thanks

