Hi all, I successfully built and ran a Hadoop job based on Nutch 2.0 rc3. I have a very large seed list (around 100K URLs) and set the depth to 4. After two iterations, I found that one reduce task in the fetch phase is always very slow, roughly a 10x slowdown. As a result, even though the other 11 tasks (I configured 12 reduce tasks) have already finished, the whole job cannot advance to the next "parse" phase and on to the next iteration.
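In case it matters, the reduce task count is set in my configuration roughly like this (an illustrative sketch using the classic MapReduce property name, not my exact file):

    <!-- mapred-site.xml (or nutch-site.xml): number of reduce tasks per job -->
    <property>
      <name>mapred.reduce.tasks</name>
      <value>12</value>
    </property>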
I diagnosed this a bit; the main problem seems to be that the task is fetching pages at a very slow rate, e.g.:

    10/10 spinwaiting/active, 2290 pages, 31 errors, 0.5 0.4 pages/s, 101 71 kb/s, 500 URLs in 1 queues > reduce

I guess the slowest task is stuck fetching URLs from slow remote websites; is that right? Since the runtime of a MapReduce job is determined by its slowest task, I guess there is little that can be done once the "fetch" map tasks have finished and the URLs have been partitioned to the reducers. I am wondering whether there is any way to do better load balancing, or to dynamically adjust the load on slow tasks?
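While digging I noticed the generate/fetcher properties that control how URLs are grouped into fetch queues; would tuning these help spread slow hosts more evenly across tasks? A sketch of what I mean (values are illustrative guesses, and the property names should be double-checked against nutch-default.xml in 2.0 rc3):

    <!-- cap how many URLs from a single host go into one fetch list,
         so a few slow hosts cannot dominate one reduce task -->
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>
    <property>
      <name>generate.max.count</name>
      <value>1000</value>
    </property>

    <!-- group fetch queues by host; politeness limits apply per queue -->
    <property>
      <name>fetcher.queue.mode</name>
      <value>byHost</value>
    </property>
    <property>
      <name>fetcher.threads.per.queue</name>
      <value>1</value>
    </property>

    <!-- optionally cap fetch time so one straggling queue cannot stall the whole job -->
    <property>
      <name>fetcher.timelimit.mins</name>
      <value>180</value>
    </property>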
Thanks,
Tianwei
