Hi, all,

I successfully built and ran a Hadoop job based on Nutch 2.0 rc3. I
have a very large seed list (around 100K URLs) and set the crawl depth
to 4. After two iterations, I found that one reduce task in the fetch
phase is always very slow, roughly 10x slower than the others. As a
result, even though the other 11 tasks (I configured 12 reduce tasks)
have already finished, the whole job cannot advance to the next
"parse" phase and on to the next iteration.

I looked into this a bit; the main problem seems to be that the
straggler task is fetching pages at a very low rate, e.g.:
"
10/10 spinwaiting/active, 2290 pages, 31 errors, 0.5 0.4 pages/s, 101
71 kb/s, 500 URLs in 1 queues > reduce
"

I guess the slowest task is stuck fetching URLs from a few slow remote
websites; is that right?
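
If so, one thing I'm considering is capping how many URLs per host the
generator puts into a single fetch list, so that one slow site cannot
dominate a queue. Assuming the 1.x-style generate.count.mode /
generate.max.count properties are still honored in 2.0 rc3, it would
look roughly like this in nutch-site.xml (the value 50 is just a
guess on my side):

  <property>
    <name>generate.count.mode</name>
    <value>host</value>
    <!-- count generated URLs per host when applying the cap below -->
  </property>
  <property>
    <name>generate.max.count</name>
    <value>50</value>
    <!-- max URLs per host in one fetch list; -1 means unlimited -->
  </property>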


Since the performance of a MapReduce job is determined by its slowest
task, I guess there is little I can change once the map tasks of the
"fetch" job have finished. I am wondering whether there is any way to
do better load balancing, or to dynamically adjust the load on slow
tasks?
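
For example (just a sketch of what I mean, and assuming
fetcher.timelimit.mins from 1.x is also honored by the 2.0 fetcher
reducer), I could put a hard time limit on the fetch phase so a
straggler task gives up on its remaining slow URLs instead of blocking
the whole job:

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>60</value>
    <!-- stop fetching after 60 minutes; URLs still queued in this
         round are left for a later cycle -->
  </property>

But that feels more like a workaround than real load balancing, so any
pointers to a better approach would be much appreciated.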


Thanks

Tianwei
