Hi Ferdy,

Got it, thanks a lot for your reply. I will try those options.
For now I have been manually monitoring the job, collecting the slow domains and adding them to the URL filter files, then simply killing and restarting the job. I will try the options you recommended, and also see whether I can come up with better ideas for improving the load balance ;-) Thanks again.

Tianwei

On Mon, Jul 9, 2012 at 6:43 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> There are options to abort the fetcher on certain conditions, for example
> fetcher.timelimit.mins for a time limit, or fetcher.throughput.threshold.*
> for throughput. The fetcher.max.exceptions.per.queue option seems to be
> broken for Nutch 2. Afaik there is no current work in progress with regard
> to dynamic balancing of queues or anything like that. Please search the
> issue tracker for related issues. If you have ideas to improve the fetch
> behaviour, feel free to share them.
>
> Ferdy
>
> On Sat, Jul 7, 2012 at 10:43 PM, Tianwei <[email protected]> wrote:
>
> > Hi all,
> >
> > I successfully built and ran a Hadoop job based on Nutch 2.0 rc3. I have
> > a very large seed list (around 100K URLs). I set the depth to 4; after
> > two iterations, I found that one reduce task in the fetch phase is
> > always very slow, about a 10x slowdown. As a result, even though the
> > other 11 tasks (I configured 12 reduce tasks) have already finished, the
> > whole job cannot advance to the next "parse" phase and on to the next
> > iteration.
> >
> > I diagnosed this a bit; the major problem may be that the task is
> > fetching pages at a very slow rate, e.g.:
> >
> > 10/10 spinwaiting/active, 2290 pages, 31 errors, 0.5 0.4 pages/s,
> > 101 71 kb/s, 500 URLs in 1 queues
> >
> > My guess is that the slowest task is fetching URLs from slow remote
> > websites. Is that right?
> >
> > Since the performance of a MapReduce job is determined by its slowest
> > task, I guess this is hard to change once the "fetch" map tasks have
> > finished. I am wondering whether there is any way to do better load
> > balancing, or to dynamically adjust the load on slow tasks?
> >
> > Thanks
> >
> > Tianwei
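P.S. The abort options Ferdy mentions are set in conf/nutch-site.xml. A minimal sketch, assuming fetcher.throughput.threshold.pages is one of the concrete properties behind the fetcher.throughput.threshold.* wildcard; the values are illustrative, not recommendations:

```xml
<configuration>
  <property>
    <name>fetcher.timelimit.mins</name>
    <!-- Abort the fetch after this many minutes, even if queues remain. -->
    <value>120</value>
  </property>
  <property>
    <!-- Assumed to be one of the fetcher.throughput.threshold.* options:
         abort when throughput drops below this many pages per second. -->
    <name>fetcher.throughput.threshold.pages</name>
    <value>1</value>
  </property>
</configuration>
```

With something like this in place, a single stalled queue should cause the fetch to abort instead of holding up the whole iteration.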
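P.P.S. For reference, the manual workaround of excluding slow domains via the URL filter files can look like the following in conf/regex-urlfilter.txt; the domain name here is a made-up example, and rules are applied top to bottom, first match wins:

```
# Skip a domain observed to be slow (hypothetical example domain)
-^https?://([a-z0-9-]+\.)*slowhost\.example\.com/

# Accept anything else (keep this as the last rule)
+.
```

This only takes effect on the next generate/fetch cycle, which is why the job still has to be killed and restarted.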

