Hi, Ferdy,

Got it, thanks a lot for your reply. I will try those options.

For now, I just manually monitored my job for a while, collected the slow
domains and added them to the URL filter files, then simply killed and
restarted the job.
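
For example, the entries I add to conf/regex-urlfilter.txt look roughly
like this (the hostnames below are placeholders, not the real slow domains):

  # skip domains that were fetching too slowly
  -^https?://([a-z0-9-]*\.)*slow-site-1\.example\.com/
  -^https?://([a-z0-9-]*\.)*slow-site-2\.example\.org/

A leading "-" excludes any URL matching the Java regex on the rest of the
line.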

I will try your recommended options and also see if I have any better
ideas for improving the load balancing ;-)
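
For the record, here is a minimal sketch of what I plan to try in
conf/nutch-site.xml; the values are placeholders I still need to tune, and
the exact names under fetcher.throughput.threshold.* may differ between
versions:

  <!-- abort the fetch after a wall-clock time limit, in minutes -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>60</value>
  </property>

  <!-- abort when throughput drops below this many pages per second -->
  <property>
    <name>fetcher.throughput.threshold.pages</name>
    <value>1</value>
  </property>

As I understand it, when the time limit is hit the fetcher drops its
remaining queues, so a slow reduce task can finish instead of holding up
the whole job.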

Thanks again.

Tianwei
On Mon, Jul 9, 2012 at 6:43 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> There are options to abort the fetcher on certain conditions, for
> example fetcher.timelimit.mins for a time limit
> or fetcher.throughput.threshold.* for throughput. The
> fetcher.max.exceptions.per.queue option seems to be broken for Nutch 2.
> AFAIK there is no current work in progress with regard to dynamic balancing
> of queues or anything like that. Please search the issue tracker for
> related issues. If you have any ideas to improve the fetch behaviour, feel
> free to share them.
>
> Ferdy
>
> On Sat, Jul 7, 2012 at 10:43 PM, Tianwei <[email protected]> wrote:
>
> > Hi, all,
> >
> > I successfully built and ran a Hadoop job based on Nutch 2.0 rc3. I
> > have a very large seed list (around 100K URLs). I set the depth to 4;
> > after two iterations, I found that one reduce task in the fetch phase
> > is always very slow, about a 10X slowdown. As a result, even though the
> > other 11 tasks (I configured the job to use 12 reduce tasks) have
> > already finished, the whole job can't advance to the next "parse" phase
> > and on to the next iteration.
> >
> > I diagnosed this problem a bit; the main issue may be that the task is
> > fetching pages at a very slow speed, as its status shows:
> > "
> > 10/10 spinwaiting/active, 2290 pages, 31 errors, 0.5 0.4 pages/s, 101
> > 71 kb/s, 500 URLs in 1 queues > reduce
> > "
> >
> > I guess the slowest task is fetching URLs from slow remote
> > websites; is that right?
> >
> >
> > Since the performance of a MapReduce job is determined by its slowest
> > task, I guess the assignment is hard to change once the "fetch" job's
> > map tasks have finished. I am wondering if there is any way to do
> > better load balancing or to dynamically adjust the load on slow tasks?
> >
> >
> > Thanks
> >
> > Tianwei
> >
>
