Hi,

There are options to abort the fetcher under certain conditions, for example fetcher.timelimit.mins for a time limit or fetcher.throughput.threshold.* for throughput. The fetcher.max.exceptions.per.queue option seems to be broken for Nutch 2. AFAIK there is no work currently in progress on dynamic balancing of queues or anything like that. Please search the issue tracker for related issues. If you have ideas to improve the fetch behaviour, feel free to share them.
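For what it's worth, something along these lines in conf/nutch-site.xml should enable the time and throughput limits. The values below are only illustrative, not recommendations; check conf/nutch-default.xml in your version for the exact property names and defaults.

<!-- Illustrative nutch-site.xml snippet; values are placeholders. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>120</value>
  <description>Abort the fetch after 120 minutes; -1 disables the limit.</description>
</property>
<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
  <description>Abort the fetch when throughput drops below 1 page/second; -1 disables the check.</description>
</property>
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>20</value>
  <description>Drop a queue after 20 exceptions per queue (as noted above, this does not appear to work in Nutch 2).</description>
</property>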
Ferdy

On Sat, Jul 7, 2012 at 10:43 PM, Tianwei <[email protected]> wrote:
> Hi, all,
>
> I successfully built and ran a Hadoop job based on Nutch 2.0 rc3. I
> have a very large seed list (around 100K). I set the depth to 4; after
> two iterations, I found one reduce task in the fetch phase is always
> very slow, about a 10x slowdown. As a result, even though the other 11
> tasks (I configured 12 reduce tasks) have already finished, the
> whole job can't advance to the next "parse" phase and on to the
> next iteration.
>
> I diagnosed this problem a bit; the main issue may be that the task is
> fetching pages at a very slow speed, as in:
> "
> 10/10 spinwaiting/active, 2290 pages, 31 errors, 0.5 0.4 pages/s, 101
> 71 kb/s, 500 URLs in 1 queues
> reduce
> "
>
> I guess the slowest task is fetching URLs from slow remote
> websites, is that true?
>
> Since the performance of a MapReduce job is determined by its slowest
> task, I guess it's hard to change anything once the "fetch" map tasks
> have finished. I am wondering if there is any way to do better load
> balancing or to dynamically adjust the load on slow tasks?
>
> Thanks
>
> Tianwei

