Hi,

to ensure politeness, i.e. guaranteed intervals between accesses
to the same host, all URLs of a single host (or optionally IP address)
are placed in one queue, which is processed by a single task.
The longest queue therefore determines the time required to execute one
fetch cycle.
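The per-queue interval is controlled by the fetcher delay settings. A minimal sketch for nutch-site.xml (the values here are illustrative, not recommendations):

```xml
<!-- nutch-site.xml: politeness settings (illustrative values) -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests
  to the same queue (host).</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>With more than one thread per queue the delay
  above is no longer enforced between requests.</description>
</property>
```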

If the crawled URLs are spread over many hosts, they should get evenly
distributed over tasks. If there are only a few hosts with huge differences
in the number of crawled URLs, the only remedy is to cap the length
of any queue via the property "generate.max.count".
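For example, in nutch-site.xml (the value 1000 is only an illustration; pick a cap that matches your cycle length):

```xml
<!-- nutch-site.xml: cap queue length per generated segment -->
<property>
  <name>generate.max.count</name>
  <value>1000</value>
  <description>Maximum number of URLs per queue in one segment;
  -1 (the default) means unlimited.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Count URLs per "host" (or "domain") when applying
  generate.max.count.</description>
</property>
```

With such a cap, the remaining URLs of an oversized host are deferred to later fetch cycles instead of keeping one reduce task busy for hours.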

Cheers,
Sebastian

On 04/01/2015 09:25 AM, Ai Ai wrote:
> 
> Hello,
> I'm trying to optimize Nutch performance for crawling sites. Currently I'm 
> testing performance on a small Hadoop cluster: only two nodes, 32 GB RAM, 
> CPU Intel Xeon E3-1245v2 4c/8t.
> My config for Nutch: http://pastebin.com/bBRHpFuq
> So, the problem: the fetch jobs do not work optimally. Some reduce tasks 
> have 4k pages to fetch, some 1M (1kk) pages. For example, see the screenshot 
> https://docs.google.com/file/d/0B98dgNxOqKMvT1doOVVPUU1PNXM/edit  Some reduce 
> tasks finished in 10 minutes, but one task has been running for 11 hours and 
> is still going, so it's a bottleneck: I have 24 reduce tasks, but only one 
> is working.
> Maybe someone can give useful advice or links where I can read about this 
> problem.
> 
> Big thanks for help
> Sergey
> 
