Hi,

to ensure politeness, i.e. guaranteed intervals between accesses to the same host, all URLs of a single host (or optionally IP address) are placed in one queue, which is processed by a single task. The longest queue therefore determines the time required to execute one fetch cycle.
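To illustrate the point (this is a simplified sketch, not Nutch code; the 5-second delay and helper names are assumptions), per-host queues are served sequentially with a fixed delay, so the longest queue dominates the cycle time:

```python
# Illustrative sketch (not Nutch internals): politeness via one FIFO
# queue per host, each served by a single worker with a fixed delay
# between successive requests to the same host.
from collections import defaultdict
from urllib.parse import urlparse

FETCH_DELAY = 5.0  # assumed seconds between requests to one host


def partition_by_host(urls):
    """Group URLs into per-host queues."""
    queues = defaultdict(list)
    for url in urls:
        queues[urlparse(url).hostname].append(url)
    return queues


def cycle_time(queues, delay=FETCH_DELAY):
    """Hosts can be fetched in parallel, but URLs of one host are
    fetched strictly sequentially, so the longest queue dominates."""
    return max(len(q) for q in queues.values()) * delay


urls = [
    "http://big.example.com/a", "http://big.example.com/b",
    "http://big.example.com/c", "http://small.example.org/x",
]
queues = partition_by_host(urls)
print(cycle_time(queues))  # 3 URLs on big.example.com -> 15.0
```

Even with many idle workers, the task holding the biggest host queue runs for `len(queue) * delay` seconds, which is exactly the skew Sergey observes below.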
If the crawled URLs are spread over many hosts, they should get evenly distributed over tasks. If there are only a few hosts with huge differences in the number of crawled URLs, the only way is to set the maximum length of any queue via the property "generate.max.count".

Cheers,
Sebastian

On 04/01/2015 09:25 AM, Ai Ai wrote:
>
> Hello,
> I'm trying to optimize Nutch performance for crawling sites. At the moment I'm
> testing performance on a small Hadoop cluster: only two nodes, 32 GB RAM,
> CPU Intel Xeon E3 1245v2 4c/8t.
> My config for Nutch: http://pastebin.com/bBRHpFuq
> The problem: the fetching jobs do not run optimally. Some reduce tasks have 4k
> pages to fetch, some 1M pages. For example, see the screenshot
> https://docs.google.com/file/d/0B98dgNxOqKMvT1doOVVPUU1PNXM/edit
> Some reduce tasks finished in 10 minutes, but one task has been running for
> 11 hours and is still going, so it is like a bottleneck: I have 24 reduce
> tasks, but only one is doing any work.
> Maybe someone can give useful advice or links where I can read about this
> problem.
>
> Big thanks for the help
> Sergey
>
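For reference, a minimal nutch-site.xml fragment that caps the queue length per host. The property names come from the Nutch default configuration; the values here are only illustrative and need tuning for the crawl at hand:

```xml
<!-- Cap the number of URLs per queue in one fetchlist.
     Property names are from nutch-default.xml; values are examples. -->
<property>
  <name>generate.max.count</name>
  <value>10000</value>
  <description>Maximum number of URLs per queue in a single
  fetchlist; -1 means no limit.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Whether URLs are counted per host or per domain
  when applying generate.max.count.</description>
</property>
```

With a cap in place, an oversized host is spread across several fetch cycles instead of stalling one reduce task for hours.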

