Re: speed of fetcher in nutch-2.0

Lewis John Mcgibbney Thu, 23 Aug 2012 12:26:30 -0700

Hi,

@Alxsss I hope Walter's suggestions help you out here.


@Walter I've added your model answer to the wiki [0]. This is a great
response and I just couldn't help but add it. Thank you.

Lewis

[0] http://wiki.apache.org/nutch/FAQ#Speed_of_Fetching_seems_to_decrease_between_crawl_iterations..._what.27s_wrong.3F

On Thu, Aug 23, 2012 at 8:16 PM, Walter Tietze <[email protected]> wrote:
> On 23.08.2012 20:36, [email protected] wrote:
>> Hello,
>>
>> I am using nutch-2.0 with hbase-0.92.1. I noticed that at depths 1, 2 and 3
>> the fetcher was fetching around 20K urls per hour. At depth 4 it fetches
>> only 8K urls per hour.
>> Any ideas what could cause this decrease in speed? I use local mode with 10
>> threads.
>>
>> Thanks.
>> Alex.
>>
>>
>>
>>
>
>
> I once noticed the same behaviour with version 1.4.
>
>
> The reason in my case was that 'partition.url.mode' was set to 'byHost'
> by default. This is a reasonable setting: the url subsets handed to the
> fetcher threads in different map steps should be disjoint, so that urls
> are not loaded twice from different machines.
>
>
>
> Secondly, 'generate.max.count' was set to -1 (unlimited) by default.
>
> This means that the more urls you collect, especially from the same host,
> the more urls of that host end up in the same fetcher map job!
>
>
> Because there is also a policy setting (please do this at home!!) to
> wait for a delay of 30 secs. between calls to the same server, all maps
> which contain urls of the same server slow down.
>
>
> The subsequent reduce step only starts when all fetcher maps are
> done, so the slowest map becomes the bottleneck of the whole processing step.
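
To put rough numbers on the effect described above, here is a minimal Python sketch. It is a toy model, not Nutch code: it assumes that requests to one host are strictly serialized with a fixed delay, that there are enough threads for all other hosts to be fetched in parallel, and that 'generate.max.count' acts as a cap per host (host-based counting). The function name, the example hosts and the cap of 100 are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def min_fetch_seconds(urls, delay_s, max_per_host=-1):
    """Approximate lower bound on the time needed to fetch `urls` when two
    requests to the same host must be `delay_s` seconds apart.

    max_per_host mimics a per-host cap on the fetch list; -1 means unlimited.
    """
    # Count how many urls each host contributes to the fetch list.
    per_host = Counter(urlparse(u).hostname for u in urls)
    if max_per_host >= 0:
        # Urls above the cap are left out of this fetch list.
        per_host = {h: min(n, max_per_host) for h, n in per_host.items()}
    # The host with the most urls dictates the running time.
    return max(per_host.values(), default=0) * delay_s


# Hypothetical fetch list: one large host plus 50 small ones.
urls = [f"http://big.example/{i}" for i in range(1000)]
urls += [f"http://small{i}.example/" for i in range(50)]

print(min_fetch_seconds(urls, 30))       # 30000 (unlimited: 1000 urls * 30 s)
print(min_fetch_seconds(urls, 30, 100))  # 3000  (capped:     100 urls * 30 s)
```

Under these assumptions the map holding the largest host determines the running time, so the more urls of one host accumulate with growing depth, the longer every iteration takes.
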
>
>
>
>
> The following settings solved my problems:
>
>
> Map tasks should be split according to the host:
>
> <property>
>   <name>partition.url.mode</name>
>   <value>byHost</value>
>   <description>Determines how to partition URLs. Default value is
> 'byHost',  also takes 'byDomain' or 'byIP'.
>   </description>
> </property>
>
>
> Don't put more than 10000 entries into a single fetch list!
>
> <property>
>   <name>generate.max.count</name>
>   <value>10000</value>
>   <description>The maximum number of urls in a single
>   fetchlist.  -1 if unlimited. The urls are counted according
>   to the value of the parameter generator.count.mode.
>   </description>
> </property>
>
>
> Maximum wait time between two fetches to the same server:
>
> <property>
>  <name>fetcher.max.crawl.delay</name>
>  <value>10</value>
>  <description>
>  If the Crawl-Delay in robots.txt is set to greater than this value (in
>  seconds) then the fetcher will skip this page, generating an error report.
>  If set to -1 the fetcher will never skip such pages and will wait the
>  amount of time retrieved from robots.txt Crawl-Delay, however long that
>  might be.
>  </description>
> </property>
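
A quick way to check that the three overrides are actually picked up is to read them back from the configuration file. The following Python sketch assumes the usual Hadoop-style layout (a <configuration> root with <property> children, as in nutch-site.xml). The SAMPLE string and the helper name read_properties are only illustrative; in practice you would pass in the contents of your own file.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of a nutch-site.xml (assumed layout).
SAMPLE = """<configuration>
  <property><name>partition.url.mode</name><value>byHost</value></property>
  <property><name>generate.max.count</name><value>10000</value></property>
  <property><name>fetcher.max.crawl.delay</name><value>10</value></property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict for all <property> elements."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}


props = read_properties(SAMPLE)
for name in ("partition.url.mode",
             "generate.max.count",
             "fetcher.max.crawl.delay"):
    # A missing key means the override is absent from this file.
    print(name, "=", props.get(name, "(not set in this file)"))
```
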
>
>
>
> Cheers, Walter
>
>
>
> --
>
> --------------------------------
> Walter Tietze
> Senior Softwareengineer
> Research
>
> Neofonie GmbH
> Robert-Koch-Platz 4
> 10115 Berlin
>
> T +49.30 24627 318
> F +49.30 24627 120
>
> [email protected]
> http://www.neofonie.de
>
> Handelsregister
> Berlin-Charlottenburg: HRB 67460
>
> Geschäftsführung:
> Thomas Kitlitschko
> --------------------------------
>



-- 
Lewis
