Am 23.08.2012 20:36, schrieb [email protected]:
> Hello,
>
> I am using nutch-2.0 with hbase-0.92.1. I noticed that in depths 1, 2, 3
> the fetcher was fetching around 20K urls per hour. In depth 4 it fetches
> only 8K urls per hour. Any ideas what could cause this decrease in speed?
> I use local mode with 10 threads.
>
> Thanks.
> Alex.
I once observed the same behaviour with version 1.4. The reason in my case
was that by default 'partition.url.mode' was set to 'byHost', which is a
reasonable setting: the URL subsets handed to the fetcher threads in the
different map steps should be disjoint, so that URLs are not fetched twice
from different machines.

Secondly, the default for 'generate.max.count' was -1 (unlimited). This
means that the more URLs you collect, especially from the same host, the
more URLs of that host end up in the same fetcher map task. Because there
is also a politeness setting (please keep it enabled!) that waits e.g. 30
secs between calls to the same server, every map task containing many URLs
for the same server slows down. The reduce step only starts once all
fetcher maps are done, so the slowest map task becomes the bottleneck of
the whole processing step.

The following settings solved my problem.

Map tasks should be split by host:

<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>Determines how to partition URLs. Default value is 'byHost',
  also takes 'byDomain' or 'byIP'.
  </description>
</property>

Don't put more than 10000 entries into a single fetch list:

<property>
  <name>generate.max.count</name>
  <value>10000</value>
  <description>The maximum number of urls in a single fetchlist. -1 if
  unlimited. The urls are counted according to the value of the parameter
  generator.count.mode.
  </description>
</property>

Maximum wait time between two fetches to the same server:

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>10</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>

Cheers,
Walter

--
--------------------------------
Walter Tietze
Senior Softwareengineer Research

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T +49.30 24627 318
F +49.30 24627 120

[email protected]
http://www.neofonie.de

Handelsregister Berlin-Charlottenburg: HRB 67460
Geschäftsführung: Thomas Kitlitschko
--------------------------------
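PS: To see why an unbounded per-host fetch list stalls a whole fetch round,
here is a back-of-the-envelope sketch. The URL counts below are made-up
illustrative numbers (not from the mail above); it just multiplies the
per-host crawl delay by the number of URLs a single host contributes to
one fetch list:

```python
# Rough estimate of how long one fetcher map task runs when many URLs
# from a single host share a fetch list, assuming the per-host crawl
# delay dominates the actual download time. Numbers are hypothetical.

def fetch_list_duration_hours(urls_from_host, crawl_delay_secs):
    """Hours a fetcher spends on one host's URLs at a fixed crawl delay."""
    return urls_from_host * crawl_delay_secs / 3600.0

# Unbounded fetch list (generate.max.count = -1): one busy host can
# stall its map task -- and the reduce step waits for the slowest map.
print(fetch_list_duration_hours(50_000, 30))  # ~416.7 hours, i.e. ~17 days

# Capped fetch list (generate.max.count = 10000) with a 10 s delay:
print(fetch_list_duration_hours(10_000, 10))  # ~27.8 hours
```

This is why capping generate.max.count spreads a busy host's URLs over
several fetch rounds instead of letting one map task dominate.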

