Am 23.08.2012 20:36, schrieb [email protected]:
> Hello,
>
> I am using nutch-2.0 with hbase-0.92.1. I noticed that in depths 1, 2, 3
> the fetcher was fetching around 20K urls per hour. In depth 4 it fetches
> only 8K urls per hour. Any ideas what could cause this decrease in speed?
> I use local mode with 10 threads.
>
> Thanks.
> Alex.
I once observed the same behaviour with version 1.4. The reason in my case
was that by default 'partition.url.mode' was set to 'byHost', which is a
reasonable setting: the URL subsets handed to the fetcher threads in the
different map steps should be disjoint, so that URLs are not fetched twice
from different machines.

Secondly, the default for 'generate.max.count' was -1 (unlimited). This
means that the more URLs you collect, especially from the same host, the
more URLs of that host end up in the same fetcher map task. Because there
is also a politeness setting (please keep it enabled!) that waits e.g. 30
secs between calls to the same server, every map task containing many URLs
for the same server slows down. The reduce step only starts once all
fetcher maps are done, so the slowest map task becomes the bottleneck of
the whole processing step.

The following settings solved my problem.

Map tasks should be split by host:

<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>Determines how to partition URLs. Default value is 'byHost',
  also takes 'byDomain' or 'byIP'.
  </description>
</property>

Don't put more than 10000 entries into a single fetch list:

<property>
  <name>generate.max.count</name>
  <value>10000</value>
  <description>The maximum number of urls in a single fetchlist. -1 if
  unlimited. The urls are counted according to the value of the parameter
  generator.count.mode.
  </description>
</property>

Maximum wait time between two fetches to the same server:

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>10</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>

Cheers,
Walter

--
--------------------------------
Walter Tietze
Senior Softwareengineer Research

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T +49.30 24627 318
F +49.30 24627 120

[email protected]
http://www.neofonie.de

Handelsregister Berlin-Charlottenburg: HRB 67460
Geschäftsführung: Thomas Kitlitschko
--------------------------------
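PS: To see why an unbounded per-host fetch list stalls a whole fetch round,
here is a back-of-the-envelope sketch. The URL counts below are made-up
illustrative numbers (not from the mail above); it just multiplies the
per-host crawl delay by the number of URLs a single host contributes to
one fetch list:

```python
# Rough estimate of how long one fetcher map task runs when many URLs
# from a single host share a fetch list, assuming the per-host crawl
# delay dominates the actual download time. Numbers are hypothetical.

def fetch_list_duration_hours(urls_from_host, crawl_delay_secs):
    """Hours a fetcher spends on one host's URLs at a fixed crawl delay."""
    return urls_from_host * crawl_delay_secs / 3600.0

# Unbounded fetch list (generate.max.count = -1): one busy host can
# stall its map task -- and the reduce step waits for the slowest map.
print(fetch_list_duration_hours(50_000, 30))  # ~416.7 hours, i.e. ~17 days

# Capped fetch list (generate.max.count = 10000) with a 10 s delay:
print(fetch_list_duration_hours(10_000, 10))  # ~27.8 hours
```

This is why capping generate.max.count spreads a busy host's URLs over
several fetch rounds instead of letting one map task dominate.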

