Hi,

@Alxsss I hope Walter's suggestions help you out here.
@Walter I've added your model answer to the wiki [0]. It's a great response and I just couldn't help but add it.

Thank you
Lewis

[0] http://wiki.apache.org/nutch/FAQ#Speed_of_Fetching_seems_to_decrease_between_crawl_iterations..._what.27s_wrong.3F

On Thu, Aug 23, 2012 at 8:16 PM, Walter Tietze <[email protected]> wrote:
> On 23.08.2012 20:36, [email protected] wrote:
>> Hello,
>>
>> I am using nutch-2.0 with hbase-0.92.1. I noticed that, in depth 1, 2, 3
>> fetcher was fetching around 20K urls per hour. In depth 4 it fetches only 8K
>> urls per hour.
>> Any ideas what could cause this decrease in speed? I use local mode with 10
>> threads.
>>
>> Thanks.
>> Alex.
>>
>
> I once noticed the same behaviour with version 1.4.
>
> The reason in my case was that by default 'partition.url.mode'
> was set to 'byHost', which is a reasonable setting, because the
> url subsets handed to the fetcher threads in different map steps
> should be disjoint, so that urls are not loaded twice from
> different machines.
>
> Secondly, the default setting for 'generate.max.count' was -1.
>
> This means the more urls you collect, especially from the same host,
> the more urls of the same host will be in the same fetcher map job!
>
> Because there is also a policy setting (please do this at home!!) to
> wait for a delay of 30 secs. between calls to the same server, all maps
> which contain urls for the same server slow down.
>
> The resulting reduce step will only be done when all fetcher maps are
> done, which is a bottleneck in the overall processing step.
>
> The following settings solved my problems:
>
> Map tasks should be split according to the host:
>
> <property>
>   <name>partition.url.mode</name>
>   <value>byHost</value>
>   <description>Determines how to partition URLs. Default value is
>   'byHost', also takes 'byDomain' or 'byIP'.
>   </description>
> </property>
>
> Don't insert in a single fetch list more than 10000 entries!
>
> <property>
>   <name>generate.max.count</name>
>   <value>10000</value>
>   <description>The maximum number of urls in a single
>   fetchlist. -1 if unlimited. The urls are counted according
>   to the value of the parameter generator.count.mode.
>   </description>
> </property>
>
> Wait time between two fetches to the same server.
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>10</value>
>   <description>
>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>   seconds) then the fetcher will skip this page, generating an error report.
>   If set to -1 the fetcher will never skip such pages and will wait the
>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>   might be.
>   </description>
> </property>
>
> Cheers, Walter
>
> --
>
> --------------------------------
> Walter Tietze
> Senior Softwareengineer
> Research
>
> Neofonie GmbH
> Robert-Koch-Platz 4
> 10115 Berlin
>
> T +49.30 24627 318
> F +49.30 24627 120
>
> [email protected]
> http://www.neofonie.de
>
> Handelsregister
> Berlin-Charlottenburg: HRB 67460
>
> Geschäftsführung:
> Thomas Kitlitschko
> --------------------------------
>

--
Lewis
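
PS: To make Walter's explanation concrete, here is a rough back-of-the-envelope sketch in Python. It is not Nutch code. The function name, the example hosts, and the 5 second delay are all made up for illustration. It models only two things: urls of the same host are fetched one after another with a fixed delay, and a round finishes only when the host with the longest queue is done. Thread counts and download times are ignored.

```python
from collections import Counter
from urllib.parse import urlparse


def estimate_fetch_seconds(urls, delay_seconds, max_per_host=-1):
    """Toy model: hosts are fetched in parallel, urls of one host serially.

    max_per_host mimics the idea behind generate.max.count (-1 = unlimited);
    it is a hypothetical parameter, not the real Nutch implementation.
    Returns (urls_fetched, seconds_until_slowest_host_finishes).
    """
    # Count how many urls each host contributes to the fetch list.
    per_host = Counter(urlparse(u).netloc for u in urls)
    if max_per_host >= 0:
        # Cap every host's share of the fetch list.
        per_host = Counter({h: min(n, max_per_host) for h, n in per_host.items()})
    fetched = sum(per_host.values())
    # The round ends only when the host with the longest queue is done.
    slowest = max(per_host.values(), default=0) * delay_seconds
    return fetched, slowest


if __name__ == "__main__":
    # Hypothetical fetch list: one dominant host plus many small ones.
    urls = [f"http://big.example/p{i}" for i in range(5000)]
    urls += [f"http://site{i}.example/" for i in range(1000)]

    for cap in (-1, 100):
        fetched, seconds = estimate_fetch_seconds(urls, delay_seconds=5, max_per_host=cap)
        rate = fetched / seconds * 3600
        print(f"cap={cap}: {fetched} urls in {seconds}s (~{rate:.0f} urls/hour)")
```

In this toy model the uncapped list is dominated by the single large host, so the hourly rate collapses even though most hosts finish almost immediately. Capping the per-host count shortens each round and spreads the large host over several iterations, which is the effect Walter describes.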

