That seems kind of slow for 20k links. How many map and reduce tasks have you configured for each of the phases in the Nutch crawl?

On Jan 1, 2015 6:00 AM, "Chaushu, Shani" <[email protected]> wrote:
> Hi all,
> I wanted to know how long Nutch should run.
> I changed the configuration and ran it distributed - one master node and 3
> slaves - and it took about a day for 20k links (depth 15).
> Is that normal, or should it take less?
> This is my configuration:
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>1000</value>
>   <description>The maximum number of outlinks that we'll process for a
>   page. If this value is nonnegative (>=0), at most
>   db.max.outlinks.per.page outlinks will be processed for a page;
>   otherwise, all outlinks will be processed.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>100</value>
>   <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are made at
>   once (each FetcherThread handles one connection). The total number of
>   threads running in distributed mode will be the number of fetcher
>   threads * number of nodes, as the fetcher has one map task per node.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.queue.depth.multiplier</name>
>   <value>150</value>
>   <description>(EXPERT) The fetcher buffers the incoming URLs into
>   queues based on the [host|domain|IP] (see param fetcher.queue.mode).
>   The depth of the queue is the number of threads times the value of
>   this parameter. A large value requires more memory but can improve
>   the performance of the fetch when the order of the URLs in the fetch
>   list is not optimal.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.threads.per.queue</name>
>   <value>10</value>
>   <description>This number is the maximum number of threads that should
>   be allowed to access a queue at one time. Setting it to a value > 1
>   will cause the Crawl-Delay value from robots.txt to be ignored and
>   the value of fetcher.server.min.delay to be used as a delay between
>   successive requests to the same server instead of
>   fetcher.server.delay.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>0.0</value>
>   <description>The minimum number of seconds the fetcher will delay
>   between successive requests to the same server. This value is
>   applicable ONLY if fetcher.threads.per.queue is greater than 1
>   (i.e. the host blocking is turned off).
>   </description>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>5</value>
>   <description>If the Crawl-Delay in robots.txt is set to greater than
>   this value (in seconds) then the fetcher will skip this page,
>   generating an error report. If set to -1 the fetcher will never skip
>   such pages and will wait the amount of time retrieved from the
>   robots.txt Crawl-Delay, however long that might be.
>   </description>
> </property>
>
> ---------------------------------------------------------------------
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
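For reference, the per-phase map/reduce parallelism asked about in the reply is a Hadoop setting rather than one of the Nutch properties quoted above. A minimal sketch of what such a fragment might look like, assuming the Hadoop 1.x property names (`mapred.map.tasks`, `mapred.reduce.tasks`) that were common with Nutch 1.x at the time; the concrete values here are illustrative, not taken from the thread:

```xml
<!-- Hypothetical mapred-site.xml fragment (illustration only).
     With one master and 3 slaves, giving each slave at least one
     map and one reduce task keeps all nodes busy in the generate,
     parse, and updatedb phases. -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>3</value>
    <description>Hint for the number of map tasks per job; the
    fetch phase itself runs one map task per fetch list.
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
    <description>Number of reduce tasks per job.</description>
  </property>
</configuration>
```

If these are left at their defaults, several of the crawl phases can end up running on a single node, which would be consistent with a day-long crawl for only 20k links.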

