You need to run Nutch as a MapReduce job/application on Hadoop. There is a lot of info on the wiki about making it run in fully distributed mode, but if you can live with pseudo-distributed/local mode for the 20K pages that you need to fetch, it will save you a lot of work.
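As for the question below about configuring the number of map and reduce tasks: a minimal sketch, assuming a Nutch 1.x / Hadoop 1.x setup (check the parameter names against your Hadoop version), would be to set the classic Hadoop job properties, e.g. in nutch-site.xml:

<!-- Hypothetical nutch-site.xml fragment; values are illustrative.
     Note mapred.map.tasks is only a hint to the framework - the actual
     number of map tasks is driven by the input splits - while
     mapred.reduce.tasks is honored as given. -->
<property>
  <name>mapred.map.tasks</name>
  <value>12</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>

More tasks only help up to the parallelism your cluster can actually provide; beyond that they add scheduling overhead. Also keep in mind that the fetch phase is usually bound by per-host politeness delays rather than by task count.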
On Thu, Jan 1, 2015 at 8:32 AM, Chaushu, Shani <[email protected]> wrote:
> How can I configure the number of map and reduce tasks? Which parameter
> is it? Will more map and reduce tasks make it slower or faster?
>
> Thanks
>
> -----Original Message-----
> From: Meraj A. Khan [mailto:[email protected]]
> Sent: Thursday, January 01, 2015 15:17
> To: [email protected]
> Subject: Re: Nutch running time
>
> It seems kind of slow for 20k links. How many map and reduce tasks have
> you configured for each of the phases in a Nutch crawl?
> On Jan 1, 2015 6:00 AM, "Chaushu, Shani" <[email protected]> wrote:
>
> > Hi all,
> > I wanted to know how long Nutch should run.
> > I changed the configuration and ran distributed - one master node and
> > 3 slaves - and it ran for about a day for 20k links (depth 15).
> > Is that normal, or should it take less?
> > This is my configuration:
> >
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>true</value>
> >   <description>If true, outlinks leading from a page to external hosts
> >   will be ignored. This is an effective way to limit the crawl to
> >   include only initially injected hosts, without creating complex
> >   URLFilters.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>db.max.outlinks.per.page</name>
> >   <value>1000</value>
> >   <description>The maximum number of outlinks that we'll process for a
> >   page. If this value is nonnegative (>=0), at most
> >   db.max.outlinks.per.page outlinks will be processed for a page;
> >   otherwise, all outlinks will be processed.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.threads.fetch</name>
> >   <value>100</value>
> >   <description>The number of FetcherThreads the fetcher should use.
> >   This also determines the maximum number of requests that are made at
> >   once (each FetcherThread handles one connection). The total number
> >   of threads running in distributed mode will be the number of fetcher
> >   threads * number of nodes, as the fetcher has one map task per node.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.queue.depth.multiplier</name>
> >   <value>150</value>
> >   <description>(EXPERT) The fetcher buffers the incoming URLs into
> >   queues based on the [host|domain|IP] (see param fetcher.queue.mode).
> >   The depth of the queue is the number of threads times the value of
> >   this parameter. A large value requires more memory but can improve
> >   the performance of the fetch when the order of the URLs in the fetch
> >   list is not optimal.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.threads.per.queue</name>
> >   <value>10</value>
> >   <description>This number is the maximum number of threads that
> >   should be allowed to access a queue at one time. Setting it to a
> >   value > 1 will cause the Crawl-Delay value from robots.txt to be
> >   ignored and the value of fetcher.server.min.delay to be used as a
> >   delay between successive requests to the same server instead of
> >   fetcher.server.delay.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.server.min.delay</name>
> >   <value>0.0</value>
> >   <description>The minimum number of seconds the fetcher will delay
> >   between successive requests to the same server. This value is
> >   applicable ONLY if fetcher.threads.per.queue is greater than 1
> >   (i.e. the host blocking is turned off).
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.max.crawl.delay</name>
> >   <value>5</value>
> >   <description>If the Crawl-Delay in robots.txt is set to greater than
> >   this value (in seconds) then the fetcher will skip this page,
> >   generating an error report. If set to -1 the fetcher will never skip
> >   such pages and will wait the amount of time retrieved from
> >   robots.txt Crawl-Delay, however long that might be.
> >   </description>
> > </property>
> >
> > ---------------------------------------------------------------------
> > Intel Electronics Ltd.
> >
> > This e-mail and any attachments may contain confidential material for
> > the sole use of the intended recipient(s). Any review or distribution
> > by others is strictly prohibited. If you are not the intended
> > recipient, please contact the sender and delete all copies.

