Shani,

What is your Nutch version, and which Hadoop version are you using? I was able to get this running using Nutch 1.7 on Hadoop YARN, for which I needed to make minor tweaks in the code.
On Fri, Jan 2, 2015 at 12:37 PM, Chaushu, Shani <[email protected]> wrote:

> I'm running Nutch distributed, on 3 nodes... I thought there was more
> configuration that I missed.
>
> -----Original Message-----
> From: S.L [mailto:[email protected]]
> Sent: Thursday, January 01, 2015 18:28
> To: [email protected]
> Subject: Re: Nutch running time
>
> You need to run Nutch as a MapReduce job/application on Hadoop. There is a
> lot of info on the wiki on making it run in distributed mode, but if you
> can live with the pseudo-distributed/local mode for the 20K pages that you
> need to fetch, it would save you a lot of work.
>
> On Thu, Jan 1, 2015 at 8:32 AM, Chaushu, Shani <[email protected]> wrote:
>
>> How can I configure the number of map and reduce tasks? Which parameter
>> is it? Will more map and reduce tasks make it slower or faster?
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Meraj A. Khan [mailto:[email protected]]
>> Sent: Thursday, January 01, 2015 15:17
>> To: [email protected]
>> Subject: Re: Nutch running time
>>
>> It seems kind of slow for 20k links. How many map and reduce tasks have
>> you configured for each one of the phases in a Nutch crawl?
>>
>> On Jan 1, 2015 6:00 AM, "Chaushu, Shani" <[email protected]> wrote:
>>
>> > Hi all,
>> > I wanted to know how long Nutch should run.
>> > I changed the configuration and ran distributed - one master node and
>> > 3 slaves - and it ran for about a day on 20k links (depth 15).
>> > Is that normal? Or should it take less?
>> > This is my configuration:
>> >
>> > <property>
>> >   <name>db.ignore.external.links</name>
>> >   <value>true</value>
>> >   <description>If true, outlinks leading from a page to external hosts
>> >   will be ignored. This is an effective way to limit the crawl to
>> >   include only initially injected hosts, without creating complex
>> >   URLFilters.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>db.max.outlinks.per.page</name>
>> >   <value>1000</value>
>> >   <description>The maximum number of outlinks that we'll process for a
>> >   page. If this value is nonnegative (>=0), at most
>> >   db.max.outlinks.per.page outlinks will be processed for a page;
>> >   otherwise, all outlinks will be processed.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.threads.fetch</name>
>> >   <value>100</value>
>> >   <description>The number of FetcherThreads the fetcher should use.
>> >   This also determines the maximum number of requests that are made at
>> >   once (each FetcherThread handles one connection). The total number of
>> >   threads running in distributed mode will be the number of fetcher
>> >   threads * number of nodes, as the fetcher has one map task per node.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.queue.depth.multiplier</name>
>> >   <value>150</value>
>> >   <description>(EXPERT) The fetcher buffers the incoming URLs into
>> >   queues based on the [host|domain|IP] (see param fetcher.queue.mode).
>> >   The depth of the queue is the number of threads times the value of
>> >   this parameter. A large value requires more memory but can improve
>> >   the performance of the fetch when the order of the URLs in the fetch
>> >   list is not optimal.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.threads.per.queue</name>
>> >   <value>10</value>
>> >   <description>This number is the maximum number of threads that should
>> >   be allowed to access a queue at one time. Setting it to a value > 1
>> >   will cause the Crawl-Delay value from robots.txt to be ignored and
>> >   the value of fetcher.server.min.delay to be used as a delay between
>> >   successive requests to the same server instead of
>> >   fetcher.server.delay.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.server.min.delay</name>
>> >   <value>0.0</value>
>> >   <description>The minimum number of seconds the fetcher will delay
>> >   between successive requests to the same server. This value is
>> >   applicable ONLY if fetcher.threads.per.queue is greater than 1
>> >   (i.e. the host blocking is turned off).
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.max.crawl.delay</name>
>> >   <value>5</value>
>> >   <description>
>> >   If the Crawl-Delay in robots.txt is set to greater than this value
>> >   (in seconds) then the fetcher will skip this page, generating an
>> >   error report. If set to -1 the fetcher will never skip such pages and
>> >   will wait the amount of time retrieved from robots.txt Crawl-Delay,
>> >   however long that might be.
>> >   </description>
>> > </property>
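For reference on the "which parameter is it?" question: the per-job map and reduce task counts are Hadoop settings rather than Nutch ones. On Hadoop 1.x they are usually set with mapred.map.tasks and mapred.reduce.tasks; on Hadoop 2 / YARN the equivalent names are mapreduce.job.maps and mapreduce.job.reduces. The following is a minimal, hypothetical sketch in the same nutch-site.xml style as the properties quoted above; the values are illustrative only, not recommendations from this thread:

<!-- Illustrative sketch (not from the thread above): requesting map/reduce
     task counts for Nutch's MapReduce jobs on Hadoop 1.x. Note that
     mapred.map.tasks is only a hint to the framework; the actual number of
     map tasks also depends on how the input is split. -->
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
  <description>Hint for the number of map tasks per job
  (mapreduce.job.maps on Hadoop 2 / YARN).
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>Number of reduce tasks per job
  (mapreduce.job.reduces on Hadoop 2 / YARN).
  </description>
</property>

These properties can also be overridden per job with -D on the command line. Note that in Nutch 1.x the number of fetcher map tasks is in practice governed by how many fetch lists the generate step produces (the generator's -numFetchers option), so raising the reduce count alone will not necessarily speed up the fetch phase.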

