You need to run Nutch as a MapReduce job/application on Hadoop. There is a lot of info on the wiki about making it run in fully distributed mode, but if you can live with pseudo-distributed/local mode for the 20K pages that you need to fetch, it will save you a lot of work.
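As for the question below about configuring the number of map and reduce tasks: a minimal sketch, assuming a Nutch 1.x / Hadoop 1.x setup (check the parameter names against your Hadoop version), would be to set the classic Hadoop job properties, e.g. in nutch-site.xml:

<!-- Hypothetical nutch-site.xml fragment; values are illustrative.
     Note mapred.map.tasks is only a hint to the framework - the actual
     number of map tasks is driven by the input splits - while
     mapred.reduce.tasks is honored as given. -->
<property>
  <name>mapred.map.tasks</name>
  <value>12</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>

More tasks only help up to the parallelism your cluster can actually provide; beyond that they add scheduling overhead. Also keep in mind that the fetch phase is usually bound by per-host politeness delays rather than by task count.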
On Thu, Jan 1, 2015 at 8:32 AM, Chaushu, Shani <[email protected]> wrote:
> How can I configure the number of map and reduce tasks? Which parameter
> is it? Will more map and reduce tasks make it slower or faster?
>
> Thanks
>
> -----Original Message-----
> From: Meraj A. Khan [mailto:[email protected]]
> Sent: Thursday, January 01, 2015 15:17
> To: [email protected]
> Subject: Re: Nutch running time
>
> It seems kind of slow for 20k links. How many map and reduce tasks have
> you configured for each of the phases in a Nutch crawl?
> On Jan 1, 2015 6:00 AM, "Chaushu, Shani" <[email protected]> wrote:
>
> > Hi all,
> > I wanted to know how long Nutch should run.
> > I changed the configuration and ran distributed - one master node and
> > 3 slaves - and it ran for about a day for 20k links (depth 15).
> > Is that normal, or should it take less?
> > This is my configuration:
> >
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>true</value>
> >   <description>If true, outlinks leading from a page to external hosts
> >   will be ignored. This is an effective way to limit the crawl to
> >   include only initially injected hosts, without creating complex
> >   URLFilters.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>db.max.outlinks.per.page</name>
> >   <value>1000</value>
> >   <description>The maximum number of outlinks that we'll process for a
> >   page. If this value is nonnegative (>=0), at most
> >   db.max.outlinks.per.page outlinks will be processed for a page;
> >   otherwise, all outlinks will be processed.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.threads.fetch</name>
> >   <value>100</value>
> >   <description>The number of FetcherThreads the fetcher should use.
> >   This also determines the maximum number of requests that are made at
> >   once (each FetcherThread handles one connection). The total number
> >   of threads running in distributed mode will be the number of fetcher
> >   threads * number of nodes, as the fetcher has one map task per node.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.queue.depth.multiplier</name>
> >   <value>150</value>
> >   <description>(EXPERT) The fetcher buffers the incoming URLs into
> >   queues based on the [host|domain|IP] (see param fetcher.queue.mode).
> >   The depth of the queue is the number of threads times the value of
> >   this parameter. A large value requires more memory but can improve
> >   the performance of the fetch when the order of the URLs in the fetch
> >   list is not optimal.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.threads.per.queue</name>
> >   <value>10</value>
> >   <description>This number is the maximum number of threads that
> >   should be allowed to access a queue at one time. Setting it to a
> >   value > 1 will cause the Crawl-Delay value from robots.txt to be
> >   ignored and the value of fetcher.server.min.delay to be used as a
> >   delay between successive requests to the same server instead of
> >   fetcher.server.delay.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.server.min.delay</name>
> >   <value>0.0</value>
> >   <description>The minimum number of seconds the fetcher will delay
> >   between successive requests to the same server. This value is
> >   applicable ONLY if fetcher.threads.per.queue is greater than 1
> >   (i.e. the host blocking is turned off).
> >   </description>
> > </property>
> >
> > <property>
> >   <name>fetcher.max.crawl.delay</name>
> >   <value>5</value>
> >   <description>If the Crawl-Delay in robots.txt is set to greater than
> >   this value (in seconds) then the fetcher will skip this page,
> >   generating an error report. If set to -1 the fetcher will never skip
> >   such pages and will wait the amount of time retrieved from
> >   robots.txt Crawl-Delay, however long that might be.
> >   </description>
> > </property>
> >
> > ---------------------------------------------------------------------
> > Intel Electronics Ltd.
> >
> > This e-mail and any attachments may contain confidential material for
> > the sole use of the intended recipient(s). Any review or distribution
> > by others is strictly prohibited. If you are not the intended
> > recipient, please contact the sender and delete all copies.

