How can I configure the number of map and reduce tasks? Which parameter 
controls it? Will more map and reduce tasks make the crawl slower or faster?
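
Is it the Hadoop mapred.map.tasks / mapred.reduce.tasks properties? I assume 
they would go in mapred-site.xml, something like this (the values below are 
just a guess, not tuned):

```xml
<!-- mapred-site.xml (Hadoop 1.x / classic MapReduce) - example values only -->
<property>
        <name>mapred.map.tasks</name>
        <value>12</value>
        <description>A hint for the number of map tasks per job; the actual
                number is ultimately determined by the input splits.</description>
</property>

<property>
        <name>mapred.reduce.tasks</name>
        <value>4</value>
        <description>The number of reduce tasks per job.</description>
</property>
```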

Thanks

-----Original Message-----
From: Meraj A. Khan [mailto:[email protected]] 
Sent: Thursday, January 01, 2015 15:17
To: [email protected]
Subject: Re: Nutch running time

That seems kind of slow for 20k links. How many map and reduce tasks have you 
configured for each of the phases in a Nutch crawl?
On Jan 1, 2015 6:00 AM, "Chaushu, Shani" <[email protected]> wrote:

>
>
> Hi all,
>  I wanted to know how long Nutch should run.
> I changed the configuration and ran it distributed - one master node and 
> 3 slaves - and it ran for about a day on 20k links (depth 15).
> Is that normal, or should it take less?
> These are my configurations:
>
>
>         <property>
>                 <name>db.ignore.external.links</name>
>                 <value>true</value>
>                 <description>If true, outlinks leading from a page to 
> external hosts
>                         will be ignored. This is an effective way to 
> limit the crawl to include
>                         only initially injected hosts, without 
> creating complex URLFilters.
>                 </description>
>         </property>
>
>         <property>
>                 <name>db.max.outlinks.per.page</name>
>                 <value>1000</value>
>                 <description>The maximum number of outlinks that we'll 
> process for a page.
>                         If this value is nonnegative (>=0), at most 
> db.max.outlinks.per.page outlinks
>                         will be processed for a page; otherwise, all 
> outlinks will be processed.
>                 </description>
>         </property>
>
>
>         <property>
>                 <name>fetcher.threads.fetch</name>
>                 <value>100</value>
>                 <description>The number of FetcherThreads the fetcher 
> should use.
>                         This also determines the maximum number of 
> requests that are
>                         made at once (each FetcherThread handles one 
> connection). The total
>                         number of threads running in distributed mode 
> will be the number of
>                         fetcher threads * number of nodes as fetcher 
> has one map task per node.
>                 </description>
>         </property>
>
>
>         <property>
>                 <name>fetcher.queue.depth.multiplier</name>
>                 <value>150</value>
>                 <description>(EXPERT)The fetcher buffers the incoming 
> URLs into queues based on the [host|domain|IP]
> (see param fetcher.queue.mode). The depth of 
> the queue is the number of threads times the value of this parameter.
>                         A large value requires more memory but can 
> improve the performance of the fetch when the order of the URLs in the fetch 
> list
>                         is not optimal.
>                 </description>
>         </property>
>
>
>         <property>
>                 <name>fetcher.threads.per.queue</name>
>                 <value>10</value>
>                  <description>This number is the maximum number of 
> threads that
>                         should be allowed to access a queue at one time.
> Setting it to
>                         a value > 1 will cause the Crawl-Delay value 
> from robots.txt to
>                         be ignored and the value of 
> fetcher.server.min.delay to be used
>                         as a delay between successive requests to the 
> same server instead
>                         of fetcher.server.delay.
>                 </description>
>         </property>
>
>         <property>
>                 <name>fetcher.server.min.delay</name>
>                 <value>0.0</value>
>                 <description>The minimum number of seconds the fetcher 
> will delay between
>                         successive requests to the same server. This 
> value is applicable ONLY
>                         if fetcher.threads.per.queue is greater than 1 
> (i.e. the host blocking
>                         is turned off).
>                 </description>
>         </property>
>
>
>         <property>
>                 <name>fetcher.max.crawl.delay</name>
>                 <value>5</value>
>                 <description>
>                         If the Crawl-Delay in robots.txt is set to 
> greater than this value (in
>                         seconds) then the fetcher will skip this page, 
> generating an error report.
>                         If set to -1 the fetcher will never skip such 
> pages and will wait the
>                         amount of time retrieved from robots.txt 
> Crawl-Delay, however long that
>                         might be.
>                 </description>
>         </property>
>
>
>
>
>
> ---------------------------------------------------------------------
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for 
> the sole use of the intended recipient(s). Any review or distribution 
> by others is strictly prohibited. If you are not the intended 
> recipient, please contact the sender and delete all copies.
>