That seems kind of slow for 20k links. How many map and reduce tasks have you configured for each of the phases in the Nutch crawl?

On Jan 1, 2015 6:00 AM, "Chaushu, Shani" <[email protected]> wrote:
> Hi all,
> I wanted to know how long Nutch should run.
> I changed the configuration and ran it distributed - one master node and 3
> slaves - and it took about a day for 20k links (depth 15).
> Is that normal, or should it take less?
> This is my configuration:
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>1000</value>
>   <description>The maximum number of outlinks that we'll process for a
>   page. If this value is nonnegative (>=0), at most
>   db.max.outlinks.per.page outlinks will be processed for a page;
>   otherwise, all outlinks will be processed.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>100</value>
>   <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are made at
>   once (each FetcherThread handles one connection). The total number of
>   threads running in distributed mode will be the number of fetcher
>   threads * number of nodes, as the fetcher has one map task per node.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.queue.depth.multiplier</name>
>   <value>150</value>
>   <description>(EXPERT) The fetcher buffers the incoming URLs into
>   queues based on the [host|domain|IP] (see param fetcher.queue.mode).
>   The depth of the queue is the number of threads times the value of
>   this parameter. A large value requires more memory but can improve
>   the performance of the fetch when the order of the URLs in the fetch
>   list is not optimal.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.threads.per.queue</name>
>   <value>10</value>
>   <description>This number is the maximum number of threads that should
>   be allowed to access a queue at one time. Setting it to a value > 1
>   will cause the Crawl-Delay value from robots.txt to be ignored and
>   the value of fetcher.server.min.delay to be used as a delay between
>   successive requests to the same server instead of
>   fetcher.server.delay.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>0.0</value>
>   <description>The minimum number of seconds the fetcher will delay
>   between successive requests to the same server. This value is
>   applicable ONLY if fetcher.threads.per.queue is greater than 1
>   (i.e. the host blocking is turned off).
>   </description>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>5</value>
>   <description>If the Crawl-Delay in robots.txt is set to greater than
>   this value (in seconds) then the fetcher will skip this page,
>   generating an error report. If set to -1 the fetcher will never skip
>   such pages and will wait the amount of time retrieved from the
>   robots.txt Crawl-Delay, however long that might be.
>   </description>
> </property>
>
> ---------------------------------------------------------------------
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
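For reference, the per-phase map/reduce parallelism asked about in the reply is a Hadoop setting rather than one of the Nutch properties quoted above. A minimal sketch of what such a fragment might look like, assuming the Hadoop 1.x property names (`mapred.map.tasks`, `mapred.reduce.tasks`) that were common with Nutch 1.x at the time; the concrete values here are illustrative, not taken from the thread:

```xml
<!-- Hypothetical mapred-site.xml fragment (illustration only).
     With one master and 3 slaves, giving each slave at least one
     map and one reduce task keeps all nodes busy in the generate,
     parse, and updatedb phases. -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>3</value>
    <description>Hint for the number of map tasks per job; the
    fetch phase itself runs one map task per fetch list.
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
    <description>Number of reduce tasks per job.</description>
  </property>
</configuration>
```

If these are left at their defaults, several of the crawl phases can end up running on a single node, which would be consistent with a day-long crawl for only 20k links.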

