Shani,

What is your Nutch version, and which Hadoop version are you using? I was able to get this running using Nutch 1.7 on Hadoop YARN, for which I needed to make minor tweaks in the code.
On Fri, Jan 2, 2015 at 12:37 PM, Chaushu, Shani <[email protected]> wrote:

> I'm running Nutch distributed, on 3 nodes... I thought there was more
> configuration that I missed.
>
> -----Original Message-----
> From: S.L [mailto:[email protected]]
> Sent: Thursday, January 01, 2015 18:28
> To: [email protected]
> Subject: Re: Nutch running time
>
> You need to run Nutch as a MapReduce job/application on Hadoop. There is a
> lot of info on the wiki on making it run in distributed mode, but if you
> can live with the pseudo-distributed/local mode for the 20K pages that you
> need to fetch, it would save you a lot of work.
>
> On Thu, Jan 1, 2015 at 8:32 AM, Chaushu, Shani <[email protected]> wrote:
>
>> How can I configure the number of map and reduce tasks? Which parameter
>> is it? Will more map and reduce tasks make it slower or faster?
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Meraj A. Khan [mailto:[email protected]]
>> Sent: Thursday, January 01, 2015 15:17
>> To: [email protected]
>> Subject: Re: Nutch running time
>>
>> It seems kind of slow for 20k links. How many map and reduce tasks have
>> you configured for each one of the phases in a Nutch crawl?
>>
>> On Jan 1, 2015 6:00 AM, "Chaushu, Shani" <[email protected]> wrote:
>>
>> > Hi all,
>> > I wanted to know how long Nutch should run.
>> > I changed the configuration and ran distributed - one master node and
>> > 3 slaves - and it ran for about a day on 20k links (depth 15).
>> > Is that normal? Or should it take less?
>> > This is my configuration:
>> >
>> > <property>
>> >   <name>db.ignore.external.links</name>
>> >   <value>true</value>
>> >   <description>If true, outlinks leading from a page to external hosts
>> >   will be ignored. This is an effective way to limit the crawl to
>> >   include only initially injected hosts, without creating complex
>> >   URLFilters.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>db.max.outlinks.per.page</name>
>> >   <value>1000</value>
>> >   <description>The maximum number of outlinks that we'll process for a
>> >   page. If this value is nonnegative (>=0), at most
>> >   db.max.outlinks.per.page outlinks will be processed for a page;
>> >   otherwise, all outlinks will be processed.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.threads.fetch</name>
>> >   <value>100</value>
>> >   <description>The number of FetcherThreads the fetcher should use.
>> >   This also determines the maximum number of requests that are made at
>> >   once (each FetcherThread handles one connection). The total number of
>> >   threads running in distributed mode will be the number of fetcher
>> >   threads * number of nodes, as the fetcher has one map task per node.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.queue.depth.multiplier</name>
>> >   <value>150</value>
>> >   <description>(EXPERT) The fetcher buffers the incoming URLs into
>> >   queues based on the [host|domain|IP] (see param fetcher.queue.mode).
>> >   The depth of the queue is the number of threads times the value of
>> >   this parameter. A large value requires more memory but can improve
>> >   the performance of the fetch when the order of the URLs in the fetch
>> >   list is not optimal.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.threads.per.queue</name>
>> >   <value>10</value>
>> >   <description>This number is the maximum number of threads that should
>> >   be allowed to access a queue at one time. Setting it to a value > 1
>> >   will cause the Crawl-Delay value from robots.txt to be ignored and
>> >   the value of fetcher.server.min.delay to be used as a delay between
>> >   successive requests to the same server instead of
>> >   fetcher.server.delay.
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.server.min.delay</name>
>> >   <value>0.0</value>
>> >   <description>The minimum number of seconds the fetcher will delay
>> >   between successive requests to the same server. This value is
>> >   applicable ONLY if fetcher.threads.per.queue is greater than 1
>> >   (i.e. the host blocking is turned off).
>> >   </description>
>> > </property>
>> >
>> > <property>
>> >   <name>fetcher.max.crawl.delay</name>
>> >   <value>5</value>
>> >   <description>
>> >   If the Crawl-Delay in robots.txt is set to greater than this value
>> >   (in seconds) then the fetcher will skip this page, generating an
>> >   error report. If set to -1 the fetcher will never skip such pages and
>> >   will wait the amount of time retrieved from robots.txt Crawl-Delay,
>> >   however long that might be.
>> >   </description>
>> > </property>
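For reference on the "which parameter is it?" question: the per-job map and reduce task counts are Hadoop settings rather than Nutch ones. On Hadoop 1.x they are usually set with mapred.map.tasks and mapred.reduce.tasks; on Hadoop 2 / YARN the equivalent names are mapreduce.job.maps and mapreduce.job.reduces. The following is a minimal, hypothetical sketch in the same nutch-site.xml style as the properties quoted above; the values are illustrative only, not recommendations from this thread:

<!-- Illustrative sketch (not from the thread above): requesting map/reduce
     task counts for Nutch's MapReduce jobs on Hadoop 1.x. Note that
     mapred.map.tasks is only a hint to the framework; the actual number of
     map tasks also depends on how the input is split. -->
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
  <description>Hint for the number of map tasks per job
  (mapreduce.job.maps on Hadoop 2 / YARN).
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>Number of reduce tasks per job
  (mapreduce.job.reduces on Hadoop 2 / YARN).
  </description>
</property>

These properties can also be overridden per job with -D on the command line. Note that in Nutch 1.x the number of fetcher map tasks is in practice governed by how many fetch lists the generate step produces (the generator's -numFetchers option), so raising the reduce count alone will not necessarily speed up the fetch phase.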

