Re: Nutch running time

2015-01-03 Thread Meraj A. Khan
Shani,

Which Nutch version and which Hadoop version are you using? I was able
to get this running with Nutch 1.7 on Hadoop YARN, for which I needed
to make minor tweaks in the code.


RE: Nutch running time

2015-01-03 Thread Chaushu, Shani
Hi,
My Nutch version is 1.9.
Hadoop is on CDH 5.2; I think it's Hadoop 2.3.
What changes did you make?

Thanks,
Shani


RE: Nutch running time

2015-01-02 Thread Chaushu, Shani
I'm running Nutch distributed, on 3 nodes...
I thought there was more configuration that I had missed.


Re: Nutch running time

2015-01-01 Thread S.L
You need to run Nutch as a MapReduce job/application on Hadoop. There is
a lot of info on the wiki about running it in distributed mode, but if you
can live with pseudo-distributed/local mode for the 20K pages that you
need to fetch, it would save you a lot of work.


RE: Nutch running time

2015-01-01 Thread Chaushu, Shani
How can I configure the number of map and reduce tasks? Which parameter is it?
Will more map and reduce tasks make it slower or faster?

Thanks


Re: Nutch running time

2015-01-01 Thread Meraj A. Khan
That seems rather slow for 20k links. How many map and reduce tasks have
you configured for each of the phases in a Nutch crawl?
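
For a sense of scale, the figures in the original post (20k links in about a day) can be turned into an average fetch rate. The following is only a back-of-the-envelope sketch in Python; the function name is made up for illustration, and "about a day" is assumed to mean exactly 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_fetch_rate(pages: int, elapsed_seconds: float) -> float:
    """Return the average number of pages fetched per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return pages / elapsed_seconds


# Figures from the original post: 20k links, roughly one day (assumed 24 h).
rate = average_fetch_rate(20_000, SECONDS_PER_DAY)
print(f"{rate:.2f} pages/s, one page every {1 / rate:.2f} s")
```

That is well under one page per second across the whole cluster, which is why the run time looks suspicious for a crawl of this size.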
On Jan 1, 2015 6:00 AM, Chaushu, Shani shani.chau...@intel.com wrote:



 Hi all,
 I wanted to know how long Nutch should run.
 I changed the configuration and ran distributed, with one master node and 3
 slaves, and it ran for about a day on 20k links (depth 15).
 Is that normal, or should it take less?
 This is my configuration:


 <property>
   <name>db.ignore.external.links</name>
   <value>true</value>
   <description>If true, outlinks leading from a page to external hosts
   will be ignored. This is an effective way to limit the crawl to include
   only initially injected hosts, without creating complex URLFilters.
   </description>
 </property>

 <property>
   <name>db.max.outlinks.per.page</name>
   <value>1000</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
 </property>

 <property>
   <name>fetcher.threads.fetch</name>
   <value>100</value>
   <description>The number of FetcherThreads the fetcher should use.
   This also determines the maximum number of requests that are
   made at once (each FetcherThread handles one connection). The total
   number of threads running in distributed mode will be the number of
   fetcher threads * number of nodes as fetcher has one map task per node.
   </description>
 </property>

 <property>
   <name>fetcher.queue.depth.multiplier</name>
   <value>150</value>
   <description>(EXPERT) The fetcher buffers the incoming URLs into queues
   based on the [host|domain|IP] (see param fetcher.queue.mode). The depth of
   the queue is the number of threads times the value of this parameter.
   A large value requires more memory but can improve the performance of the
   fetch when the order of the URLs in the fetch list is not optimal.
   </description>
 </property>

 <property>
   <name>fetcher.threads.per.queue</name>
   <value>10</value>
   <description>This number is the maximum number of threads that
   should be allowed to access a queue at one time. Setting it to
   a value > 1 will cause the Crawl-Delay value from robots.txt to
   be ignored and the value of fetcher.server.min.delay to be used
   as a delay between successive requests to the same server instead
   of fetcher.server.delay.
   </description>
 </property>

 <property>
   <name>fetcher.server.min.delay</name>
   <value>0.0</value>
   <description>The minimum number of seconds the fetcher will delay between
   successive requests to the same server. This value is applicable ONLY
   if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
   is turned off).
   </description>
 </property>

 <property>
   <name>fetcher.max.crawl.delay</name>
   <value>5</value>
   <description>
   If the Crawl-Delay in robots.txt is set to greater than this value (in
   seconds) then the fetcher will skip this page, generating an error report.
   If set to -1 the fetcher will never skip such pages and will wait the
   amount of time retrieved from robots.txt Crawl-Delay, however long that
   might be.
   </description>
 </property>





 -
 Intel Electronics Ltd.

 This e-mail and any attachments may contain confidential material for
 the sole use of the intended recipient(s). Any review or distribution
 by others is strictly prohibited. If you are not the intended
 recipient, please contact the sender and delete all copies.
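
The property descriptions quoted above state two simple products: the total number of fetcher threads in distributed mode (fetcher threads times number of nodes) and the queue depth (threads times fetcher.queue.depth.multiplier). A minimal Python sketch of that arithmetic for the posted values follows. The class and field names are made up for illustration, and it assumes that exactly one fetcher map task runs on each of the 3 slave nodes, as the fetcher.threads.fetch description suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetcherSettings:
    threads_fetch: int           # fetcher.threads.fetch
    queue_depth_multiplier: int  # fetcher.queue.depth.multiplier
    nodes: int                   # assumed: one fetcher map task per node

    def total_threads(self) -> int:
        # "number of fetcher threads * number of nodes"
        return self.threads_fetch * self.nodes

    def queue_depth(self) -> int:
        # "the number of threads times the value of this parameter"
        return self.threads_fetch * self.queue_depth_multiplier


# Values from the posted configuration; 3 is the number of slave nodes.
posted = FetcherSettings(threads_fetch=100, queue_depth_multiplier=150, nodes=3)
print(posted.total_threads())  # 300
print(posted.queue_depth())    # 15000
```

If the fetch phase actually runs as fewer map tasks than there are nodes, the real thread count would be lower than this estimate.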