Re: Nutch running time
Shani,

What is your Nutch version, and which Hadoop version are you using? I was able to get this running using Nutch 1.7 on Hadoop YARN, for which I needed to make minor tweaks in the code.

On Fri, Jan 2, 2015 at 12:37 PM, Chaushu, Shani <shani.chau...@intel.com> wrote:
I'm running Nutch distributed, on 3 nodes... I thought there was more configuration that I missed.
RE: Nutch running time
Hi,

My Nutch version is 1.9. Hadoop is on CDH 5.2; I think it's Hadoop 2.3.
What changes did you make?

Thanks,
Shani

-----Original Message-----
From: Meraj A. Khan [mailto:mera...@gmail.com]
Sent: Saturday, January 03, 2015 22:36
To: user@nutch.apache.org
Subject: Re: Nutch running time

Shani, what is your Nutch version, and which Hadoop version are you using? I was able to get this running using Nutch 1.7 on Hadoop YARN, for which I needed to make minor tweaks in the code.
RE: Nutch running time
I'm running Nutch distributed, on 3 nodes... I thought there was more configuration that I missed.

-----Original Message-----
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Thursday, January 01, 2015 18:28
To: user@nutch.apache.org
Subject: Re: Nutch running time

You need to run Nutch as a MapReduce job/application on Hadoop. There is a lot of info on the Wiki on making it run in distributed mode, but if you can live with pseudo-distributed/local mode for the 20K pages that you need to fetch, it would save you a lot of work.
Re: Nutch running time
You need to run Nutch as a MapReduce job/application on Hadoop. There is a lot of info on the Wiki on making it run in distributed mode, but if you can live with pseudo-distributed/local mode for the 20K pages that you need to fetch, it would save you a lot of work.

On Thu, Jan 1, 2015 at 8:32 AM, Chaushu, Shani <shani.chau...@intel.com> wrote:
How can I configure the number of map and reduce tasks? Which parameter is it? Will more map and reduce tasks make it slower or faster?
Thanks
RE: Nutch running time
How can I configure the number of map and reduce tasks? Which parameter is it? Will more map and reduce tasks make it slower or faster?

Thanks

-----Original Message-----
From: Meraj A. Khan [mailto:mera...@gmail.com]
Sent: Thursday, January 01, 2015 15:17
To: user@nutch.apache.org
Subject: Re: Nutch running time

It seems kind of slow for 20k links. How many map and reduce tasks have you configured for each of the phases in a Nutch crawl?
Re: Nutch running time
It seems kind of slow for 20k links. How many map and reduce tasks have you configured for each of the phases in a Nutch crawl?

On Jan 1, 2015 6:00 AM, Chaushu, Shani <shani.chau...@intel.com> wrote:
Hi all,
I wanted to know how long Nutch should run. I changed the configuration and ran distributed (one master node and 3 slaves), and it ran for about a day for 20k links (depth 15). Is that normal, or should it take less?
This is my configuration:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). The total number of threads running in distributed mode will be the number of fetcher threads * number of nodes, as the fetcher has one map task per node.
  </description>
</property>

<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>150</value>
  <description>(EXPERT) The fetcher buffers the incoming URLs into queues based on the [host|domain|IP] (see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter. A large value requires more memory but can improve the performance of the fetch when the order of the URLs in the fetch list is not optimal.
  </description>
</property>

<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
  <description>This number is the maximum number of threads that should be allowed to access a queue at one time. Setting it to a value > 1 will cause the Crawl-Delay value from robots.txt to be ignored and the value of fetcher.server.min.delay to be used as a delay between successive requests to the same server instead of fetcher.server.delay.
  </description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between successive requests to the same server. This value is applicable ONLY if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking is turned off).
  </description>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>5</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip this page, generating an error report. If set to -1 the fetcher will never skip such pages and will wait the amount of time retrieved from robots.txt Crawl-Delay, however long that might be.
  </description>
</property>

---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
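
A minimal sketch of the arithmetic that the property descriptions above imply, using the values from the quoted configuration. The node count of 3 comes from the thread; treating each slave as running exactly one fetcher map task follows the fetcher.threads.fetch description and is an assumption about this cluster. The host counts passed to `effective_concurrency` are hypothetical, and the cap it computes (active queues times threads per queue) is an inference from the fetcher.threads.per.queue description, not something any participant stated.

```python
# Values copied from the configuration quoted in this thread.
FETCHER_THREADS_FETCH = 100       # fetcher.threads.fetch
QUEUE_DEPTH_MULTIPLIER = 150      # fetcher.queue.depth.multiplier
THREADS_PER_QUEUE = 10            # fetcher.threads.per.queue
FETCH_NODES = 3                   # assumption: one fetcher map task per slave node


def total_fetcher_threads(threads_per_task: int, fetch_tasks: int) -> int:
    """Threads across the cluster: fetcher threads * number of fetch tasks."""
    return threads_per_task * fetch_tasks


def queue_depth(threads_per_task: int, multiplier: int) -> int:
    """Queue depth: number of threads times the multiplier."""
    return threads_per_task * multiplier


def effective_concurrency(threads_per_task: int,
                          threads_per_queue: int,
                          active_queues: int) -> int:
    """Upper bound on simultaneous requests from one fetch task.

    Each queue (one per host in the default queue mode) admits at most
    threads_per_queue threads, so a small number of hosts can leave most
    of the fetcher threads idle.
    """
    return min(threads_per_task, threads_per_queue * active_queues)


if __name__ == "__main__":
    print("threads in cluster:",
          total_fetcher_threads(FETCHER_THREADS_FETCH, FETCH_NODES))
    print("queue depth:",
          queue_depth(FETCHER_THREADS_FETCH, QUEUE_DEPTH_MULTIPLIER))
    # Hypothetical host counts, for illustration only.
    for hosts in (1, 5, 50):
        print(f"{hosts} host(s):",
              effective_concurrency(FETCHER_THREADS_FETCH,
                                    THREADS_PER_QUEUE, hosts),
              "concurrent requests per task at most")
```

Because db.ignore.external.links is true, the crawl stays on the injected hosts. If the seed list covers only a few hosts, the per-queue limit rather than the thread count is probably the bound on throughput, which may be worth checking before adding more map tasks.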