Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

2015-03-02 Thread Meraj A. Khan
Jorge ,

I think I spoke too soon. If I use the protocol-httpclient plugin, I
am unable to fetch any page using the parsechecker.

I get a "[Fatal Error] :1:1: Content is not allowed in prolog." error.

Are there any known issues with using protocol-httpclient? I am using
Nutch 1.7 and have the following settings in my nutch-site.xml:

<!-- Added based on the suggestion from nutch mailing list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>NOTE: at the moment this works only for protocol-httpclient.
  If true, use HTTP 1.1, if false use HTTP 1.0 .
  </description>
</property>


Thanks.

On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González
jlbetanco...@uci.cu wrote:
 The general answer is: it depends. It is usually polite to present your robot
 to the website so the webmaster knows what is accessing the site; this is why
 Google and a lot of other search engines (big and small) use a distinctive
 name for their crawlers/bots. That being said, the first site that you
 mention works fine for a quick parsechecker that I've executed:

 ➜  local  bin/nutch parsechecker 
 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 fetching: 
 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 parsing: 
 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 contentType: text/html
 signature: 8e90c6d581f27c36828d433f746e4d7a
 -
 Url
 ---

 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 -
 ParseData
 -

 Version: 5
 Status: success(1,0)
 Title: Dressing for the Dark
 Outlinks: 151
   outlink: toUrl: 
 http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css 
 anchor:
   outlink: toUrl: 
 http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
   outlink: toUrl: 
 http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
   outlink: toUrl: 
 http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
 ...

 (trimmed due to length)

 As for the second one I wasn't able to do a test; the site blocks access
 from my IP/country:

 This request is blocked by the SonicWALL Gateway Geo IP Service.
 Country Name:Cuba.

 Reading your experience with this website, it looks like an error in the
 website's programming; basically I'm assuming they are saying "if your User
 Agent is not X, Y or Z then serve the mobile version", which could be worth
 reporting.

 Trying to fool the website by giving the impression that your bot is a
 regular user (tweaking the user agent) could work for now, but it could draw
 the webmaster's attention and be a cause for blocking your access; this
 depends a lot on the webmaster :). But in your particular case it could be
 your only solution, if the webmaster doesn't have a problem with the increase
 in traffic.

 Regards,

 - Original Message -
 From: Meraj A. Khan mera...@gmail.com
 To: user@nutch.apache.org
 Sent: Saturday, February 28, 2015 12:09:47 AM
 Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a 
 browser?

 Hi Jorge,

 Yes, I was exploring changing the http.agent.name property value in
 case where the sites either serve the mobile version or outright deny
 the request if no agent is specified.

 For example, the following URL will give a "Request Rejected" response if
 the User-Agent is not specified.

 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod

 And the following URL will serve a mobile version.

 http://www.techforless.com/cgi-bin/tech4less/60PN5000.

 So is it a good practice to set the http.agent.name to something
 like the value below, to mimic a Chrome browser?

 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
 Chrome/41.0.2228.0 Safari/537.36

 On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González
 jlbetanco...@uci.cu wrote:
 Hi Meraj,

 Can you provide an example URL and explain exactly what you're after? If the
 page you're trying to fetch has a lot of javascript/ajax, keep in mind that
 browsers do a lot of work with the downloaded page. For instance, when you
 enter a page the HTML is downloaded, the referenced CSS files are also
 fetched and applied to the HTML (also inline styles, etc.), and any
 referenced javascript is also downloaded and executed on top of the loaded
 DOM (also inline script tags). The same applies to fonts, etc. The browser
 knows how to deal with all these resources, and the CSS is applied depending
 on which browser you're using. The Nutch crawler only
 knows

Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

2015-03-02 Thread Meraj A. Khan
Thanks Jorge, I appreciate your help.

On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González
jlbetanco...@uci.cu wrote:
 The general answer is: it depends. It is usually polite to present your robot
 to the website so the webmaster knows what is accessing the site; this is why
 Google and a lot of other search engines (big and small) use a distinctive
 name for their crawlers/bots. That being said, the first site that you
 mention works fine for a quick parsechecker that I've executed:

 ➜  local  bin/nutch parsechecker 
 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 fetching: 
 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 parsing: 
 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 contentType: text/html
 signature: 8e90c6d581f27c36828d433f746e4d7a
 -
 Url
 ---

 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 -
 ParseData
 -

 Version: 5
 Status: success(1,0)
 Title: Dressing for the Dark
 Outlinks: 151
   outlink: toUrl: 
 http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css 
 anchor:
   outlink: toUrl: 
 http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
   outlink: toUrl: 
 http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
   outlink: toUrl: 
 http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
 ...

 (trimmed due to length)

 As for the second one I wasn't able to do a test; the site blocks access
 from my IP/country:

 This request is blocked by the SonicWALL Gateway Geo IP Service.
 Country Name:Cuba.

 Reading your experience with this website, it looks like an error in the
 website's programming; basically I'm assuming they are saying "if your User
 Agent is not X, Y or Z then serve the mobile version", which could be worth
 reporting.

 Trying to fool the website by giving the impression that your bot is a
 regular user (tweaking the user agent) could work for now, but it could draw
 the webmaster's attention and be a cause for blocking your access; this
 depends a lot on the webmaster :). But in your particular case it could be
 your only solution, if the webmaster doesn't have a problem with the increase
 in traffic.

 Regards,

 - Original Message -
 From: Meraj A. Khan mera...@gmail.com
 To: user@nutch.apache.org
 Sent: Saturday, February 28, 2015 12:09:47 AM
 Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a 
 browser?

 Hi Jorge,

 Yes, I was exploring changing the http.agent.name property value in
 case where the sites either serve the mobile version or outright deny
 the request if no agent is specified.

 For example, the following URL will give a "Request Rejected" response if
 the User-Agent is not specified.

 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod

 And the following URL will serve a mobile version.

 http://www.techforless.com/cgi-bin/tech4less/60PN5000.

 So is it a good practice to set the http.agent.name to something
 like the value below, to mimic a Chrome browser?

 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
 Chrome/41.0.2228.0 Safari/537.36

 On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González
 jlbetanco...@uci.cu wrote:
 Hi Meraj,

 Can you provide an example URL and explain exactly what you're after? If the
 page you're trying to fetch has a lot of javascript/ajax, keep in mind that
 browsers do a lot of work with the downloaded page. For instance, when you
 enter a page the HTML is downloaded, the referenced CSS files are also
 fetched and applied to the HTML (also inline styles, etc.), and any
 referenced javascript is also downloaded and executed on top of the loaded
 DOM (also inline script tags). The same applies to fonts, etc. The browser
 knows how to deal with all these resources, and the CSS is applied depending
 on which browser you're using. The Nutch crawler only knows about the
 downloaded HTML (similar to what you see when you view the source code of a
 webpage); it doesn't know what a CSS style is. Basically the crawler is only
 interested in the links and the textual/binary content of the webpage, so
 when a page is fetched by Nutch the HTML is downloaded but the other
 resources (fonts, styles, javascript) are not applied to the fetched page.

 Tweaking the http.agent.name property in nutch-site.xml will only help
 with those sites that change their response based on the user agent
 (one version for mobile and a different one for desktop browsers). This
 approach is being replaced by responsive design, meaning that the user agent
 is no longer important for how the page is rendered.

 In the current trunk of the upcoming 1.10 version a plugin has been merged 
 that could

Re: Can anyone fetch this page?

2015-02-27 Thread Meraj A. Khan
Can you please set the user agent to something that resembles a
browser, Chrome for example, and test? I just posted a query
yesterday for a similar issue where the mobile version of the site
gets served up instead of a 500.
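
For reference (just a suggestion on my part), you could change
http.agent.name in conf/nutch-site.xml and then re-run the parsechecker
command used elsewhere on this list against the same site, e.g.:

bin/nutch parsechecker http://www.nature.com/

and see whether the 500 goes away.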

On Fri, Feb 27, 2015 at 1:08 PM, Iain Lopata ilopa...@hotmail.com wrote:
 I get a 500.  Have tried removing Nutch from my user-agent string and still 
 get the same result.

 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Friday, February 27, 2015 12:05 PM
 To: user@nutch.apache.org
 Subject: RE: Can anyone fetch this page?

 Seems fine to me
 http://oldservice.openindex.io/extract.php?url=http%3A%2F%2Fwww.nature.com%2Fnature%2Fjournal%2Fv518%2Fn7540%2Ffull%2Fnature14236.html


 -Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Friday 27th February 2015 18:56
 To: user@nutch.apache.org
 Subject: Can anyone fetch this page?

 Hi Folks,
 I was getting 500 internal server error using Nutch trunk when
 attempting to fetch content from this domain.
 http://www.nature.com
 Just for detail, Nature.com is a catalogue of journals and science
 resources, including the journal *Nature*. It publishes science news and
 articles across a wide range of scientific fields, so it is nothing
 malicious or sensitive/offensive content-wise.
 Can anyone else fetch this URL?
 I can get it with curl and wget but not Nutch.
 Thanks
 Lewis


 --
 *Lewis*




How to make Nutch 1.7 request mimic a browser?

2015-02-26 Thread Meraj A. Khan
In some instances the content that is downloaded in the Fetch phase from an
HTTP URL is not what you would get if you were to make the same request
from a well-known browser such as Google Chrome; that is
because the server is expecting a user agent value that represents a
browser.

There is an http.agent.name property in nutch-site.xml. Is it the
property that should be used to set the user agent so that the server
responds to a Nutch GET request the same way as it would to a request
from a browser? Or is there another configurable property?

For example the user agent value for a Chrome browser is below.

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36
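
For illustration, setting that value would look roughly like the sketch
below in conf/nutch-site.xml (the value is just the example string above;
as far as I recall Nutch may also append http.agent.version and the other
http.agent.* properties to the final header, so the result can differ
slightly):

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36</value>
</property>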


Thanks.


NUTCH-762 Generate Multiple Segments

2015-02-18 Thread Meraj A. Khan
Hi Folks,

I am facing the exact same problem that is described in JIRA NUTCH-762,
i.e. generate/update takes an excessive amount of time while the
actual fetch takes very little time compared to the generate time.

The JIRA issue includes a patch to allow generating multiple
segments in a single generate phase; however, I was not able to do so.

How can I generate multiple segments in a single generate phase? Any
help would be greatly appreciated. I am using Nutch 1.7 on YARN 2.3.0.
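
For reference, the generate invocation with the options added by that patch
looks roughly like the sketch below (paths and numbers are placeholders; the
same flags appear in the "Generate multiple segments" thread further down
this archive):

bin/nutch generate crawl/crawldb crawl/segments -topN 50000 \
    -maxNumSegments 10 -numFetchers 10 -noFilter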

Thanks.


Re: Depth option

2015-01-04 Thread Meraj A. Khan
Shadi,

I am not sure what will be the case if example.com itself has external
links; I think it will fetch those with depth 1. But if you want to disable
the fetching of external links, just set the db.ignore.external.links
property to true; you don't need any URL filter set up if you do so.
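
For reference, the property I mean is the one below (the same description
text appears in the "Nutch running time" thread further down this archive):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored.</description>
</property>
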
On Jan 4, 2015 10:37 AM, Shadi Saleh propat...@gmail.com wrote:

 Thanks Adil,

 crawldb is not empty, now it contains old and current folder, should I
 clean it before I start new crawl? what is the proper way?

 Best

 On Sun, Jan 4, 2015 at 4:28 PM, Adil Ishaque Abbasi aiabb...@gmail.com
 wrote:

  Yes, you are correct. no need to use the url filter. But this will work
  only if your crawldb remains empty.
 
  Regards
  Adil I. Abbasi
 
  On Sun, Jan 4, 2015 at 8:22 PM, Shadi Saleh propat...@gmail.com wrote:
 
   Hello,
  
   I want to check this point please.
  
    I am using crawl to crawl www.example.com with the depth=1 option. So if
    that website contains a URL to another website, e.g. www.example2.com,
    Nutch will not crawl it. Is it enough to use the depth option, or should
    I use a URL filter?
  
  
   Best
  
  
   --
  
  
  
  
    *Shadi Saleh*
    *Ph.D Student, Institute of Formal and Applied Linguistics*
    *Faculty of Mathematics and Physics - Charles University in Prague*
    *16017 Prague 6 - Czech Republic, Mob +420773515578*
  
 



 --




 *Shadi Saleh*
 *Ph.D Student, Institute of Formal and Applied Linguistics*
 *Faculty of Mathematics and Physics - Charles University in Prague*
 *16017 Prague 6 - Czech Republic, Mob +420773515578*



Re: Nutch running time

2015-01-03 Thread Meraj A. Khan
Shani,

What is your Nutch version, and which Hadoop version are you using? I
was able to get this running using Nutch 1.7 on Hadoop YARN, for which
I needed to make minor tweaks in the code.

On Fri, Jan 2, 2015 at 12:37 PM, Chaushu, Shani shani.chau...@intel.com wrote:
 I'm running nutch distributed, on 3 nodes...
 I thought there is more configuration that I missed..

 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Thursday, January 01, 2015 18:28
 To: user@nutch.apache.org
 Subject: Re: Nutch running time

 You need to run Nutch as a MapReduce job/application on Hadoop; there is a
 lot of info on the Wiki on making it run in distributed mode. But if you can
 live with the pseudo-distributed/local mode for the 20K pages that you need
 to fetch, it would save you a lot of work.

 On Thu, Jan 1, 2015 at 8:32 AM, Chaushu, Shani shani.chau...@intel.com
 wrote:

 How can I configure the number of map and reduce tasks? Which parameter is
 it? Will more map and reduce tasks make it slower or faster?

 Thanks

 -Original Message-
 From: Meraj A. Khan [mailto:mera...@gmail.com]
 Sent: Thursday, January 01, 2015 15:17
 To: user@nutch.apache.org
 Subject: Re: Nutch running time

 It seems kind of slow for 20k links. How many map and reduce tasks
 have you configured for each one of the phases in a Nutch crawl?
 On Jan 1, 2015 6:00 AM, Chaushu, Shani shani.chau...@intel.com wrote:

 
 
  Hi all,
   I wanted to know how long nutch should run.
  I changed the configuration and ran distributed - one master node and
  3 slaves - and it ran for 20k links for about a day (depth 15).
  Is that normal? Or should it take less?
  This is my configuration:
 
 
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>
  </property>

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>1000</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.
    </description>
  </property>

  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
    <description>The number of FetcherThreads the fetcher should use.
    This is also determines the maximum number of requests that are made at
    once (each FetcherThread handles one connection). The total number of
    threads running in distributed mode will be the number of fetcher threads
    * number of nodes as fetcher has one map task per node.
    </description>
  </property>

  <property>
    <name>fetcher.queue.depth.multiplier</name>
    <value>150</value>
    <description>(EXPERT) The fetcher buffers the incoming URLs into queues
    based on the [host|domain|IP] (see param fetcher.queue.mode). The depth of
    the queue is the number of threads times the value of this parameter.
    A large value requires more memory but can improve the performance of the
    fetch when the order of the URLs in the fetch list is not optimal.
    </description>
  </property>

  <property>
    <name>fetcher.threads.per.queue</name>
    <value>10</value>
    <description>This number is the maximum number of threads that
    should be allowed to access a queue at one time. Setting it to
    a value > 1 will cause the Crawl-Delay value from robots.txt to
    be ignored and the value of fetcher.server.min.delay to be used
    as a delay between successive requests to the same server instead
    of fetcher.server.delay.
    </description>
  </property>

  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.0</value>
    <description>The minimum number of seconds the fetcher will delay between
    successive requests to the same server. This value is applicable ONLY
    if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
    is turned off).
    </description>
  </property>

Re: Question about db.default.fetch.interval.

2015-01-03 Thread Meraj A. Khan
Reposting my question.

Hi All,

I have a quick question regarding the db.default.fetch.interval
parameter. I have currently set it to 15 days; however, my crawl
cycle itself is going beyond 15 days and up to 30 days. Since I have
set db.default.fetch.interval to only 15 days, is there a possibility
that, even before a complete crawl has finished, an already fetched
page will get re-fetched before an un-fetched page is fetched, thereby
fetching a smaller number of distinct pages?

I guess I am trying to find out whether setting db.default.fetch.interval
to a value less than the time it takes to do one complete crawl of the
web will lead to some kind of infinite loop where recently fetched
pages are re-fetched before the completely un-fetched ones, because the
value of the interval is less than the total crawl time.


Thanks.

Thanks.

On Sun, Dec 28, 2014 at 11:18 AM, Meraj A. Khan mera...@gmail.com wrote:
 Hi All,

 I have a quick question regarding the db.default.fetch.interval
 parameter. I have currently set it to 15 days; however, my crawl
 cycle itself is going beyond 15 days and up to 30 days. Since I have
 set db.default.fetch.interval to only 15 days, is there a possibility
 that, even before a complete crawl has finished, an already fetched
 page will get re-fetched before an un-fetched page is fetched, thereby
 fetching a smaller number of distinct pages?

 I guess I am trying to find out whether db.default.fetch.interval should be
 set to at least be greater than one comprehensive crawl cycle time.

 Thanks.


Re: Nutch running time

2015-01-01 Thread Meraj A. Khan
It seems kind of slow for 20k links. How many map and reduce tasks have
you configured for each one of the phases in a Nutch crawl?
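
(For what it's worth, and from memory so treat it as approximate: the stock
bin/crawl script passes the task counts to each job through Hadoop's generic
-D options, along the lines of the sketch below, where the numbers are
placeholders.)

numTasks=3   # e.g. one reduce task per slave node
commonOptions="-D mapred.reduce.tasks=$numTasks \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks.speculative.execution=false"
bin/nutch generate $commonOptions crawl/crawldb crawl/segments -topN 50000
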
On Jan 1, 2015 6:00 AM, Chaushu, Shani shani.chau...@intel.com wrote:



 Hi all,
  I wanted to know how long nutch should run.
 I changed the configuration and ran distributed - one master node and 3
 slaves - and it ran for 20k links for about a day (depth 15).
 Is that normal? Or should it take less?
 This is my configuration:


 <property>
   <name>db.ignore.external.links</name>
   <value>true</value>
   <description>If true, outlinks leading from a page to external hosts
   will be ignored. This is an effective way to limit the crawl to include
   only initially injected hosts, without creating complex URLFilters.
   </description>
 </property>

 <property>
   <name>db.max.outlinks.per.page</name>
   <value>1000</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
   outlinks will be processed for a page; otherwise, all outlinks will be
   processed.
   </description>
 </property>

 <property>
   <name>fetcher.threads.fetch</name>
   <value>100</value>
   <description>The number of FetcherThreads the fetcher should use.
   This is also determines the maximum number of requests that are made at
   once (each FetcherThread handles one connection). The total number of
   threads running in distributed mode will be the number of fetcher threads
   * number of nodes as fetcher has one map task per node.
   </description>
 </property>

 <property>
   <name>fetcher.queue.depth.multiplier</name>
   <value>150</value>
   <description>(EXPERT) The fetcher buffers the incoming URLs into queues
   based on the [host|domain|IP] (see param fetcher.queue.mode). The depth of
   the queue is the number of threads times the value of this parameter.
   A large value requires more memory but can improve the performance of the
   fetch when the order of the URLs in the fetch list is not optimal.
   </description>
 </property>

 <property>
   <name>fetcher.threads.per.queue</name>
   <value>10</value>
   <description>This number is the maximum number of threads that
   should be allowed to access a queue at one time. Setting it to
   a value > 1 will cause the Crawl-Delay value from robots.txt to
   be ignored and the value of fetcher.server.min.delay to be used
   as a delay between successive requests to the same server instead
   of fetcher.server.delay.
   </description>
 </property>

 <property>
   <name>fetcher.server.min.delay</name>
   <value>0.0</value>
   <description>The minimum number of seconds the fetcher will delay between
   successive requests to the same server. This value is applicable ONLY
   if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
   is turned off).
   </description>
 </property>

 <property>
   <name>fetcher.max.crawl.delay</name>
   <value>5</value>
   <description>
   If the Crawl-Delay in robots.txt is set to greater than this value (in
   seconds) then the fetcher will skip this page, generating an error report.
   If set to -1 the fetcher will never skip such pages and will wait the
   amount of time retrieved from robots.txt Crawl-Delay, however long that
   might be.
   </description>
 </property>





 -
 Intel Electronics Ltd.

 This e-mail and any attachments may contain confidential material for
 the sole use of the intended recipient(s). Any review or distribution
 by others is strictly prohibited. If you are not the intended
 recipient, please contact the sender and delete all copies.



Re: nutch on amazon emr

2015-01-01 Thread Meraj A. Khan
I suggest running it using the stock bin/crawl script from the command
line first, and then try using the jar that you mentioned.
On Jan 1, 2015 12:04 PM, Adil Ishaque Abbasi aiabb...@gmail.com wrote:

 I tried to run it through custom jar step using script runner jar i.e.
 s3://elasticmapreduce/libs/script-runner/script-runner.jar

 Regards
 Adil I. Abbasi

 On Thu, Jan 1, 2015 at 8:51 PM, Meraj A. Khan mera...@gmail.com wrote:

  Can you give us the command that you use to start the crawl?
  On Jan 1, 2015 10:28 AM, Adil Ishaque Abbasi aiabb...@gmail.com
 wrote:
 
    When I try to run the nutch crawl script on Amazon EMR, it gives me this
    error:

    /mnt/var/lib/hadoop/steps/s-3VT1QRVSURPSH/./crawl: line 81:
    hdfs:///nutch/bin/nutch: No such file or directory
    Command exiting with ret '0'

    Though the nutch script is located at hdfs:///nutch/bin/, it still gives
    this error.
  
   Any idea what is it that I'm doing wrong ?
  
  
  
  
   Regards
   Adil
  
 



Re: nutch on amazon emr

2015-01-01 Thread Meraj A. Khan
Can you give us the command that you use to start the crawl?
On Jan 1, 2015 10:28 AM, Adil Ishaque Abbasi aiabb...@gmail.com wrote:

 When I try to run the nutch crawl script on Amazon EMR, it gives me this
 error:

 /mnt/var/lib/hadoop/steps/s-3VT1QRVSURPSH/./crawl: line 81:
 hdfs:///nutch/bin/nutch: No such file or directory
 Command exiting with ret '0'


 Though the nutch script is located at hdfs:///nutch/bin/, it still gives this
 error.

 Any idea what is it that I'm doing wrong ?




 Regards
 Adil



Question about db.default.fetch.interval.

2014-12-28 Thread Meraj A. Khan
Hi All,

I have a quick question regarding the db.default.fetch.interval
parameter. I have currently set it to 15 days; however, my crawl
cycle itself is going beyond 15 days and up to 30 days. Since I have
set db.default.fetch.interval to only 15 days, is there a possibility
that, even before a complete crawl has finished, an already fetched
page will get re-fetched before an un-fetched page is fetched, thereby
fetching a smaller number of distinct pages?

I guess I am trying to find out whether db.default.fetch.interval should be
set to at least be greater than one comprehensive crawl cycle time.
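
For reference, the setting I am describing looks like this in nutch-site.xml
(using the days-based property name above; if I remember correctly the newer
equivalent, db.fetch.interval.default, is expressed in seconds, so 15 days
would be 1296000 there):

<property>
  <name>db.default.fetch.interval</name>
  <value>15</value>
</property>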

Thanks.


Re: Nutch configuration - V1 vs V2 differences

2014-11-12 Thread Meraj A. Khan
I installed it by copying the files to conf directory, never tried without
that step to confirm if the copying is really needed.
On Nov 12, 2014 6:24 AM, mikejf12 i...@semtech-solutions.co.nz wrote:


 Hi

 I installed two versions of Nutch onto a CentOS 6 Linux Hadoop V1.2.1
 cluster. I didn't have any issues in using them, but I noticed a
 difference.

 I installed the src version of apache-nutch-1.8-src; the instructions that
 I followed advised that the Hadoop configuration files be copied to the
 Nutch conf directory.

 I also installed the non-source release apache-nutch-2.2.1, which didn't
 require this.

 It's been a while since I did this, and I wondered whether the step to copy
 the Hadoop config files was necessary for the src release?

 cheers



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Nutch-configuration-V1-vs-V2-differences-tp4168893.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: When to delete the segments?

2014-11-03 Thread Meraj A. Khan
I am only indexing the parsed data in Solr, so there is no way for me
to know when to delete a segment in an automated fashion by
considering the parsed data alone. However, I just realized that there
is a _SUCCESS file being created within the segment once it is
fetched. I will use that as an indicator to automate the deletion of
the segment folders.
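
A rough sketch of the kind of check I have in mind (the segment path is a
placeholder, and per the advice below I would still make sure updatedb and
indexing for that segment have finished before removing it):

SEG=crawl/segments/20141103123456
# delete the segment only if the _SUCCESS marker mentioned above is present
if hadoop fs -test -e "$SEG/_SUCCESS"; then
  hadoop fs -rm -r "$SEG"
fi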



On Mon, Nov 3, 2014 at 12:56 AM, remi tassing tassingr...@gmail.com wrote:
 If you are able to determine what is done with the parsed data, then you
 could delete the segment as soon as that job is completed.

 As I mentioned earlier, if the data is to be pushed to Solr (e.g. with
 bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT),
 then after indexing is done you can get rid of the segment

 On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan mera...@gmail.com wrote:

 Thanks .

 How do I definitively determine whether a segment has been completely
 parsed, if I were to set up an hourly crontab to delete the segments
 from HDFS? I have seen that the presence of the crawl_parse directory
 in the segments directory at least indicates that the parsing has
 started, but I think the directory would be created as soon as the
 parsing begins.

 So as to not delete a segment prematurely, while it is still being
 fetched, what should I be looking for in my script?

 On Sun, Nov 2, 2014 at 7:58 PM, remi tassing tassingr...@gmail.com
 wrote:
  The next fetching time is computed after updatedb is issued with that
  segment
 
  So as long as you don't need the parsed data anymore then you can delete
  the segment (e.g. after indexing through Solr...).
 
 
 
  On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:
 
  Hi All,
 
  I am deleting the segments as soon as they are fetched and parsed. I
  have read in previous posts that it is safe to delete a segment
  only if it is older than db.default.fetch.interval; my
  understanding is that one does not have to wait for the segment to be
  older than db.default.fetch.interval, but can delete it as soon as the
  segment is parsed.

  Is my understanding correct? I want to delete the segments as soon as
  possible so as to save as much disk space as possible.
 
  Thanks.
 



Re: Reduce phase in Fetcher taking excessive time to finish.

2014-11-02 Thread Meraj A. Khan
Julien,


 Do we need to
 consider any data loss(URLs) in this scenario ?


 no, why?

Thank you for confirming.







 J.








 On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

  Hi Meraj
 
  You can control the # of URLs per segment with
 
  <property>
    <name>generate.max.count</name>
    <value>-1</value>
    <description>The maximum number of urls in a single
    fetchlist. -1 if unlimited. The urls are counted according
    to the value of the parameter generator.count.mode.
    </description>
  </property>

  <property>
    <name>generate.count.mode</name>
    <value>host</value>
    <description>Determines how the URLs are counted for generator.max.count.
    Default value is 'host' but can be 'domain'. Note that we do not count
    per IP in the new version of the Generator.
    </description>
  </property>
 
  the urls are grouped into inputs for the map tasks accordingly.
 
  Julien
 
 
 
 
 
  On 26 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote:
 
   Julien,
  
   On further analysis , I found that it was not a delay at reduce time ,
  but
   a long running fetch map task , when I have multiple fetch map tasks
   running on a single segment , I see  that one of the map tasks runs
 for a
   excessively longer period of time than the other fetch map tasks ,it
  seems
   this is happening because of the disproportionate distribution of urls
  per
   map task, meaning if I have topN of 10,00,000 and 10 fetch map tasks ,
 it
   seems its not guaranteed that each fetch map tasks will have 100,000
 urls
   to fetch.
  
   Is is possible to set the an upper limit on the max number of URLs per
   fetch map task, along with the collective topN for the whole Fetch
 phase
  ?
  
   Thanks,
   Meraj.
  
   On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche 
   lists.digitalpeb...@gmail.com wrote:
  
Hi Meraj,
   
What do the logs for the map tasks tell you about the URLs being
  fetched?
   
J.
   
On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote:
   
 Julien,

 Thanks for your suggestion , I looked at the jstack thread dumps ,
  and
   I
 could see that the fetcher threads are in a waiting state and
  actually
the
 map phase is not yet complete looking at the JobClient console.

 14/10/15 12:09:48 INFO mapreduce.Job:  map 95% reduce 31%
 14/10/16 07:11:20 INFO mapreduce.Job:  map 96% reduce 31%
 14/10/17 01:20:56 INFO mapreduce.Job:  map 97% reduce 31%

 And the following is the kind of statements I see in the jstack
  thread
 dump  for Hadoop child processes, is it possible that these map
 tasks
   are
 actually waiting on a particular host with some excessive
 crawl-delay
   , I
 already had the fetcher.threads.per.queue to 5 ,
 fetcher.server.delay
   to
0,
 fetcher.max.crawl.delay to 10  and http.max.delays to 1000 .

 Please see the jstack  log info for the child processes below.

 Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed
mode):

 Attach Listener daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5
waiting
 on condition [0x]
java.lang.Thread.State: RUNNABLE

 IPC Client (638223659) connection to /170.75.153.162:40980 from
 job_1413149941617_0059 daemon prio=10 tid=0x01a5c000
  nid=0xce8
in
 Object.wait() [0x7fecdf80e000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x99f8bf48 (a
 org.apache.hadoop.ipc.Client$Connection)
 at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899)
 - locked 0x99f8bf48 (a
 org.apache.hadoop.ipc.Client$Connection)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944)

 fetcher#5 daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in
 Object.wait() [0x7fecdf90f000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x99f62a68 (a
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
 at java.lang.Object.wait(Object.java:503)
 at


   
  
 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
 - locked 0x99f62a68 (a
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
 at

 org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)

 fetcher#4 daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in
 Object.wait() [0x7fecdfa1]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x99f62a68 (a
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
 at java.lang.Object.wait(Object.java:503

When to delete the segments?

2014-11-02 Thread Meraj A. Khan
Hi All,

I am deleting the segments as soon as they are fetched and parsed. I
have read in previous posts that it is safe to delete a segment
only if it is older than db.default.fetch.interval; my
understanding is that one does not have to wait for the segment to be
older than db.default.fetch.interval, but can delete it as soon as the
segment is parsed.

Is my understanding correct? I want to delete the segments as soon as
possible so as to save as much disk space as possible.

Thanks.


Re: When to delete the segments?

2014-11-02 Thread Meraj A. Khan
Thanks .

How do I definitively determine whether a segment has been completely
parsed, if I were to set up an hourly crontab to delete the segments
from HDFS? I have seen that the presence of the crawl_parse directory
in the segments directory at least indicates that the parsing has
started, but I think the directory would be created as soon as the
parsing begins.

So as to not delete a segment prematurely, while it is still being
fetched, what should I be looking for in my script?

On Sun, Nov 2, 2014 at 7:58 PM, remi tassing tassingr...@gmail.com wrote:
 The next fetching time is computed after updatedb is issued with that
 segment

 So as long as you don't need the parsed data anymore then you can delete
 the segment (e.g. after indexing through Solr...).



 On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:

 Hi All,

 I am deleting the segments as soon as they are fetched and parsed. I
 have read in previous posts that it is safe to delete a segment
 only if it is older than db.default.fetch.interval; my
 understanding is that one does not have to wait for the segment to be
 older than db.default.fetch.interval, but can delete it as soon as the
 segment is parsed.

 Is my understanding correct? I want to delete the segments as soon as
 possible so as to save as much disk space as possible.

 Thanks.



bin/Crawl script loosing status updates from the MR job.

2014-10-30 Thread Meraj A. Khan
Hi All,

I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN
by redirecting its output to a log file as shown below.

/opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1 &

The issue I am facing is that, randomly, when this script is running a job it
loses track of the progress updates (like "Map 80% Reduce 67%") and gets
stuck there, while in the meantime the job completes successfully and the
script keeps waiting for further updates; as a result the looping of
generate-fetch-update jobs gets terminated prematurely.

This is so random that I am not able to figure out a particular pattern to
this issue, and I end up restarting the script every so often. Sometimes
this happens in a job as short in duration as the inject phase of Nutch.

Just wondering if anyone has faced this issue? Is the fact that I am
redirecting the output to a logfile playing a part in this? What are the
best practices for running a long-running script like bin/crawl? I am
using CentOS 7.x.
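
(One thing I am considering, just as a guess and not a confirmed fix: fully
detaching the script from the terminal session, e.g.

nohup /opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 \
    > /tmp/nutch.log 2>&1 &

in case the stalled progress output is related to the controlling terminal.)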

Thanks.


Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-30 Thread Meraj A. Khan
Thanks for the info, Julien. For the hypothetical example below:

topN = 200,000
generate.max.count = 10,000
generate.count.mode = host

If the number of hosts is 10, and we assume that each one of those hosts
has more than 10,000 unfetched URLs in the CrawlDB, then since we have set
generate.max.count to 10,000 exactly 100,000 URLs would be fetched.

Would the remaining URLs be fetched in the next cycle? Do we need to
consider any data loss (URLs) in this scenario?





On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi Meraj

 You can control the # of URLs per segment with

 <property>
   <name>generate.max.count</name>
   <value>-1</value>
   <description>The maximum number of urls in a single
   fetchlist. -1 if unlimited. The urls are counted according
   to the value of the parameter generator.count.mode.
   </description>
 </property>

 <property>
   <name>generate.count.mode</name>
   <value>host</value>
   <description>Determines how the URLs are counted for generator.max.count.
   Default value is 'host' but can be 'domain'. Note that we do not count
   per IP in the new version of the Generator.
   </description>
 </property>

 the urls are grouped into inputs for the map tasks accordingly.

 Julien





 On 26 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote:

  Julien,
 
  On further analysis, I found that it was not a delay at reduce time but
  a long-running fetch map task. When I have multiple fetch map tasks
  running on a single segment, I see that one of the map tasks runs for an
  excessively longer period of time than the other fetch map tasks. It
  seems this is happening because of the disproportionate distribution of
  URLs per map task, meaning that if I have a topN of 1,000,000 and 10
  fetch map tasks, it seems it is not guaranteed that each fetch map task
  will have 100,000 URLs to fetch.

  Is it possible to set an upper limit on the max number of URLs per
  fetch map task, along with the collective topN for the whole Fetch phase?
 
  Thanks,
  Meraj.
 
  On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche 
  lists.digitalpeb...@gmail.com wrote:
 
   Hi Meraj,
  
   What do the logs for the map tasks tell you about the URLs being
 fetched?
  
   J.
  
   On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote:
  
Julien,
   
Thanks for your suggestion , I looked at the jstack thread dumps ,
 and
  I
could see that the fetcher threads are in a waiting state and
 actually
   the
map phase is not yet complete looking at the JobClient console.
   
14/10/15 12:09:48 INFO mapreduce.Job:  map 95% reduce 31%
14/10/16 07:11:20 INFO mapreduce.Job:  map 96% reduce 31%
14/10/17 01:20:56 INFO mapreduce.Job:  map 97% reduce 31%
   
And the following is the kind of statements I see in the jstack
 thread
dump  for Hadoop child processes, is it possible that these map tasks
  are
actually waiting on a particular host with some excessive crawl-delay
  , I
already had the fetcher.threads.per.queue to 5 , fetcher.server.delay
  to
   0,
fetcher.max.crawl.delay to 10  and http.max.delays to 1000 .
   
Please see the jstack  log info for the child processes below.
   
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed
   mode):
   
Attach Listener daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5
   waiting
on condition [0x]
   java.lang.Thread.State: RUNNABLE
   
IPC Client (638223659) connection to /170.75.153.162:40980 from
job_1413149941617_0059 daemon prio=10 tid=0x01a5c000
 nid=0xce8
   in
Object.wait() [0x7fecdf80e000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x99f8bf48 (a
org.apache.hadoop.ipc.Client$Connection)
at
   org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899)
- locked 0x99f8bf48 (a
org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944)
   
fetcher#5 daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in
Object.wait() [0x7fecdf90f000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x99f62a68 (a
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
at java.lang.Object.wait(Object.java:503)
at
   
   
  
 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
- locked 0x99f62a68 (a
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
at
org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
   
fetcher#4 daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in
Object.wait() [0x7fecdfa1]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method

Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-10-26 Thread Meraj A. Khan
Fred,

In my last email on this topic, I mentioned that I am using a single
segment and multiple fetch map tasks, and also the changes that I had to
make to Nutch 1.7 to make that possible on YARN.

Let me know if you cannot find it and I'll resend those details.

Meraj.

On Fri, Oct 24, 2014 at 4:17 PM, Fred frederic.luddeni+nab...@gmail.com
wrote:

 Hi Mak,

 Please can you give me details on your changes please? I have the same
 issue.

 Thanks you in advance,
 Regards,



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Generate-multiple-segments-in-Generate-phase-and-have-multiple-Fetch-map-tasks-in-parallel-tp4161005p4165766.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-26 Thread Meraj A. Khan
Julien,

On further analysis, I found that it was not a delay at reduce time but
a long-running fetch map task. When I have multiple fetch map tasks
running on a single segment, I see that one of the map tasks runs for an
excessively longer period of time than the other fetch map tasks. It seems
this is happening because of the disproportionate distribution of URLs per
map task, meaning that if I have a topN of 1,000,000 and 10 fetch map tasks,
it seems it is not guaranteed that each fetch map task will have 100,000 URLs
to fetch.

Is it possible to set an upper limit on the max number of URLs per
fetch map task, along with the collective topN for the whole Fetch phase?

Thanks,
Meraj.

On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi Meraj,

 What do the logs for the map tasks tell you about the URLs being fetched?

 J.

 On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote:

  Julien,
 
  Thanks for your suggestion , I looked at the jstack thread dumps , and I
  could see that the fetcher threads are in a waiting state and actually
 the
  map phase is not yet complete looking at the JobClient console.
 
  14/10/15 12:09:48 INFO mapreduce.Job:  map 95% reduce 31%
  14/10/16 07:11:20 INFO mapreduce.Job:  map 96% reduce 31%
  14/10/17 01:20:56 INFO mapreduce.Job:  map 97% reduce 31%
 
  And the following is the kind of statements I see in the jstack thread
  dump  for Hadoop child processes, is it possible that these map tasks are
  actually waiting on a particular host with some excessive crawl-delay , I
  already had the fetcher.threads.per.queue to 5 , fetcher.server.delay to
 0,
  fetcher.max.crawl.delay to 10  and http.max.delays to 1000 .
 
  Please see the jstack  log info for the child processes below.
 
  Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed
 mode):
 
  Attach Listener daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5
 waiting
  on condition [0x]
 java.lang.Thread.State: RUNNABLE
 
  IPC Client (638223659) connection to /170.75.153.162:40980 from
  job_1413149941617_0059 daemon prio=10 tid=0x01a5c000 nid=0xce8
 in
  Object.wait() [0x7fecdf80e000]
 java.lang.Thread.State: TIMED_WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  - waiting on 0x99f8bf48 (a
  org.apache.hadoop.ipc.Client$Connection)
  at
 org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899)
  - locked 0x99f8bf48 (a
  org.apache.hadoop.ipc.Client$Connection)
  at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944)
 
  fetcher#5 daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in
  Object.wait() [0x7fecdf90f000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  - waiting on 0x99f62a68 (a
  org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
  at java.lang.Object.wait(Object.java:503)
  at
 
 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
  - locked 0x99f62a68 (a
  org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
  at
  org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
 
  fetcher#4 daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in
  Object.wait() [0x7fecdfa1]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  - waiting on 0x99f62a68 (a
  org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
  at java.lang.Object.wait(Object.java:503)
  at
 
 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
  - locked 0x99f62a68 (a
  org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
  at
  org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
 
  fetcher#3 daemon prio=10 tid=0x7fecf8c45800 nid=0xce5 in
  Object.wait() [0x7fecdfb11000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  - waiting on 0x99f62a68 (a
  org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
  at java.lang.Object.wait(Object.java:503)
  at
 
 
 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368)
  - locked 0x99f62a68 (a
  org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
  at
  org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161)
 
  fetcher#2 daemon prio=10 tid=0x7fecf8c31800 nid=0xce4 in
  Object.wait() [0x7fecdfc12000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  - waiting on 0x99f62a68 (a
  org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
  at java.lang.Object.wait(Object.java:503

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-17 Thread Meraj A. Khan
=0xc69
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

Signal Dispatcher daemon prio=10 tid=0x7fecf8095000 nid=0xc68
runnable [0x]
   java.lang.Thread.State: RUNNABLE

Finalizer daemon prio=10 tid=0x7fecf807e000 nid=0xc60 in
Object.wait() [0x7fecec83c000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x99a1e040 (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked 0x99a1e040 (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

Reference Handler daemon prio=10 tid=0x7fecf807a000 nid=0xc5f in
Object.wait() [0x7fecec93d000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x99aeb3e0 (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked 0x99aeb3e0 (a java.lang.ref.Reference$Lock)

main prio=10 tid=0x7fecf800f800 nid=0xc4f in Object.wait()
[0x7fed00948000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x99f62a68 (a
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
at
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.waitUntilDone(ShuffleSchedulerImpl.java:443)
- locked 0x99f62a68 (a
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl)
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:129)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

VM Thread prio=10 tid=0x7fecf8077800 nid=0xc5e runnable

GC task thread#0 (ParallelGC) prio=10 tid=0x7fecf8025800 nid=0xc54
runnable

GC task thread#1 (ParallelGC) prio=10 tid=0x7fecf8027000 nid=0xc55
runnable

GC task thread#2 (ParallelGC) prio=10 tid=0x7fecf8029000 nid=0xc56
runnable

GC task thread#3 (ParallelGC) prio=10 tid=0x7fecf802b000 nid=0xc57
runnable

GC task thread#4 (ParallelGC) prio=10 tid=0x7fecf802c800 nid=0xc58
runnable

GC task thread#5 (ParallelGC) prio=10 tid=0x7fecf802e800 nid=0xc59
runnable

GC task thread#6 (ParallelGC) prio=10 tid=0x7fecf8030800 nid=0xc5a
runnable

GC task thread#7 (ParallelGC) prio=10 tid=0x7fecf8032800 nid=0xc5b
runnable

VM Periodic Task Thread prio=10 tid=0x7fecf80af800 nid=0xc6c waiting
on condition

JNI global references: 255



On Thu, Oct 16, 2014 at 5:20 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi Meraj

 You could call jstack on the Java process a couple of times to see what it
 is busy doing; that will be a simple way of checking that this is indeed
 the source of the problem.
 See https://issues.apache.org/jira/browse/NUTCH-1314 for a possible
 solution

 J.

 On 16 October 2014 06:08, Meraj A. Khan mera...@gmail.com wrote:

  Hi All,
 
  I am running into a situation where the reduce phase of the fetch job,
  with parsing enabled at fetch time, is taking an excessively long amount
  of time. I have seen recommendations to filter URLs based on length to
  avoid normalization-related delays; I am not filtering any URLs based on
  length. Could that be an issue?
 
  Can anyone share if they faced this issue and what the resolution was, I
 am
  running Nutch 1.7 on Hadoop YARN.
 
  The issue was previously inconclusively discussed here.
 
 
 
 http://markmail.org/message/p6dzvvycpfzbaugr#query:+page:1+mid:p6dzvvycpfzbaugr+state:results
 
  Thanks.
 



 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble



Reduce phase in Fetcher taking excessive time to finish.

2014-10-15 Thread Meraj A. Khan
Hi All,

I am running into a situation where the reduce phase of the fetch job, with
parsing enabled at fetch time, is taking an excessively long amount of
time. I have seen recommendations to filter URLs based on length to
avoid normalization-related delays; I am not filtering any URLs based on
length. Could that be an issue?

Can anyone share whether they have faced this issue and what the resolution
was? I am running Nutch 1.7 on Hadoop YARN.

The issue was previously inconclusively discussed here.

http://markmail.org/message/p6dzvvycpfzbaugr#query:+page:1+mid:p6dzvvycpfzbaugr+state:results

Thanks.


Re: Generated Segment Too Large

2014-10-07 Thread Meraj A. Khan
Markus,

I have been using Nutch for a while, but I wasn't clear about this issue;
thank you for reminding me that this is Nutch 101 :)

I will go ahead and use topN as the segment size control mechanism,
although I have one question regarding topN: if I have a topN value of
1000 and there are more than topN URLs, let's say 2000, that are
unfetched at that point in time, the remaining 1000 would be addressed in
the subsequent fetch cycle, meaning nothing is discarded or left unfetched?
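
For reference, the strict cap Markus describes below would then just be the
-topN argument to the generate job, roughly (paths and numbers are
placeholders):

bin/nutch generate crawl/crawldb crawl/segments -topN 100000

with generate.max.count/generate.count.mode in nutch-site.xml acting as the
per-host (or per-domain) limit on top of that.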





On Tue, Oct 7, 2014 at 3:46 AM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - you have been using Nutch for some time already, so aren't you already
 familiar with the generate.max.count configuration directive, possibly
 combined with the -topN parameter for the Generator job? With
 generate.max.count the segment size depends on the number of distinct hosts
 or domains, so it is not really trustworthy; the topN parameter is really
 strict.

 Markus



 -Original message-
  From:Meraj A. Khan mera...@gmail.com
  Sent: Tuesday 7th October 2014 5:54
  To: user@nutch.apache.org
  Subject: Generated Segment Too Large
 
  Hi Folks,
 
  I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
  controlling the segment size, and since a single segment is being created
  which is very large for the capacity of my Hadoop cluster (I have
  available storage of ~3TB), and since Hadoop generates the spill*.out
  files for this large segment which gets fetched for days, I am running out
  of disk space.

  I figured that if the segment size were controlled, then for each segment
  the spill files would be deleted after the job for that segment was
  completed, giving me more efficient use of the disk space.

  I would like to know how I can generate multiple segments of a certain
  size (or just a fixed number) at each depth iteration.

  Right now it looks like Generator.java needs to be modified, as it
  does not consider the number of segments. Is that the right approach? If
  so, can you please give me a few pointers on what logic I should be
  changing? If this is not the right approach, I would be happy to know if
  there is any way to control the number as well as the size of the
  generated segments using the configuration/job submission parameters.
 
  Thanks for your help!
 



Generated Segment Too Large

2014-10-06 Thread Meraj A. Khan
Hi Folks,

I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
controlling the segment size, and since a single segment is being created
which is very large for the capacity of my Hadoop cluster (I have
available storage of ~3TB), and since Hadoop generates the spill*.out files
for this large segment which gets fetched for days, I am running out of
disk space.

I figured that if the segment size were controlled, then for each segment
the spill files would be deleted after the job for that segment was
completed, giving me more efficient use of the disk space.

I would like to know how I can generate multiple segments of a certain size
(or just a fixed number) at each depth iteration.

Right now it looks like Generator.java needs to be modified, as it
does not consider the number of segments. Is that the right approach? If
so, can you please give me a few pointers on what logic I should be changing?
If this is not the right approach, I would be happy to know if there is
any way to control the number as well as the size of the generated
segments using the configuration/job submission parameters.

Thanks for your help!


Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-09-25 Thread Meraj A. Khan
Just wanted to update and let everyone know that this issue with a single map
task for fetch was occurring because Generator.java had logic around the MRV1
property *mapred.job.tracker*. I had to change that logic since I am
running this on YARN, and now multiple fetch tasks operate on a single
segment.

Also, I misunderstood that multiple segments would need to be generated to
achieve parallelism; that does not seem to be the case. Parallelism at
fetch time is achieved by having multiple fetch tasks operate on a single
segment.

Thanks everyone for your help on resolving this issue.



On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan mera...@gmail.com wrote:

 Folks,

 As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
 cluster.

 In order to scale, I would need to fetch concurrently with multiple map
 tasks on multiple nodes. I think that the first step would be to
 generate multiple segments in the generate phase so that multiple fetch map
 tasks can operate in parallel. In order to generate multiple segments
 at generate time I have made the following changes, but unfortunately I
 have been unsuccessful in doing so.

  I have tweaked the following parameters in bin/crawl to do so: I added the
  *maxNumSegments* and *numFetchers* parameters to the call to generate in
  the *bin/crawl* script, as can be seen below.


 *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
 $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers
 -noFilter*

 (Here $numFetchers has a value of 15)

  *generate.max.count*, *generate.count.mode* and *topN* are all at their
  default values, meaning I am not providing any values for them.
 
  Also, the crawldb status before the generate phase is shown below. It shows
  that the number of unfetched URLs is more than *75 million*, so it is not
  that there are too few URLs for generate to create multiple segments.

 * CrawlDB status*
 * db_fetched=318708*
 * db_gone=4774*
 * db_notmodified=2274*
 * db_redir_perm=2253*
 * db_redir_temp=2527*
 * db_unfetched=7524*

  However, I consistently see this message in the logs during the generate
  phase:
 
   *Generator: jobtracker is 'local', generating exactly one partition.*
 
  Is this one partition referring to the single segment that is going to be
  generated? If so, how do I address this?
 
  I feel like I have exhausted all the options, but I am unable to have the
  generate phase produce more than one segment at a time.
 
  Can someone let me know if there is anything else that I should be trying
  here?

 *Thanks and any help is much appreciated!*





Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-09-24 Thread Meraj A. Khan
Folks,

As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
cluster.

In order to scale, I would need to fetch concurrently with multiple map tasks
on multiple nodes. I think the first step is to generate multiple segments in
the generate phase so that multiple fetch map tasks can operate in parallel,
and in order to generate multiple segments at generate time I have made the
following changes, but unfortunately I have been unsuccessful.

I have tweaked the following parameters in bin/crawl to do so: I added the
*maxNumSegments* and *numFetchers* parameters to the call to generate in the
*bin/crawl* script, as can be seen below.


*$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
$CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers
-noFilter*

(Here $numFetchers has a value of 15)

*generate.max.count*, *generate.count.mode* and *topN* are all at their
default values, meaning I am not providing any values for them.

Also, the crawldb status before the generate phase is shown below. It shows
that the number of unfetched URLs is more than *75 million*, so it is not
that there are too few URLs for generate to create multiple segments.

* CrawlDB status*
* db_fetched=318708*
* db_gone=4774*
* db_notmodified=2274*
* db_redir_perm=2253*
* db_redir_temp=2527*
* db_unfetched=7524*

However, I consistently see this message in the logs during the generate
phase:

 *Generator: jobtracker is 'local', generating exactly one partition.*

Is this one partition referring to the single segment that is going to be
generated? If so, how do I address this?

I feel like I have exhausted all the options, but I am unable to have the
generate phase produce more than one segment at a time.

Can someone let me know if there is anything else that I should be trying
here?

*Thanks and any help is much appreciated!*


Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Hi Edoardo,

How do you generate the multiple segments at the time of generate phase?
On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com
wrote:

 Hi all,

 I’m building an Oozie workflow to schedule the generate, fetch, etc…
 workflow. Right now I'm trying to feed the list of generated segments into
 the following fetch stage.

 The “crawl” script assumes that the most recently added segment is
 un-fetched and does some hdfs shell scripting to determine its name and
 stuff this into a shell variable, but I’d like to avoid this and somehow
 feed the list of generated segments directly into the following step.

  I have the feeling that I could use the Oozie “capture data from action”
  option, but I think that will require fiddling with the Generator class
  source; that’s OK, but I’m a bit wary of adding custom code that may not be
  part of the core distribution. Has anyone already done something similar,
  preferably without touching the source? (e.g.
  http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it
  now 404s on GitHub)


 Best,
 Edoardo

 --
 Edoardo Causarano
 Sent with Airmail


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Markus, I have used maxNumSegments but no luck; is it driven by the size of
the segment instead?
On Sep 22, 2014 9:28 AM, Markus Jelsma markus.jel...@openindex.io wrote:

  You can use maxNumSegments to generate more than one segment. And instead
  of passing a list of segment names around, why not just loop over the
  entire directory and move finished segments to another directory?
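
A minimal shell sketch of the loop Markus describes, assuming segments live
under crawl/segments and finished ones are parked in a hypothetical
crawl/segments_done directory:

  hadoop fs -mkdir -p crawl/segments_done
  # Fetch and parse every segment currently present, then move it aside so
  # the next pass only sees unfinished segments.
  for seg in $(hadoop fs -ls crawl/segments | awk '{print $NF}' | grep 'crawl/segments/'); do
    bin/nutch fetch "$seg" -threads 50
    bin/nutch parse "$seg"
    hadoop fs -mv "$seg" crawl/segments_done/
  done

This avoids passing segment names between workflow steps at the cost of an
extra directory listing per iteration.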



 -Original message-
  From:Edoardo Causarano edoardo.causar...@gmail.com
  Sent: Monday 22nd September 2014 15:25
  To: user@nutch.apache.org
  Subject: Re: get generated segments from step / fetch all empty segments
 
  Hi Meraj,
 
  At the moment I’m not, but in the Generator job class the method
  “generate” does return a list of Paths, so the possibility is there
  (somehow). For now I’m concentrating on passing at least one segment name
  from one step to the other; then I’ll see if and how I can get more.
 
 
  Best,
  Edoardo
 
 
  On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com)
 wrote:
 
  Hi Edoardo,
 
  How do you generate the multiple segments at the time of generate phase?
  On Sep 22, 2014 6:01 AM, Edoardo Causarano 
 edoardo.causar...@gmail.com
  wrote:
 
   Hi all,
  
   I’m building an Oozie workflow to schedule the generate, fetch, etc…
   workflow. Right now I'm trying to feed the list of generated segments
 into
   the following fetch stage.
  
   The “crawl” script assumes that the most recently added segment is
   un-fetched and does some hdfs shell scripting to determine its name and
   stuff this into a shell variable, but I’d like to avoid this and
 somehow
   feed the list of generated segments directly into the following step.
  
   I have the feeling that I could use the Oozie “capture data from action”
   option but I think that will require fiddling with the Generator class
   source; that’s OK, but I’m a bit wary of adding custom code that may
 not be
   part of the core distribution. Has anyone already done something
 similar,
   preferably without touching the source? (e.g.
   http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch
 but it
   now 404s on GitHub)
  
  
   Best,
   Edoardo
  
   --
   Edoardo Causarano
   Sent with Airmail
  --
  Edoardo Causarano
  Sent with Airmail



RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Thanks Markus, is that “enough URLs” threshold driven by the HDFS block size?

Edoardo, sorry for hijacking your thread. :(
On Sep 22, 2014 9:35 AM, Markus Jelsma markus.jel...@openindex.io wrote:

  Hi - it will only generate more segments when there are enough URLs to
  generate, combined with either topN or generate.count.mode and
  generate.max.count.

 -Original message-
  From:Meraj A. Khan mera...@gmail.com
  Sent: Monday 22nd September 2014 15:33
  To: user@nutch.apache.org
  Subject: RE: get generated segments from step / fetch all empty segments
 
   Markus, I have used maxNumSegments but no luck; is it driven by the
   size of the segment instead?
  On Sep 22, 2014 9:28 AM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
   You can use maxNumSegments to generate more than one segment. And
 instead
   of passing a list of segment names around, why not just loop over the
   entire directory, and move finished segments to another.
  
  
  
   -Original message-
From:Edoardo Causarano edoardo.causar...@gmail.com
Sent: Monday 22nd September 2014 15:25
To: user@nutch.apache.org
Subject: Re: get generated segments from step / fetch all empty
 segments
   
Hi Meraj,
   
at the moment I’m not, but in the Generator job class the method
   “generate” does return a list of Paths therefore the possibility is
 there
   (somehow.) For now I’m concentrating on passing at least 1 segment name
   from one step to the other, then I’ll see if and how I can get more.
   
   
Best,
Edoardo
   
   
On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com)
   wrote:
   
Hi Edoardo,
   
How do you generate the multiple segments at the time of generate
 phase?
On Sep 22, 2014 6:01 AM, Edoardo Causarano 
   edoardo.causar...@gmail.com
wrote:
   
 Hi all,

 I’m building an Oozie workflow to schedule the generate, fetch,
 etc…
 workflow. Right now I'm trying to feed the list of generated
 segments
   into
 the following fetch stage.

 The “crawl” script assumes that the most recently added segment is
 un-fetched and does some hdfs shell scripting to determine its
 name and
 stuff this into a shell variable, but I’d like to avoid this and
   somehow
 feed the list of generated segments directly into the following
 step.

  I have the feeling that I could use the Oozie “capture data from
 action”
 option but I think that will require fiddling with the Generator
 class
  source; that’s OK, but I’m a bit wary of adding custom code that
 may
   not be
 part of the core distribution. Has anyone already done something
   similar,
 preferably without touching the source? (e.g.
 http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch
   but it
 now 404s on GitHub)


 Best,
 Edoardo

 --
 Edoardo Causarano
 Sent with Airmail
--
Edoardo Causarano
Sent with Airmail
  
 



Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Meraj A. Khan
Julien,

How would you achieve parallelism then on a Hadoop cluster? Am I missing
something here? My understanding was that we could scale the crawl by
allowing the fetch to happen in multiple map tasks on multiple nodes in a
Hadoop cluster; otherwise I am stuck sequentially crawling a large set of
URLs spread across multiple domains.

If that is indeed the way to scale the crawl, then we would need to generate
multiple segments at generate time so that these could be fetched in
parallel.

So I guess I really need help with:


   1. Making the generate phase generate multiple segments.
   2. Being able to fetch these segments in parallel.


Can you please let me know if my approach to scaling the crawl sounds right
to you?


Thanks; all the help I have gotten so far is much appreciated.



On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 The fetching operates segment by segment and won't fetch more than one at
 the same time. You can get the generation step to build multiple segments
 in one go but you'd need to modify the script so that the fetching step is
 called as many times as you have segments + you'd probably need to add some
 logic for detecting that they've all finished before you move on to the
 update step.
 Out of curiosity : why do you want to fetch multiple segments at the same
 time?
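
For what it's worth, a rough sketch of the script change Julien describes,
assuming a generate run has already produced several segments under
$CRAWL_PATH/segments (the thread count and time limit are illustrative):

  # Launch one fetch job per generated segment in the background, wait for
  # all of them to finish, then run a single updatedb over the directory.
  for seg in $(hadoop fs -ls "$CRAWL_PATH/segments" | awk '{print $NF}' | grep 'segments/'); do
    bin/nutch fetch -D fetcher.timelimit.mins=180 "$seg" -threads 50 &
  done
  wait
  bin/nutch updatedb "$CRAWL_PATH/crawldb" -dir "$CRAWL_PATH/segments"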

 On 19 September 2014 06:00, Meraj A. Khan mera...@gmail.com wrote:

  Hello Folks,
 
  I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
 
  Based on Julien's suggestion I am using the bin/crawl script and did the
  following tweaks to trigger a fetch with multiple map tasks , however I
 am
  unable to do so.
 
  1. Added maxNumSegments and numFetchers parameters to the generate phase.
  $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
 $CRAWL_PATH/segments
  -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
 
  2. Removed the topN parameter and removed the noParsing parameter because
  I want the parsing to happen at the time of fetch.
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
  $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
 
  The generate phase is not generating more than one segment.
 
  As a result, the fetch phase is not creating multiple map tasks. I also
  believe that, the way the script is written, it does not allow the fetch to
  fetch multiple segments in parallel even if the generate were to generate
  multiple segments.
 
  Can someone please let me know how they got the script to run in a
  distributed Hadoop cluster? Or is there a different version of the script
  that should be used?
 
  Thanks.
 



 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble



Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Meraj A. Khan
Jake,

I am not sure how to make that happen. Every time I run the Nutch 1.7 job
on YARN, I see a single segment being generated and a single map task
being launched, under-utilizing the capacity of the cluster and slowing the
crawl.

Are you suggesting I should be seeing multiple fetch map tasks for a
single segment? If so, I am not.

Thanks.
On Sep 19, 2014 5:13 PM, Jake Dodd j...@ontopic.io wrote:

 Hi Meraj,

 Nutch and Hadoop abstract all of that for you, so you don’t need to worry
 about it. When you execute the fetch command for a segment, it will be
 parallelized across the nodes in your cluster.

 Cheers

 Jake

 On Sep 19, 2014, at 1:52 PM, Meraj A. Khan mera...@gmail.com wrote:

  Julien,
 
  How would you achieve parallelism then on a Hadoop cluster , am I missing
  something here? My understanding was that we could scale the crawl  by
  allowing fetch to happen in multiple map tasks in multiple nodes in a
  Hadoop cluster , otherwise I am stuck in sequentially crawling a large
 set
  of URLs spread across multiple domains.
 
  If that is indeed the way to scale the crawl , then we would need to
  generate multiple segments at the generate time so that these could be
   fetched in parallel.
 
   So I guess I really need help with:
 
 
1. Making the generate phase generate multiple segments
2. Being able to fetch these segments in parallel.
 
 
  Can you please let me know if my approach to scale the crawl sounds right
  to you ?
 
 
  Thanks and much appreciated, all the help I have gotten so far
 
 
 
  On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche 
  lists.digitalpeb...@gmail.com wrote:
 
  The fetching operates segment by segment and won't fetch more than one
 at
  the same time. You can get the generation step to build multiple
 segments
  in one go but you'd need to modify the script so that the fetching step
 is
  called as many times as you have segments + you'd probably need to add
 some
  logic for detecting that they've all finished before you move on to the
  update step.
  Out of curiosity : why do you want to fetch multiple segments at the
 same
  time?
 
  On 19 September 2014 06:00, Meraj A. Khan mera...@gmail.com wrote:
 
  Hello Folks,
 
   I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop
 YARN.
 
  Based on Julien's suggestion I am using the bin/crawl script and did
 the
  following tweaks to trigger a fetch with multiple map tasks , however I
  am
  unable to do so.
 
  1. Added maxNumSegments and numFetchers parameters to the generate
 phase.
  $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
  $CRAWL_PATH/segments
  -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
 
   2. Removed the topN parameter and removed the noParsing parameter
 because
  I
  want the parsing to happen at the time of fetch.
  $bin/nutch fetch $commonOptions -D
 fetcher.timelimit.mins=$timeLimitFetch
  $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
 
  The generate phase is not generating more than one segment.
 
  And as a result the fetch phase is not creating multiple map tasks,
 also
  I
   believe the way the script is written it does not allow the fetch to
  fetch
   multiple segments in parallel even if the generate were to generate
  multiple segments.
 
   Can someone please let me know how they got the script to run in a
  distributed Hadoop cluster ? Or if there is a different version of
 script
  that should be used?
 
  Thanks.
 
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 




Re: Fetch Job Started Failing on Hadoop Cluster

2014-09-16 Thread Meraj A. Khan
Markus,

Thanks. The issue was that I was setting the PATH variable in the bin/crawl
script; once I removed it and set it outside of the bin/crawl script, it
started working fine.



On Tue, Sep 16, 2014 at 6:39 AM, Markus Jelsma markus.jel...@openindex.io
wrote:

  Hi - you made Nutch believe that
  hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a
  segment, but it is not. So either no segment was created, or it was written
  to the wrong location.

  I don't know what kind of script you are using, but you should check the
  return code of the generator; it gives -1 when no segment was created.

 Markus
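
A hedged sketch of that check, loosely in the spirit of the crawl loop (paths
and topN are illustrative, and only a non-zero exit code is tested since the
exact code may vary):

  bin/nutch generate "$CRAWL_PATH/crawldb" "$CRAWL_PATH/segments" -topN 50000
  rc=$?
  if [ $rc -ne 0 ]; then
    # The generator exits non-zero when no segment was created; stop here
    # instead of handing a non-existent segment to the fetcher.
    echo "Generate returned $rc -- no new segment, stopping."
    exit 0
  fi
  # Pick the newest segment by name rather than assuming only one exists.
  SEGMENT=$(hadoop fs -ls "$CRAWL_PATH/segments" | awk '{print $NF}' | grep 'segments/' | sort | tail -1)
  bin/nutch fetch "$SEGMENT" -threads 50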






 -Original message-
  From:Meraj A. Khan mera...@gmail.com mailto:mera...@gmail.com 
  Sent: Monday 15th September 2014 7:02
  To: user@nutch.apache.org mailto:user@nutch.apache.org
  Subject: Fetch Job Started Failing on Hadoop Cluster
 
  Hello Folks,
 
  My Nutch crawl, which was running fine, started failing in the first fetch
  job/application. I am unable to figure out what is going on here. I have
  attached the last snippet of the log below; can someone please let me know
  what is going on?
 
  What I noticed is that even though the generate phase created a segment,
  20140915004940, the fetch phase is only looking at the top-level segments
  directory for the segments.
 
  Thanks.
 
  14/09/15 00:50:07 INFO crawl.Generator: Generator: finished at 2014-09-15
  00:50:07, elapsed: 00:00:59
  ls: cannot access crawldirectory/segments/: No such file or directory
  Operating on segment :
  Fetching :
  14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: starting at 2014-09-15
  00:50:09
  14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: segment:
  crawldirectory/segments
  14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher Timelimit set for :
  1410767409664
  Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
  /opt/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have disabled
  stack guard. The VM will try to fix the stack guard now.
  It's highly recommended that you fix the library with 'execstack -c
  libfile', or link it with '-z noexecstack'.
  14/09/15 00:50:10 WARN util.NativeCodeLoader: Unable to load
 native-hadoop
  library for your platform... using builtin-java classes where applicable
  14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
  server1.mydomain.com/170.75.152.162:8040
  14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
  server1.mydomain.com/170.75.152.162:8040
  14/09/15 00:50:12 INFO mapreduce.JobSubmitter: Cleaning up the staging
 area
  /tmp/hadoop-yarn/staging/df/.staging/job_1410742329411_0010
  14/09/15 00:50:12 WARN security.UserGroupInformation:
  PriviledgedActionException as:df (auth:SIMPLE)
  cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
  exist: hdfs://
  server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
  14/09/15 00:50:12 WARN security.UserGroupInformation:
  PriviledgedActionException as:df (auth:SIMPLE)
  cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
  exist: hdfs://
  server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
  14/09/15 00:50:12 ERROR fetcher.Fetcher: Fetcher:
  org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist:
  hdfs://
  server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
  at
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
  at
 
 org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
  at
  org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:108)
  at
 
 org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
  at
 
 org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
  at
 
 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
  at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
  at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
  at 

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-08 Thread Meraj A. Khan
AFAIK, the script does not go by the mode you set, but by the presence of the
*nutch*.job file in the directory one level above the script itself, i.e.
../*.job.

Can you please check if you have the Hadoop job file at the appropriate
location?

On Mon, Sep 8, 2014 at 9:22 AM, Simon Z simonz.nu...@gmail.com wrote:

  Thank you very much, Meraj, for your reply; I also thought it was a typo.
 
  I had set numFetchers via numSlaves, and the echo from the generator showed
  that numFetchers is 8 (numTasks=`expr $numSlaves \* 2`, that is 4 times 2),
  but the generator output showed that the run mode is local and that it
  generates exactly one mapper, although I had changed mode=distributed. Any
  idea about this, please?

 Many regards,

 Simon




 On Mon, Sep 8, 2014 at 7:18 AM, Meraj A. Khan mera...@gmail.com wrote:

   I think that is a typo, and it is actually CrawlDirectory. For the single
   map task issue, although I have not tried it yet, we can control the
   number of fetchers with the numFetchers parameter when doing the generate
   via bin/nutch generate.
  On Sep 7, 2014 9:23 AM, Simon Z simonz.nu...@gmail.com wrote:
 
   Hi Julien,
  
    What do you mean by crawlID, please? I am using Nutch 1.8 and followed
    the instructions in the tutorial as mentioned before, and I seem to have
    a similar situation, that is, fetch runs in only one map task. I am
    running on a cluster of four nodes on Hadoop 2.4.1.
  
   Notice that the map task can be assigned to any node, but only one map
  each
   round.
  
   I have set
  
   numSlaves=4
   mode=distributed
  
  
    The seed URL list includes five different websites from different hosts.
  
  
    Are there any settings I missed?
  
   Thanks in advance.
  
   Regards,
  
   Simon
  
  
   On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche 
   lists.digitalpeb...@gmail.com wrote:
  
No, just do 'bin/crawl seedDir crawlID solrURL
 numberOfRounds'
   from
the master node. It internally calls the nutch script for the
  individual
commands, which takes care of sending the job jar to your hadoop
  cluster,
see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
   
   
   
   
On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote:
   
 Sorry Julien , I overlooked the directory names.

 My understanding is that the Hadoop Job is submitted  to a cluster
 by
using
 the following command on the RM node bin/hadoop .job file params

 Are you suggesting I submit the script instead of the Nutch .job
 jar
   like
 below?

 bin/hadoop  bin/crawl seedDir crawlID solrURL
 numberOfRounds


 On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

  As the name runtime/deploy suggest - it is used exactly for that
purpose
  ;-) Just make sure HADOOP_HOME/bin is added to the path and run
 the
 script,
  that's all.
  Look at the bottom of the nutch script for details.
 
  Julien
 
  PS: there will be a Nutch tutorial at the forthcoming ApacheCon
 EU
  (
  http://sched.co/1pbE15n) were we'll cover things like these
 
 
 
  On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote:
 
   Thanks, can this be used on a hadoop cluster?
  
   Sent from my HTC
  
   - Reply message -
   From: Julien Nioche lists.digitalpeb...@gmail.com
   To: user@nutch.apache.org user@nutch.apache.org
   Subject: Nutch 1.7 fetch happening in a single map task.
   Date: Fri, Aug 29, 2014 9:00 AM
  
   See
  

  
 http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
  
   just go to runtime/deploy/bin and run the script from there.
  
   Julien
  
  
   On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com
  wrote:
  
Hi Julien,
   
I have 15 domains and they are all being fetched in a single
  map
task
   which
 does not fetch all the URLs no matter what depth or topN I
  give.
   
I am submitting the Nutch job jar which seems to be using the
  Crawl.java
class, how do I use the Crawl script on a Hadoop cluster, are
   there
 any
pointers you can share?
   
Thanks.
On Aug 29, 2014 4:40 AM, Julien Nioche 
  lists.digitalpeb...@gmail.com
wrote:
   
 Hi Meraj,

 The generator will place all the URLs in a single segment
 if
   all
 they
 belong to the same host for politeness reason. Otherwise it
   will
 use
 whichever value is passed with the -numFetchers parameter
 in
   the
generation
 step.

 Why don't you use the crawl script in /bin instead of
  tinkering
 with
   the
 (now deprecated) Crawl class? It comes with a good default
   configuration
 and should make your life easier.

 Julien

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-07 Thread Meraj A. Khan
I think that is a typo, and it is actually CrawlDirectory. For the single map
task issue, although I have not tried it yet, we can control the number of
fetchers with the numFetchers parameter when doing the generate via bin/nutch
generate.
On Sep 7, 2014 9:23 AM, Simon Z simonz.nu...@gmail.com wrote:

 Hi Julien,

  What do you mean by crawlID, please? I am using Nutch 1.8 and followed the
  instructions in the tutorial as mentioned before, and I seem to have a
  similar situation, that is, fetch runs in only one map task. I am running
  on a cluster of four nodes on Hadoop 2.4.1.

 Notice that the map task can be assigned to any node, but only one map each
 round.

 I have set

 numSlaves=4
 mode=distributed


  The seed URL list includes five different websites from different hosts.


  Are there any settings I missed?

 Thanks in advance.

 Regards,

 Simon


 On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

  No, just do 'bin/crawl seedDir crawlID solrURL numberOfRounds'
 from
  the master node. It internally calls the nutch script for the individual
  commands, which takes care of sending the job jar to your hadoop cluster,
  see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
 
 
 
 
  On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote:
 
   Sorry Julien , I overlooked the directory names.
  
   My understanding is that the Hadoop Job is submitted  to a cluster by
  using
   the following command on the RM node bin/hadoop .job file params
  
   Are you suggesting I submit the script instead of the Nutch .job jar
 like
   below?
  
   bin/hadoop  bin/crawl seedDir crawlID solrURL numberOfRounds
  
  
   On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche 
   lists.digitalpeb...@gmail.com wrote:
  
As the name runtime/deploy suggest - it is used exactly for that
  purpose
;-) Just make sure HADOOP_HOME/bin is added to the path and run the
   script,
that's all.
Look at the bottom of the nutch script for details.
   
Julien
   
PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
 http://sched.co/1pbE15n) where we'll cover things like these
   
   
   
On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote:
   
 Thanks, can this be used on a hadoop cluster?

 Sent from my HTC

 - Reply message -
 From: Julien Nioche lists.digitalpeb...@gmail.com
 To: user@nutch.apache.org user@nutch.apache.org
 Subject: Nutch 1.7 fetch happening in a single map task.
 Date: Fri, Aug 29, 2014 9:00 AM

 See

  
 http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script

 just go to runtime/deploy/bin and run the script from there.

 Julien


 On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote:

  Hi Julien,
 
  I have 15 domains and they are all being fetched in a single map
  task
 which
   does not fetch all the URLs no matter what depth or topN I give.
 
  I am submitting the Nutch job jar which seems to be using the
Crawl.java
  class, how do I use the Crawl script on a Hadoop cluster, are
 there
   any
  pointers you can share?
 
  Thanks.
  On Aug 29, 2014 4:40 AM, Julien Nioche 
lists.digitalpeb...@gmail.com
  wrote:
 
   Hi Meraj,
  
   The generator will place all the URLs in a single segment if
 all
   they
   belong to the same host for politeness reason. Otherwise it
 will
   use
   whichever value is passed with the -numFetchers parameter in
 the
  generation
   step.
  
   Why don't you use the crawl script in /bin instead of tinkering
   with
 the
   (now deprecated) Crawl class? It comes with a good default
 configuration
   and should make your life easier.
  
   Julien
  
  
   On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com
  wrote:
  
Hi All,
   
I am running Nutch 1.7 on Hadoop 2.3.0 cluster and and I
  noticed
that
   there
is only a single reducer in the generate partition job. I am
running
  in
   a
situation where the subsequent fetch is only running in a
  single
map
  task
(I believe as a consequence of a single reducer in the
 earlier
 phase).
   How
can I force Nutch to do fetch in multiple map tasks , is
 there
  a
  setting
   to
force more than one reducers in the generate-partition job to
   have
 more
   map
tasks ?.
   
Please also note that I have commented out the code in
  Crawl.java
to
  not
   do
the LInkInversion phase as , I dont need the scoring of the
  URLS
that
   Nutch
crawls, every URL is equally important to me.
   
Thanks.
   
  
  
  
   --
  
   Open Source Solutions for Text Engineering
  
   http

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-31 Thread Meraj A. Khan
Julien,

Thank you for the decisive advice. Using the crawl script seems to have
solved the problem of the abrupt termination of the crawl; the bin/crawl
script respects the depth and topN parameters and iterates accordingly.

However, I have an issue with the number of map tasks being used for the
fetch phase: it is always 1. I see that the script sets the numFetchers
parameter at generate time equal to the number of slaves, which is 3 in my
case, yet only a single map task is being used, under-utilizing my Hadoop
cluster and slowing down the crawl.

I see that in the crawldb update phase there are millions of 'db_unfetched'
URLs, yet the generate phase only creates a single segment with about
20-30k URLs, and as a result only a single map task is being used for the
fetch phase. I guess I need to make the generate phase generate more than
one segment; how do I do that using the bin/crawl script?

Please note that this is for Nutch 1.7 on Hadoop 2.3.0.

Thanks.


On Fri, Aug 29, 2014 at 10:39 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 No, just do 'bin/crawl seedDir crawlID solrURL numberOfRounds' from
 the master node. It internally calls the nutch script for the individual
 commands, which takes care of sending the job jar to your hadoop cluster,
 see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271




 On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote:

  Sorry Julien , I overlooked the directory names.
 
  My understanding is that the Hadoop Job is submitted  to a cluster by
 using
  the following command on the RM node bin/hadoop .job file params
 
  Are you suggesting I submit the script instead of the Nutch .job jar like
  below?
 
  bin/hadoop  bin/crawl seedDir crawlID solrURL numberOfRounds
 
 
  On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche 
  lists.digitalpeb...@gmail.com wrote:
 
   As the name runtime/deploy suggest - it is used exactly for that
 purpose
   ;-) Just make sure HADOOP_HOME/bin is added to the path and run the
  script,
   that's all.
   Look at the bottom of the nutch script for details.
  
   Julien
  
   PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
    http://sched.co/1pbE15n) where we'll cover things like these
  
  
  
   On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote:
  
Thanks, can this be used on a hadoop cluster?
   
Sent from my HTC
   
- Reply message -
From: Julien Nioche lists.digitalpeb...@gmail.com
To: user@nutch.apache.org user@nutch.apache.org
Subject: Nutch 1.7 fetch happening in a single map task.
Date: Fri, Aug 29, 2014 9:00 AM
   
See
   
  http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
   
just go to runtime/deploy/bin and run the script from there.
   
Julien
   
   
On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote:
   
 Hi Julien,

 I have 15 domains and they are all being fetched in a single map
 task
which
  does not fetch all the URLs no matter what depth or topN I give.

 I am submitting the Nutch job jar which seems to be using the
   Crawl.java
 class, how do I use the Crawl script on a Hadoop cluster, are there
  any
 pointers you can share?

 Thanks.
 On Aug 29, 2014 4:40 AM, Julien Nioche 
   lists.digitalpeb...@gmail.com
 wrote:

  Hi Meraj,
 
  The generator will place all the URLs in a single segment if all
  they
  belong to the same host for politeness reason. Otherwise it will
  use
  whichever value is passed with the -numFetchers parameter in the
 generation
  step.
 
  Why don't you use the crawl script in /bin instead of tinkering
  with
the
  (now deprecated) Crawl class? It comes with a good default
configuration
  and should make your life easier.
 
  Julien
 
 
  On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com
 wrote:
 
   Hi All,
  
   I am running Nutch 1.7 on Hadoop 2.3.0 cluster and and I
 noticed
   that
  there
   is only a single reducer in the generate partition job. I am
   running
 in
  a
   situation where the subsequent fetch is only running in a
 single
   map
 task
   (I believe as a consequence of a single reducer in the earlier
phase).
  How
   can I force Nutch to do fetch in multiple map tasks , is there
 a
 setting
  to
   force more than one reducers in the generate-partition job to
  have
more
  map
   tasks ?.
  
   Please also note that I have commented out the code in
 Crawl.java
   to
 not
  do
   the LInkInversion phase as , I dont need the scoring of the
 URLS
   that
  Nutch
   crawls, every URL is equally important to me.
  
   Thanks.
  
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Meraj A. Khan
Hi Julien,

I have 15 domains and they are all being fetched in a single map task, which
does not fetch all the URLs no matter what depth or topN I give.

I am submitting the Nutch job jar, which seems to be using the Crawl.java
class. How do I use the crawl script on a Hadoop cluster? Are there any
pointers you can share?

Thanks.
On Aug 29, 2014 4:40 AM, Julien Nioche lists.digitalpeb...@gmail.com
wrote:

 Hi Meraj,

  The generator will place all the URLs in a single segment if they all
  belong to the same host, for politeness reasons. Otherwise it will use
  whichever value is passed with the -numFetchers parameter in the generation
  step.

 Why don't you use the crawl script in /bin instead of tinkering with the
 (now deprecated) Crawl class? It comes with a good default configuration
 and should make your life easier.

 Julien
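
A hedged sketch of the deploy-mode invocation discussed here; the seed
directory, crawl directory, Solr URL and round count are illustrative
placeholders:

  # Set the Hadoop binaries on the PATH outside the script, then run the
  # crawl script from the deploy directory so the .job jar is picked up.
  export PATH="$PATH:$HADOOP_HOME/bin"
  cd runtime/deploy
  bin/crawl urls/ crawldir/ http://localhost:8983/solr/ 10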


 On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com wrote:

  Hi All,
 
  I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there
  is only a single reducer in the generate-partition job. I am running into a
  situation where the subsequent fetch is only running in a single map task
  (I believe as a consequence of the single reducer in the earlier phase).
  How can I force Nutch to do the fetch in multiple map tasks? Is there a
  setting to force more than one reducer in the generate-partition job, so as
  to have more map tasks?
 
  Please also note that I have commented out the code in Crawl.java so as
  not to do the LinkInversion phase, as I don't need the scoring of the URLs
  that Nutch crawls; every URL is equally important to me.
 
  Thanks.
 



 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble



Nutch 1.7 fetch happening in a single map task.

2014-08-27 Thread Meraj A. Khan
Hi All,

I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there is
only a single reducer in the generate-partition job. I am running into a
situation where the subsequent fetch is only running in a single map task (I
believe as a consequence of the single reducer in the earlier phase). How can
I force Nutch to do the fetch in multiple map tasks? Is there a setting to
force more than one reducer in the generate-partition job, so as to have more
map tasks?

Please also note that I have commented out the code in Crawl.java so as not
to do the LinkInversion phase, as I don't need the scoring of the URLs that
Nutch crawls; every URL is equally important to me.

Thanks.


Nutch 1.7 on Hadoop Yarn 2.3.0 performing only 3 rounds of fetching.

2014-08-24 Thread Meraj A. Khan
Hi All,

After spending some time on this, I was able to diagnose the problem: when I
submit the Nutch 1.7 job to a Hadoop YARN cluster, I notice in the Hadoop UI,
which lists the tasks being executed, that only 3 rounds of fetch happen,
even though I have given a depth of 100 and my seed list has 10 URLs.

Any idea why this is happening? Please note that when I run the same Nutch
configuration in local mode, i.e. in Eclipse, it does the appropriate number
of fetches and also fetches all the URLs from all the domains.

Thanks in advance!


Re: Crawl-Delay in robots.txt and fetcher.threads.per.queue config property.

2014-06-26 Thread Meraj A. Khan
Perfect, thank you Julien!


On Thu, Jun 26, 2014 at 10:21 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 
  If I set fetcher.threads.per.queue property to more than 1 , I believe
 the
  behavior would be to have those many number of threads per host from
 Nutch,
  in that case would Nutch still respect the Crawl-Delay directive in
  robots.txt and not crawl at a faster pace than what is specified in
  robots.txt.
 

  In short what I am trying to ask is if setting fetcher.threads.per.queue
  to 1 is required for being as polite as Crawl-Delay in robots.txt
 expects?
 

 Using more than 1 thread per queue will ignore any crawl-delay obtained
 from robots.txt (see

 https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java#L317
 )
 and use the fetcher.server.min.delay configuration which has a default
 value of 0. So yes, setting fetcher.threads.per.queue to 1 is required for
 being as polite as Crawl-Delay in robots.txt expects.

 HTH

 Julien

 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
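
As an aside, a hedged sketch of a politeness-preserving fetch invocation
using the properties discussed above; the values and segment path are purely
illustrative:

  # One thread per host queue, so any robots.txt Crawl-Delay is honoured;
  # fetcher.server.delay is only the fallback delay for hosts whose
  # robots.txt specifies no Crawl-Delay.
  bin/nutch fetch \
    -D fetcher.threads.per.queue=1 \
    -D fetcher.queue.mode=byHost \
    -D fetcher.server.delay=5.0 \
    crawl/segments/20140626000000 -threads 50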



Please share your experience of using Nutch in production

2014-06-22 Thread Meraj A. Khan
Hello Folks,

I have noticed that Nutch resources and mailing lists are mostly geared
towards the usage of Nutch in research-oriented projects. I would like to
know from those of you who are using Nutch in production for large-scale
crawling (vertical or non-vertical) what challenges to expect and how to
overcome them.

I will list a few challenges that I faced below, and if you have faced them
too, I would like to hear how you overcame them.


   1. If I were to build a vertical search engine for websites in a
   particular domain and follow the crawl-delay directive for politeness in
   robots.txt, there is still a possibility that the webmaster could block my
   IP address and I would start getting HTTP 403 forbidden/access denied
   messages. How can I overcome these kinds of issues, other than providing
   full contact info in nutch-site.xml so that the webmaster can get in touch
   with me before blocking me?
   2. The fact that you will be considered just another Nutch variant by the
   webmaster puts you at a great disadvantage, where you could be blocked
   from accessing the website at the whim of the webmaster.
   3. Can anyone share how they overcame this issue when they were starting
   out? Did you establish a relationship with each website owner/master to
   allow unhindered access?
   4. Any other tips and suggestions would also be greatly appreciated.


Thanks.


Re: Please share your experience of using Nutch in production

2014-06-22 Thread Meraj A. Khan
Gora,

Thanks for sharing your admin perspective. Rest assured, I am not trying to
circumvent any politeness requirements in any way; as I mentioned earlier, I
am within the crawl-delay limits set by the webmasters, if any. However, you
have confirmed my hunch that I might have to reach out to individual
webmasters to try to convince them not to block my IP address.

Even if I have as small a number as 100 websites to crawl, it would be a huge
challenge for us to communicate with each and every webmaster; how would one
go about doing that? Also, is there a standard way webmasters list their
contact info, so that we can pitch to or persuade them to allow us to crawl
their websites at a reasonable frequency?

By being at a disadvantage, I meant at a disadvantage compared to major
players like the Google, Bing and Yahoo bots, which the webmasters would
probably not block, and by Nutch variant I meant an instance of a customized
crawler based on Nutch.

Thanks.


On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty g...@mimirtech.com wrote:

 On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote:
 
  Hello Folks,
 
  I have  noticed that Nutch resources and mailing lists are mostly geared
  towards the usage of Nutch in research oriented projects , I would like
 to
  know from those of you who are using Nutch in production for large scale
  crawling (vertical or non-vertical) about what challenges to expect and
 how
  to overcome them.
 
  I will list a few  challenges that  I faced below and would like to hear
  from if you faced these challenges you on how you overcame these.
 
 
 1. If I were to go for a vertical search engine for websites in a
 particular domain  and follow the crawl-delay directive for
 politeness in
 the robots.txt , there is a possibility that the web master could
 still
 block my IP address and I start getting HTTP 403 forbidden/access
 denied
 messages. How can I  overcome these kind of issues , other than
 providing
 full contact info in the nutch-site.xml for the web master to get in
 touch
 with me, before blocking me ?.

  Er, providing full contact info is just basic politeness, and IMHO it
  should become a requirement for Nutch. If you are going to hit some
 sites particularly hard, with good reasons, try contacting the website
 administrators and explaining to them why you need such access. We
 both administer, and crawl sites, and as an administrator I am quite
 willing to accept reasonable requests. After all, it is also our goal
 to promote our websites, and already most traffic on the web is
 through search engines.

 2. The fact that you will be considered as just another Nutch variant
 by
 web master puts you at a great level of dis-advantage , where you
 could be
 blocked from accessing the web site at the whims of the web master.

 Not sure what you mean by just another Nutch variant, nor why you
 think that it puts you at a disadvantage. Disadvantage compared to
 whom? Also, whims of the web master? Really? After all, it is their
 resources that you are using, and they are perfectly within their
 rights to ban you if they feel, for whatever reason, that you are
 abusing such resources.

 3. Can anyone share info as to how they overcame this issue when they
 were starting out , did you establish a relationship with each website
 owner/master to allows unhindered access ?
 4. Any other tips and suggestions would also be greatly appreciated.

 Sorry if I am misreading the above, but what you are asking for smells
 like trying to circumvent reasonable requirements. Yes, do try talking
 to website administrators. You might find them to be surprisingly
 accommodating if you are reasonable in return.

 Regards,
 Gora



Re: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

2014-06-22 Thread Meraj A. Khan
Sebastian,

Thanks for the clear explanation; I have a couple of related questions.


   1. If I set the fetcher.threads.per.host (or the renamed
   fetcher.threads.per.queue) property to more than the default of 1, would
   my crawler still be within the crawl-delay limits for each host as
   specified in its robots.txt?
   2. It looks like the maximum we set via fetcher.threads.per.host only
   comes into play when the total number of threads for the map task is less
   than the value we specify in the fetcher.threads.fetch property?

Thanks.


On Sun, Jun 22, 2014 at 2:13 PM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

 Hi,

  1. fetcher.threads.per.host: 10*3 = 30
 Correct. But if there are 1000 hosts you hardly
 would set it to 3000, see question 2.

 Keep in mind, that the property has been renamed into
 fetcher.threads.per.queue with Nutch 1.4!
 A queue can be defined by host or ip, see fetcher.queue.mode.

  2. fetcher.threads.fetch
 If there are many hosts you would set fetcher.threads.per.host
 to 1 (the default), and use fetcher.threads.fetch to limit the
 load on your system (esp. to limit the network load).

  3. in distributed mode
 All URLs from the same host are placed in the same partition.
 This ensures that host-level blocking can be done in one single
 JVM.

 Sebastian


 On 06/22/2014 05:51 PM, S.L wrote:
  Hi All,
 
  I would like to know the relationship between the two config properties
  *fetcher.threads.fetch* and *fetcher.threads.per.host*.
 
 
  1. If, let's say, I am crawling 10 hosts in my seed file and set the
  fetcher.threads.per.host property to 3, should I set the
  fetcher.threads.fetch property to 10*3, i.e. 30?
  2. I can understand the *fetcher.threads.per.host* property as it is
  self-explanatory: it means the number of concurrent connections to a
  particular host. However, I am not able to clearly follow what
  *fetcher.threads.fetch* does.
  3. Also, I would like to know how the *fetcher.threads.per.host* property
  comes into play in distributed mode.
 
 
 
  Thanks in advance.
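
To make the interplay concrete, a small hedged sketch with illustrative
numbers: fetcher.threads.fetch sizes the whole thread pool of a fetcher task
(the -threads flag sets the same property), while fetcher.threads.per.queue
caps how many of those threads may hit one host at a time.

  # With 10 hosts, a pool of 30 threads and a per-host cap of 3, at most
  # 10 * 3 = 30 threads can be busy at once; a larger pool would simply idle.
  # Note from the thread above: any robots.txt Crawl-Delay is ignored once
  # the per-queue value is raised above 1.
  bin/nutch fetch \
    -D fetcher.threads.per.queue=3 \
    crawl/segments/20140622000000 -threads 30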