As I understand it, this is expected, especially if the page is in the list of seeds.
You can control this by changing the relevant config XML variables.

On 24 November 2016 20:10:02 GMT+00:00, Vladimir Loubenski <[email protected]> wrote:
>Hi ,
>I am using Nutch 2.3.1.
>I run in loop generate, fetch, parse, updateDB steps.
>I noted that during re-crawl even if a web page doesn't change nutch
>doesn't detect it by value of ETag, Last-Modified or signature fields
>and continue process all these steps for unchanged web pages.
> Is it expected behaviour?
>Are there plans to fix it in future releases?
>
>Regards,
>Vladimir.
>
>-----Original Message-----
>From: Jim Lamb [mailto:[email protected]]
>Sent: November-22-16 6:22 AM
>To: [email protected]
>Subject: Re: Automating Nutch 2.3.1 on Amazon EMR
>
>
>Further to this, I have found that I can only submit a maximum of 256
>steps to EMR. Some of our crawls take over 100 rounds, so defining an
>arbitrary number of (generate,fetch,parse,updatedb,index,solrdedup)
>rounds each with 6 steps isn't going to work either :-(
>
>Has nobody automated this?
>
>Thanks,
>
>Jim
>
>
>Sent: Thursday, November 17, 2016 at 11:30 AM
>From: "Jim Lamb" <[email protected]>
>To: [email protected]
>Subject: Re: Automating Nutch 2.3.1 on Amazon EMR
>Hi Sebastian,
>
>Thanks for coming back to me.
>
>> Adding
>> set -x
>> to bin/nutch and then running bin/crawl with a sample crawl which
>> includes all steps should log all commands with a full list of
>arguments.
>
>Yes, that's a great idea. Thanks.
>
>> But on EMR it should be possible to directly reference the Nutch job
>> file by a s3:// URL. (but haven't tried it this way)
>
>Yes, that is possible. You add an S3 URL to the Jar= argument in your
>step definition of the create-cluster command.
>
>> aws emr terminate-cluster ...
>
>Ah, yes. I did wonder if the master instance had appropriate instance
>role privilege to do this. I'll try.
>
>Unfortunately, it still doesn't solve the iteration issue. Short of
>defining many many repeated sets of steps, I don't see how I would get
>multiple rounds. What am I missing?
>
>Thanks,
>
>Jim

--
Tom Chiverton
Sent from my phone. Please excuse my brevity.
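
On the iteration question in the quoted thread, one possible workaround (an untested assumption, not something confirmed anywhere in this thread) is to put the whole loop inside a single wrapper script and submit that script as one step, so the number of rounds no longer counts against the step limit. The sketch below shows only the looping logic. `run_rounds` and `one_round` are hypothetical names, and the body of `one_round` is a placeholder for whatever commands `set -x` logs from a sample bin/crawl run.

```shell
#!/bin/sh
# Hypothetical wrapper: run up to MAX_ROUNDS crawl rounds inside ONE step.
# Usage: run_rounds MAX_ROUNDS COMMAND [ARGS...]
# COMMAND is invoked once per round with the round number appended as its
# last argument; a non-zero exit status from it stops the loop early.
run_rounds() {
    max_rounds=$1
    shift
    round=1
    while [ "$round" -le "$max_rounds" ]; do
        echo "round $round"
        "$@" "$round" || { echo "stopping after round $round"; return 0; }
        round=$((round + 1))
    done
}

# Placeholder for one round. Replace the no-op with the generate, fetch,
# parse, updatedb, index and solrdedup commands logged by `set -x`.
one_round() {
    :   # no-op stand-in so the sketch runs as written
}

run_rounds 3 one_round
```

Returning a non-zero status from `one_round` (for example when a round produces nothing new to fetch) ends the loop before the maximum is reached, after which the script could go on to call `aws emr terminate-cluster ...` as suggested earlier in the thread.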

