As I understand it, that is expected, especially if the page is in the seed list.

You can control this by changing the relevant properties in the Nutch XML configuration.
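
As a rough sketch, and assuming the properties in question are the fetch-schedule ones listed in nutch-default.xml, overrides in conf/nutch-site.xml might look like the following. The values are illustrative only, and this has not been verified against how 2.3.1 handles ETag or Last-Modified.

```xml
<!-- conf/nutch-site.xml: illustrative overrides; the values are assumptions -->
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <!-- seconds to wait before re-fetching a page; 2592000 is 30 days -->
    <value>2592000</value>
  </property>
  <property>
    <name>db.fetch.schedule.class</name>
    <!-- adaptive schedule: changes the re-fetch interval depending on
         whether a page was found modified on the last fetch -->
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
</configuration>
```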

On 24 November 2016 20:10:02 GMT+00:00, Vladimir Loubenski 
<[email protected]> wrote:
>Hi,
>I am using Nutch 2.3.1.
>I run the generate, fetch, parse, and updatedb steps in a loop.
>I noticed that during a re-crawl, even if a web page hasn't changed, Nutch
>doesn't detect this from the ETag, Last-Modified, or signature fields
>and continues to process all of these steps for the unchanged pages.
>Is this expected behaviour?
>Are there plans to fix it in future releases?
>
>Regards,
>Vladimir.
>
>-----Original Message-----
>From: Jim Lamb [mailto:[email protected]] 
>Sent: November-22-16 6:22 AM
>To: [email protected]
>Subject: Re: Automating Nutch 2.3.1 on Amazon EMR
>
>
>Further to this, I have found that I can only submit a maximum of 256
>steps to EMR. Some of our crawls take over 100 rounds, so defining an
>arbitrary number of (generate,fetch,parse,updatedb,index,solrdedup)
>rounds each with 6 steps isn't going to work either :-(
>
>Has nobody automated this?
>
>Thanks,
>
>Jim
> 
>
>Sent: Thursday, November 17, 2016 at 11:30 AM
>From: "Jim Lamb" <[email protected]>
>To: [email protected]
>Subject: Re: Automating Nutch 2.3.1 on Amazon EMR
>
>Hi Sebastian,
>
>Thanks for coming back to me.
>
>> Adding
>> set -x
>> to bin/nutch and then running bin/crawl with a sample crawl which 
>> includes all steps should log all commands with a full list of
>> arguments.
>
>Yes, that's a great idea. Thanks.
>
>> But on EMR it should be possible to directly reference the Nutch job 
>> file by a s3:// URL. (but haven't tried it this way)
>
>Yes, that is possible. You add an S3 URL to the Jar= argument in your
>step definition of the create-cluster command.
>
>> aws emr terminate-cluster ...
>
>Ah, yes. I did wonder if the master instance had appropriate instance
>role privilege to do this. I'll try.
>
>Unfortunately, it still doesn't solve the iteration issue. Short of
>defining many many repeated sets of steps, I don't see how I would get
>multiple rounds. What am I missing?
>
>Thanks,
>
>Jim
>

-- 
Tom Chiverton 
Sent from my phone. Please excuse my brevity. 