The crawldb needs to receive updates from the data in fetched segments
(via the updatedb step); once that is done, generate will calculate what
needs to be fetched in the next iteration. It is OK to store segments in
different locations, but typically you would want to maintain one crawldb
for all of your records... unless of course you have an alternative usage
scenario.
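
For reference, one iteration of the classic 1.x cycle against a single
shared crawldb might look like the sketch below (paths are illustrative,
not prescribed; this assumes a local runtime, not a deploy job):

```shell
# One iteration of the fetch cycle, all feeding a single crawldb.
# Paths below are illustrative; adjust to your own layout.
bin/nutch inject crawl/crawldb urls/             # seed the crawldb (first run only)
bin/nutch generate crawl/crawldb crawl/segments  # pick URLs due for fetching
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch $SEGMENT                         # fetch the generated URLs
bin/nutch parse $SEGMENT                         # parse fetched content
bin/nutch updatedb crawl/crawldb $SEGMENT        # fold results back into the crawldb
```

Because updatedb always writes back into the same crawldb, the next
generate run knows what was already fetched, regardless of where the
segments themselves live.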

Lewis

On Mon, Aug 27, 2012 at 1:51 PM, Max Dzyuba <[email protected]> wrote:
> Do I understand correctly that if I use the "bin/nutch crawl" command and store 
> the crawl data in a new directory every time, then there is no way for 
> Nutch to know whether a page has changed since the last crawl?
>
> Thanks,
> Max
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: den 24 augusti 2012 21:26
> To: [email protected]; [email protected]
> Subject: RE: recrawl a URL?
>
> No, the CrawlDatum's status field will be set to db_notmodified if the 
> signatures match, regardless of the HTTP headers. The header only sets 
> fetch_notmodified, but that is not relevant for the db_* status.
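
In other words, the decisive comparison is the content signature, not the
protocol-level status. A minimal sketch of that rule (a hypothetical
helper for illustration, not the actual Nutch source) could look like:

```java
import java.util.Arrays;

// Hypothetical illustration of the rule described above: the db_* status
// is driven by the content signature, not by the HTTP 304 handling.
public class SignatureStatus {

    // Returns the db status a record would end up with, assuming the
    // signature comparison is the only criterion (a simplification).
    public static String dbStatus(byte[] oldSignature, byte[] newSignature) {
        if (Arrays.equals(oldSignature, newSignature)) {
            return "db_notmodified"; // identical content => not modified
        }
        return "db_fetched";         // content changed => freshly fetched
    }

    public static void main(String[] args) {
        byte[] a = {0x0a, 0x0b};
        byte[] b = {0x0a, 0x0b};
        byte[] c = {0x0c};
        System.out.println(dbStatus(a, b)); // db_notmodified
        System.out.println(dbStatus(a, c)); // db_fetched
    }
}
```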
>
>
>
> -----Original message-----
>> From:[email protected] <[email protected]>
>> Sent: Fri 24-Aug-2012 20:14
>> To: [email protected]; [email protected]
>> Subject: Re: recrawl a URL?
>>
>> This will work only for URLs that have If-Modified-Since headers. But most 
>> URLs do not have this header.
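
The conditional-GET mechanism referred to here can be observed by hand; a
sketch (example.org is a placeholder host, and the behaviour depends
entirely on the server):

```shell
# A server that honors If-Modified-Since replies "304 Not Modified"
# instead of resending the page body.
curl -sI http://example.org/ \
  -H "If-Modified-Since: Fri, 24 Aug 2012 00:00:00 GMT"
# Servers that ignore the header simply return 200 with the full body,
# which is why Nutch also compares content signatures.
```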
>>
>> Thanks.
>> Alex.
>>
>>
>> -----Original Message-----
>> From: Max Dzyuba <[email protected]>
>> To: Markus Jelsma <[email protected]>; user
>> <[email protected]>
>> Sent: Fri, Aug 24, 2012 9:02 am
>> Subject: RE: recrawl a URL?
>>
>>
>> Thanks again! I'll have to test it more in my 1.5.1, then.
>>
>>
>>
>>
>> Best regards,
>> Max
>>
>> Markus Jelsma <[email protected]> wrote:
>> Hmm, I had to look it up but it is supported in 1.5 and 1.5.1:
>>
>> http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
>>
>>
>> -----Original message-----
>> > From:Max Dzyuba <[email protected]>
>> > Sent: Fri 24-Aug-2012 17:35
>> > To: Markus Jelsma <[email protected]>;
>> > [email protected]
>> > Subject: RE: recrawl a URL?
>> >
>> > Thank you for the reply. Does that mean it is not supported in the
>> > latest stable release of Nutch?
>> >
>> >
>> > -----Original Message-----
>> > From: Markus Jelsma [mailto:[email protected]]
>> > Sent: den 24 augusti 2012 17:21
>> > To: [email protected]; Max Dzyuba
>> > Subject: RE: recrawl a URL?
>> >
>> > Hi,
>> >
>> > Trunk has a feature for this: indexer.skip.notmodified
>> >
>> > Cheers
>> >
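
For anyone finding this thread later: the property mentioned above goes
in conf/nutch-site.xml. A minimal fragment (assuming a build that
supports it, i.e. trunk at the time of this thread) would be:

```xml
<!-- In conf/nutch-site.xml: skip documents whose status is
     db_notmodified at indexing time, so unchanged pages are not
     re-submitted to Solr. -->
<property>
  <name>indexer.skip.notmodified</name>
  <value>true</value>
</property>
```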
>> > -----Original message-----
>> > > From:Max Dzyuba <[email protected]>
>> > > Sent: Fri 24-Aug-2012 17:19
>> > > To: [email protected]
>> > > Subject: recrawl a URL?
>> > >
>> > > Hello everyone,
>> > >
>> > >
>> > >
>> > > I run a crawl command every day, but I don't want Nutch to submit
>> > > an update to Solr if a particular page hasn't changed. How do I
>> > > achieve that? Right now the value of db.fetch.interval.default
>> > > doesn't seem to help prevent the crawl: the updates are submitted
>> > > to Solr as if the page had been changed, even though I know for
>> > > sure that it has not. This happens for every new crawl command.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > Thanks in advance,
>> > >
>> > > Max
>> > >
>> > >
>> >
>> >
>>
>>
>>
>>
>



-- 
Lewis
