Does anybody know the answer to my question below? Let me know if the question is not clear.
Is it possible to use the same crawldb but store segment data in a different directory for consecutive crawls using the "bin/nutch crawl" command? As I understand it, there is no option to specify the path to the crawldb or linkdb, only the path to a directory into which all crawl data (crawldb, linkdb and segments) is saved. I'm using Nutch 1.5. If it is possible, what would the crawl command look like?

Thanks in advance!
Max

-----Original Message-----
From: Lewis John Mcgibbney [mailto:[email protected]]
Sent: 27 August 2012 15:03
To: [email protected]
Subject: Re: recrawl a URL?

The crawldb needs to receive updates from the data in fetched segments; once you generate, it will calculate what needs to be fetched in the next iteration. It is OK to store segments in different locations, but typically you would want to maintain one crawldb for all of your records... unless of course you have an alternative usage scenario.

Lewis

On Mon, Aug 27, 2012 at 1:51 PM, Max Dzyuba <[email protected]> wrote:
> Do I understand correctly that if I use the "bin/nutch crawl" command and store the crawl data into a new directory every time, then there is no way for Nutch to know whether a page has changed since the last crawl?
>
> Thanks,
> Max
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: 24 August 2012 21:26
> To: [email protected]; [email protected]
> Subject: RE: recrawl a URL?
>
> No, the CrawlDatum's status field will be set to db_notmodified if the signatures match, regardless of the HTTP headers. The header only sets fetch_notmodified, but that is not relevant for the db_* status.
>
> -----Original message-----
>> From: [email protected] <[email protected]>
>> Sent: Fri 24-Aug-2012 20:14
>> To: [email protected]; [email protected]
>> Subject: Re: recrawl a URL?
>>
>> This will work only for URLs that have If-Modified-Since headers, but most URLs do not have this header.
>>
>> Thanks.
>> Alex.
>>
>> -----Original Message-----
>> From: Max Dzyuba <[email protected]>
>> To: Markus Jelsma <[email protected]>; user <[email protected]>
>> Sent: Fri, Aug 24, 2012 9:02 am
>> Subject: RE: recrawl a URL?
>>
>> Thanks again! I'll have to test it more, then, in my 1.5.1.
>>
>> Best regards,
>> Max
>>
>> Markus Jelsma <[email protected]> wrote:
>> Hmm, I had to look it up, but it is supported in 1.5 and 1.5.1:
>>
>> http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
>>
>> -----Original message-----
>> > From: Max Dzyuba <[email protected]>
>> > Sent: Fri 24-Aug-2012 17:35
>> > To: Markus Jelsma <[email protected]>; [email protected]
>> > Subject: RE: recrawl a URL?
>> >
>> > Thank you for the reply. Does that mean it is not supported in the latest stable release of Nutch?
>> >
>> > -----Original Message-----
>> > From: Markus Jelsma [mailto:[email protected]]
>> > Sent: 24 August 2012 17:21
>> > To: [email protected]; Max Dzyuba
>> > Subject: RE: recrawl a URL?
>> >
>> > Hi,
>> >
>> > Trunk has a feature for this: indexer.skip.notmodified
>> >
>> > Cheers
>> >
>> > -----Original message-----
>> > > From: Max Dzyuba <[email protected]>
>> > > Sent: Fri 24-Aug-2012 17:19
>> > > To: [email protected]
>> > > Subject: recrawl a URL?
>> > >
>> > > Hello everyone,
>> > >
>> > > I run a crawl command every day, but I don't want Nutch to submit an update to Solr if a particular page hasn't changed. How do I achieve that? Right now the value of db.fetch.interval.default doesn't seem to help prevent the recrawl, since the updates are submitted to Solr as if the page had changed; I know for sure that the page has not changed. This happens for every new crawl command.
>> > >
>> > > Thanks in advance,
>> > >
>> > > Max

--
Lewis
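
For reference, the approach Lewis describes -- one persistent crawldb (and linkdb) fed by segments that can live wherever you like -- maps onto the individual Nutch 1.x commands rather than the all-in-one "bin/nutch crawl". The sketch below is only an illustration of that flow, not something taken from the thread: the paths, the Solr URL, the -topN value and the date-stamped segments directory are placeholder assumptions, and the exact option syntax (solrindex in particular) should be checked against your own Nutch 1.5 install by running each command with no arguments.

# Sketch only: one shared crawldb/linkdb, a fresh segments directory per run.
# All paths, the Solr URL and -topN are placeholders.

CRAWLDB=crawl/crawldb                      # kept across every run
LINKDB=crawl/linkdb                        # kept across every run
SEGMENTS=crawl/segments-$(date +%Y%m%d)    # different directory each run
SOLR=http://localhost:8983/solr            # placeholder Solr URL

bin/nutch inject $CRAWLDB urls             # seed the shared crawldb (first run, or when adding seeds)
bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000
SEGMENT=$SEGMENTS/$(ls -t $SEGMENTS | head -1)   # the segment generate just created

bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWLDB $SEGMENT       # feed results back into the single crawldb

bin/nutch invertlinks $LINKDB -dir $SEGMENTS
bin/nutch solrindex $SOLR $CRAWLDB -linkdb $LINKDB $SEGMENT   # option syntax varies across 1.x releases; check your version's usage

Because every run updates the same crawldb, pages whose signatures have not changed keep the db_notmodified status Markus describes, and with the indexer.skip.notmodified property he mentions (supported in 1.5/1.5.1 per the IndexerMapReduce link above) those pages can be skipped at indexing time instead of being resent to Solr.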

