On Wed, Nov 21, 2012 at 2:35 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Joe,
>
> On Wed, Nov 21, 2012 at 9:25 PM, Joe Zhang <[email protected]> wrote:
>
> > Are you saying that as long as I crawl some page once, nutch will go and
> > refetch the page in 30 days by default, without me running the command
> > again?
>
> No, this is impossible (unless you have an automated job running).
> Unless you invoke the command(s), Nutch will do nothing.
> What I was explaining is that Nutch by default (assuming you are using
> the default fetching schedule) sets the next fetch time 30 days ahead,
> although I think this is configurable within nutch-site.xml.
>
>
Thanks for the clarification.
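
To check my understanding, here is a minimal sketch of the scheduling rule
as described above. It is not Nutch's actual code: the function names and
the 30-day constant are my own assumptions for illustration, and nothing
here runs unless someone calls it, just as Nutch does nothing until the
commands are invoked.

```python
from datetime import datetime, timedelta

# Assumed default interval of 30 days, as described in the reply above.
DEFAULT_FETCH_INTERVAL = timedelta(days=30)


def next_fetch_time(last_fetch, interval=DEFAULT_FETCH_INTERVAL):
    """Return the earliest time at which the URL becomes due again."""
    return last_fetch + interval


def is_due(last_fetch, now, interval=DEFAULT_FETCH_INTERVAL):
    """A URL is eligible for refetching once its next fetch time has passed."""
    return now >= next_fetch_time(last_fetch, interval)


last_fetch = datetime(2012, 11, 21)
print(is_due(last_fetch, datetime(2012, 12, 1)))   # False: only 10 days later
print(is_due(last_fetch, datetime(2012, 12, 25)))  # True: more than 30 days later
```

So a URL only becomes eligible after the interval; the refetch itself still
happens only when the crawl commands are run again.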


> >>
> > So refetching will always happen, even if the same URL already exists
> > in the Solr index (and is used as the ID field)?
> >
>
> Nutch and Solr are different systems and Solr doesn't know very much
> about Nutch at all. I think you would really benefit from trying
> to learn a bit about the crawldb and its constituent parts... and
> consequently what role these parts/features play within maintaining a
> crawl cycle.
> Refetching of a URL will occur once its specified next fetch time has
> passed, regardless of what is in your Solr index.
>
>
>
Yes, I'm doing it.

So, when Nutch recrawls a URL (based on whatever schedule is specified) and
resends the fetched info to Solr, Solr will update the entry in its
index corresponding to the same URL with the new timestamp and the new page
content, correct? Just to confirm.
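
In other words, I am assuming behaviour like the sketch below. A plain dict
stands in for the Solr index and the "id" field plays the role of the unique
key; the function and field names are only illustrative and are not Solr's
or Nutch's API.

```python
def index_document(index, doc):
    """Add a document, overwriting any existing entry with the same id."""
    index[doc["id"]] = doc


index = {}

# First crawl of the page.
index_document(index, {
    "id": "http://example.com/",
    "tstamp": "2012-11-21",
    "content": "old page content",
})

# Recrawl of the same URL: same id, new timestamp and content.
index_document(index, {
    "id": "http://example.com/",
    "tstamp": "2012-12-21",
    "content": "new page content",
})

print(len(index))                               # 1
print(index["http://example.com/"]["tstamp"])   # 2012-12-21
```

That is, one entry per URL, replaced on each recrawl rather than duplicated.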

Thanks again!
