Re: Nutch Incremental Crawl

Sebastian Nagel Fri, 01 Feb 2013 15:57:34 -0800

Hi David,

> So even if a fetched page is modified before this interval has
> elapsed and the crawl job is run, it will still not be
> re-fetched/updated until the interval has elapsed.
Yes. That's correct.
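
To make the rule concrete, here is a minimal sketch in Python. It only
illustrates the scheduling idea and is not Nutch's own code: the function
name is_due and the dates are made up, and it assumes the next fetch time
is simply the last fetch time plus the interval.

```python
from datetime import datetime, timedelta


def is_due(last_fetch: datetime, interval: timedelta, now: datetime) -> bool:
    """Return True once the fetch interval has elapsed since the last fetch."""
    return now >= last_fetch + interval


# Hypothetical dates: the page was fetched on 25 Jan with a 7-day interval.
last_fetch = datetime(2013, 1, 25)
interval = timedelta(days=7)

# Two days later the page is not selected, even if it changed in the meantime.
print(is_due(last_fetch, interval, datetime(2013, 1, 27)))  # False

# Once the full interval has passed, the page becomes eligible again.
print(is_due(last_fetch, interval, datetime(2013, 2, 1)))   # True
```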


> is there any way to do an immediate update?
Yes, provided that you know which documents have changed, of course.
Have a look at org.apache.nutch.tools.FreeGenerator (Nutch 1.x). It
generates a segment directly from a list of URLs. Then fetch and parse
that segment, and update the CrawlDb with it.
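
The steps above could be scripted roughly as follows. Treat this as a sketch
under assumptions, not a tested recipe: it assumes Nutch 1.x in local mode,
that bin/nutch is reachable from the working directory, that the commands
take the forms "freegen <inputDir> <segmentsDir>", "fetch <segment>",
"parse <segment>" and "updatedb <crawldb> <segment>", and that segment
directory names are timestamps, so the newest one sorts last. All Python
function names and paths are made up for illustration.

```python
import subprocess
from pathlib import Path


def write_url_list(urls, url_dir):
    """Write one URL per line to url_dir/urls.txt and return the file path."""
    url_dir = Path(url_dir)
    url_dir.mkdir(parents=True, exist_ok=True)
    path = url_dir / "urls.txt"
    path.write_text("".join(url + "\n" for url in urls))
    return path


def newest_segment(segments_dir):
    """Return the segment directory whose name sorts last (assumed newest)."""
    segments = sorted(p for p in Path(segments_dir).iterdir() if p.is_dir())
    if not segments:
        raise FileNotFoundError(f"no segments found in {segments_dir}")
    return segments[-1]


def freegen_command(url_dir, segments_dir, nutch="bin/nutch"):
    """Build the command that generates a segment from the URL list."""
    return [nutch, "freegen", str(url_dir), str(segments_dir)]


def followup_commands(segment, crawldb, nutch="bin/nutch"):
    """Build the fetch, parse and updatedb commands for one segment."""
    return [
        [nutch, "fetch", str(segment)],
        [nutch, "parse", str(segment)],
        [nutch, "updatedb", str(crawldb), str(segment)],
    ]


def recrawl(urls, url_dir, segments_dir, crawldb, nutch="bin/nutch"):
    """Re-fetch the given URLs immediately and record the result in the CrawlDb."""
    write_url_list(urls, url_dir)
    subprocess.run(freegen_command(url_dir, segments_dir, nutch), check=True)
    segment = newest_segment(segments_dir)
    for command in followup_commands(segment, crawldb, nutch):
        subprocess.run(command, check=True)
```

A call such as recrawl(["http://example.com/blog/changed-post"],
"urls_changed", "crawl/segments", "crawl/crawldb") would then run the four
steps in order. The URL and directory names here are placeholders.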

Sebastian


On 02/01/2013 10:32 AM, David Philip wrote:
> Hi Team,
> 
>    I have a question about Nutch re-crawling. Please let me know if there
> are any links explaining this.
> The question is: does Nutch re-crawl sites/pages by checking the date they
> were last modified?
> 
> I understand from this website
> <http://nutch.wordpress.com/category/recrawling/> that at the time of a
> fresh crawl, the *db.default.fetch.interval* parameter is applied to each
> URL. This property decides whether or not the page should be re-fetched in
> subsequent crawls. So even if a fetched page is modified before this
> interval has elapsed and the crawl job is run, it will still not be
> re-fetched/updated until the interval has elapsed. If that is the case,
> is there any way to do an immediate update?
> 
> Example case:
> Say we crawled a blog site that had x blog posts, and the fetch interval
> was set to 7 days.
> But the blogger added a new post and also modified an existing post within
> 2 days. In such a case, when we want to update our crawl data immediately
> and include the new changes (not waiting for 7 days), what needs to be
> done? [nutch merge concept?]
> 
> 
> 
> 
> Awaiting reply,
> Thanks -David
>
