Hi David,

> So even If there is any modification made on a fetched
> page before this interval and the crawl job is run, it will still not be
> re-fetched/updated unless this interval is crossed.

Yes. That's correct.

> is there any way to do immediate update?

Yes, provided that you know which documents have been changed, of course.
Have a look at o.a.n.tools.FreeGenerator (Nutch 1.x). Start a segment for
a list of URLs, fetch and parse it, and update CrawlDb.

Sebastian

On 02/01/2013 10:32 AM, David Philip wrote:
> Hi Team,
>
> I have a question on nutch re-crawl. Please let me know if there are any
> links explaining this.
> Q is, Does nutch re-crawl the sites/pages by checking the last date
> modified?
>
> I understand from this
> <http://nutch.wordpress.com/category/recrawling/>website that at the
> time of fresh crawl, the
> *db.default.fetch.interval* parameter will be set to each URL. This
> property will decide whether or not the page should be re-fetched at
> consequent crawls. So even If there is any modification made on a fetched
> page before this interval and the crawl job is run, it will still not be
> re-fetched/updated unless this interval is crossed. If above is the case,
> is there any way to do immediate update?
>
> *Example case:*
> Say We crawled a blog-site that had x blogs , the fetch interval was set to
> 7 days.
> But the blog-user added new blog and also modified an existing blog within
> 2days. So in such case when we want to immediately update our crawl data
> and include newly added changes[not waiting for 7days], what need to be
> done? [nutch merge concept? ]
>
> Awaiting reply,
> Thanks -David
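
The sequence from the reply (generate a segment from a URL list with FreeGenerator, fetch it, parse it, update CrawlDb) could be scripted roughly as below. This is only a sketch under assumptions: Nutch 1.x running in local mode (so segments are ordinary directories), the launcher at `bin/nutch`, and the subcommand names `freegen`, `fetch`, `parse` and `updatedb` as printed by `bin/nutch` when run without arguments. All paths and the helper function names are hypothetical; check the usage output of your Nutch version before relying on it.

```python
import subprocess
from pathlib import Path

NUTCH = "bin/nutch"  # assumed location of the Nutch 1.x launcher script


def newest_segment(segments_dir):
    """Return the newest segment directory.

    Segment names are timestamps, so lexicographic order is
    chronological order.
    """
    segments = sorted(p for p in Path(segments_dir).iterdir() if p.is_dir())
    if not segments:
        raise FileNotFoundError(f"no segments found in {segments_dir}")
    return segments[-1]


def follow_up_commands(crawldb, segment):
    """Build the commands that process one freshly generated segment."""
    seg = str(segment)
    return [
        [NUTCH, "fetch", seg],                   # download the listed URLs
        [NUTCH, "parse", seg],                   # parse the fetched content
        [NUTCH, "updatedb", str(crawldb), seg],  # write results back to CrawlDb
    ]


def refetch(url_dir, crawldb, segments_dir):
    """Re-fetch the URLs listed in url_dir without waiting for the interval."""
    # FreeGenerator builds a segment straight from the URL list,
    # so the fetch interval stored in CrawlDb is not consulted.
    subprocess.run([NUTCH, "freegen", str(url_dir), str(segments_dir)], check=True)
    segment = newest_segment(segments_dir)
    for cmd in follow_up_commands(crawldb, segment):
        subprocess.run(cmd, check=True)
```

A call such as `refetch("changed_urls", "crawl/crawldb", "crawl/segments")` would then process only the pages listed in the text files under the hypothetical `changed_urls` directory.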

