Hi Sebastian,

Thank you for the reply. I executed the following steps, but I do not see the changes updated. Please correct me if I am wrong.

Run:
- org.apache.nutch.tools.FreeGenerator
  arguments: urls2 crawl/segments
  [urls2/seed.txt - URL of the page that was modified]
- org.apache.nutch.fetcher.Fetcher
  arguments: crawl/segments/*
- org.apache.nutch.parse.ParseSegment
  arguments: crawl/segments/*
- org.apache.nutch.crawl.CrawlDb
  arguments: crawlDBOld/crawldb crawl/segments/*

After this I ran the crawl command:
- org.apache.nutch.crawl.Crawl
  arguments: urls -dir crawlDB_Old -depth 10 -solr http://localhost:8080/solrnutch

But I don't see the changes for that URL in the Solr indexes. Can you please tell me which step was skipped or not executed properly? I want to see the changes reflected in the Solr indexes for that document.

Thanks - David

On Sat, Feb 2, 2013 at 5:27 AM, Sebastian Nagel <[email protected]> wrote:
> Hi David,
>
> > So even If there is any modification made on a fetched
> > page before this interval and the crawl job is run, it will still not be
> > re-fetched/updated unless this interval is crossed.
> Yes. That's correct.
>
> > is there any way to do immediate update?
> Yes, provided that you know which documents have been changed, of course.
> Have a look at o.a.n.tools.FreeGenerator (Nutch 1.x). Start a segment
> for a list of URLs, fetch and parse it, and update CrawlDb.
>
> Sebastian
>
>
> On 02/01/2013 10:32 AM, David Philip wrote:
> > Hi Team,
> >
> > I have a question on nutch re-crawl. Please let me know if there are any
> > links explaining this.
> > Q is, Does nutch re-crawl the sites/pages by checking the last date
> > modified?
> >
> > I understand from this
> > <http://nutch.wordpress.com/category/recrawling/>website that at the
> > time of fresh crawl, the
> > *db.default.fetch.interval* parameter will be set to each URL. This
> > property will decide whether or not the page should be re-fetched at
> > consequent crawls.
> > So even If there is any modification made on a fetched
> > page before this interval and the crawl job is run, it will still not be
> > re-fetched/updated unless this interval is crossed. If above is the case,
> > is there any way to do immediate update?
> >
> > *Example case: *
> > Say We crawled a blog-site that had x blogs , the fetch interval was set to
> > 7 days.
> > But the blog-user added new blog and also modified an existing blog within
> > 2days. So in such case when we want to immediately update our crawl data
> > and include newly added changes[not waiting for 7days], what need to be
> > done? [nutch merge concept? ]
> >
> > Awaiting reply,
> > Thanks -David

