What is the nature of the webpages that are changing? Twitter feeds, news streams?
Do you have any indication of how frequently they change? If it is very frequent, I would suggest setting Nutch up as a cron job. If you are indexing to Solr, using the dedup and clean commands, together with setting specific properties in nutch-site, would let you maintain a pretty healthy representation of the web graph this way.

Lewis
________________________________________
From: Bupo Jung [[email protected]]
Sent: 24 May 2011 16:31
To: [email protected]
Subject: How to re-fetch all the modified page?

Hi,

In my case, the webpage content may be modified frequently, and I want to re-fetch the modified pages as soon as possible. I have read the Nutch wiki page about Intranet recrawl. It only considers the db.fetch.interval property to decide whether to re-fetch a page. How can I do this? Any ideas?

Thanks.
bupo.jung
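
The cron-job suggestion in the reply above could be sketched roughly as follows. This is only an illustration and not a tested setup: it assumes a Nutch 1.x layout where `bin/nutch` offers the `generate`, `fetch`, `parse`, `updatedb`, `solrdedup` and `solrclean` commands (names and arguments differ between releases, so check the usage output of `bin/nutch` first). The paths, the Solr URL and the function names are made up for the example, and the Solr indexing step is left out because its arguments vary by version. Pages only become due for refetch once their fetch interval has elapsed, so the interval configured in nutch-site still has to be short enough.

```python
import os
import subprocess

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def newest_segment(segments_dir):
    """Return the latest segment directory.

    Assumes segments are named by timestamp (e.g. 20110524163100),
    so the lexicographically largest name is the most recent one.
    """
    names = [n for n in os.listdir(segments_dir)
             if os.path.isdir(os.path.join(segments_dir, n))]
    if not names:
        raise RuntimeError("no segments found in " + segments_dir)
    return os.path.join(segments_dir, max(names))


def steps_after_generate(crawldb, segment, solr_url):
    """Build (but do not run) the commands that follow 'generate'."""
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],
        # The Solr indexing step would go here; omitted in this sketch.
        [NUTCH, "solrdedup", solr_url],           # drop duplicate documents
        [NUTCH, "solrclean", crawldb, solr_url],  # drop pages marked as gone
    ]


def recrawl_once(crawldb, segments_dir, solr_url):
    """Run one recrawl round; meant to be called from a cron-driven script."""
    subprocess.check_call([NUTCH, "generate", crawldb, segments_dir])
    segment = newest_segment(segments_dir)
    for cmd in steps_after_generate(crawldb, segment, solr_url):
        subprocess.check_call(cmd)
```

A small wrapper script calling something like `recrawl_once("crawl/crawldb", "crawl/segments", "http://localhost:8983/solr")` could then be scheduled from crontab at whatever frequency matches how often the pages change.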

