I think Markus explained the best part of the solution in a related thread the other day.
It may require a combination of configuration options, but setting logical web db properties in nutch-site is essential (amongst others).

________________________________________
From: Bupo Jung [[email protected]]
Sent: 25 May 2011 02:50
To: [email protected]
Subject: Re: How to re-fetch all the modified page?

Thanks for your reply.
The website I am trying to fetch is a social BBS which can produce tens of thousands of new threads and replies. When a new reply is added to a thread, the page may change. In my case I need to re-fetch the changed pages and the new pages every few hours (every hour if possible).

2011/5/24 McGibbney, Lewis John <[email protected]>

> What is the nature of the webpages that are changing? Twitter feeds, news
> streams?
>
> Do you have any indication of how frequently they are changing? If you
> think it is very frequent then I would suggest setting Nutch up as a cron
> job. If you are indexing to Solr, using the dedup and clean commands as well as
> setting specific properties in nutch-site would allow you to maintain a
> pretty healthy representation of the web graph this way.
>
> Lewis
>
> ________________________________________
> From: Bupo Jung [[email protected]]
> Sent: 24 May 2011 16:31
> To: [email protected]
> Subject: How to re-fetch all the modified page?
>
> Hi,
> In my case, the webpage content may be modified frequently, and I want to
> re-fetch the modified pages as soon as possible.
> I have read the Nutch wiki about Intranet recrawl. It only considers the
> db.fetch.interval property to decide whether to re-fetch the page.
> What can I do?
> Any ideas?
>
> thanks.
> bupo.jung
> --
>
> Email has been scanned for viruses by Altman Technologies' email management
> service - www.altman.co.uk/emailsystems
>
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
>
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
>

--
Yizhong Zhuang
Beijing University of Posts and Telecommunications
Email:[email protected]
Myblog:www.mikkoo.info
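
For anyone reading this thread later, here is a rough sketch of the kind of web db properties in nutch-site that the reply at the top alludes to. It is not necessarily what Markus described in the other thread. The property names are the ones I believe ship in nutch-default.xml for Nutch 1.x (check them against your own version); every value is an assumption chosen only to illustrate an hourly re-fetch target, and whether the adaptive schedule suits a busy BBS is untested here.

```xml
<!-- Hypothetical overrides for conf/nutch-site.xml; all values are illustrative assumptions. -->
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <!-- Seconds between re-fetches of a page; 3600 = one hour (assumed target). -->
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.schedule.class</name>
    <!-- Adjusts each page's interval depending on whether it changed since the last fetch. -->
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <!-- Lower bound in seconds for the adaptive interval (assumed: one hour). -->
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <!-- Upper bound in seconds for pages that rarely change (assumed: one day). -->
    <value>86400</value>
  </property>
</configuration>
```

Settings like these only decide which pages are due. The crawl cycle still has to be run at least as often as the shortest interval, for example from the cron job suggested in the quoted reply.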

