What kind of webpages are changing? Twitter feeds, news streams?

Do you have any indication of how frequently they change? If it is very
frequent, I would suggest running Nutch as a cron job. If you are indexing
to Solr, using the dedup and clean commands, as well as setting specific
properties in nutch-site, would allow you to maintain a pretty healthy
representation of the web graph this way.
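
As a rough sketch of the nutch-site side of this, assuming a Nutch 1.x
install where the default re-fetch interval is set by
db.fetch.interval.default (in seconds). The one-hour value is only an
illustrative assumption, not a recommendation:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed example value: re-fetch pages after 1 hour (3600 s)
       instead of the stock default of 30 days (2592000 s).
       Tune this to how often your pages actually change. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>3600</value>
  </property>
</configuration>
```

With a short interval like this, each cron-driven generate/fetch/updatedb
cycle should pick up the pages whose interval has expired.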

Lewis

________________________________________
From: Bupo Jung [[email protected]]
Sent: 24 May 2011 16:31
To: [email protected]
Subject: How to re-fetch all the modified page?

Hi,
In my case, the webpage content may be modified frequently, and I want to
re-fetch the modified pages as soon as possible.
I have read the Nutch wiki page on Intranet recrawl. It only considers the
db.fetch.interval property to decide whether to re-fetch a page.
What can I do?
Any ideas?

thanks.
bupo.jung