hi there, I am trying to figure out the best method to recrawl certain sites. I am crawling news sites, and they update their front pages quite often, so I need to crawl their frontpage/index.php etc. frequently and have Nutch fetch the new links and content.
I could not find an answer to my question in the mailing-list archive. I also checked other websites; one of them explains it quite well: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ The same approach is discussed here: How to recrawle nutch <http://stackoverflow.com/questions/13873694/how-to-recrawle-nutch>

I followed the instructions found on that website and changed the fetch interval (db.fetch.interval.default) so that a re-fetch starts after 24 hours, hoping that Nutch's adaptive scheduling will assign static pages a longer interval, so that eventually they are fetched only once every 90 days, while pages that update often are fetched multiple times per day. At the moment I am testing whether this works.

I wonder, though: once this set-up works and I start to crawl more news sites, all of their pages will initially be fetched every 24 hours, and only from that point on will the interval be adjusted up or down. This is not desirable, as it means that ALL urls will be fetched daily at first. Is there some way to assign a priority to certain pages? For example, give all index.php/htm/html pages a fetch interval of 60 minutes, and all other pages 3 days?

I use Nutch 2.1 with a MySQL backend on Ubuntu 12.04.

Thanks in advance,
J
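To make clearer what kind of configuration I mean, here is a minimal sketch of my current nutch-site.xml settings for adaptive scheduling. The property names are the standard Nutch fetch-schedule ones, but the interval values are just the ones I am experimenting with, not recommendations:

```xml
<!-- nutch-site.xml (sketch): adaptive re-fetch scheduling.
     Values below are my experimental settings, not defaults. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- first re-fetch after 24 hours (seconds) -->
  <value>86400</value>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <!-- never let the interval grow beyond 90 days -->
  <value>7776000</value>
</property>
```

If I understand correctly, the injector can also read per-URL metadata from the seed file, e.g. a line like `http://example.com/index.php  nutch.fetchInterval=3600` to pin a front page to a 60-minute interval; I am not sure whether this works the same way in Nutch 2.1 with the MySQL backend, which is part of my question.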

