Hi there,

I am trying to figure out the best method to recrawl certain sites.
I am crawling news sites, and they update their front pages quite often, so I
need to crawl their frontpage/index.php etc. frequently and have Nutch fetch
the new links and content.

I cannot find an answer to my question in the mailing-list archive. I also
checked other websites; one of them is quite good at explaining this:

 http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

It is also discussed on Stack Overflow: How to recrawl with
Nutch <http://stackoverflow.com/questions/13873694/how-to-recrawle-nutch>

I followed the instructions found on that website and changed the
db.fetch.interval setting so that a re-fetch starts after 24 hours. I am
hoping that Nutch's adaptive mechanism will assign pages that stay static a
longer fetch interval, so that eventually they are fetched only once every 90
days, while pages that update often will be fetched multiple times per day.
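For reference, this is roughly the nutch-site.xml I ended up with (a sketch; the exact interval values are my own choices, not anything prescribed by the article):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Use the adaptive schedule so intervals shrink for pages that
       change and grow for pages that stay the same. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- Start every new URL at a 24-hour interval (in seconds). -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
  </property>
  <!-- Never let the interval grow beyond 90 days. -->
  <property>
    <name>db.fetch.interval.max</name>
    <value>7776000</value>
  </property>
</configuration>
```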

At the moment I am testing if this works.

I wonder, though: once this set-up works and I start to crawl more news
sites, all of the pages will initially be fetched every 24 hours, and only
from that moment on will the adaptive rules push each one earlier or later.
This is not desirable, as it means that ALL urls will be fetched daily at
first. Is there some way to assign priority to certain pages? For example,
give all index.php/htm/html pages a fetch interval of 60 minutes, and all
other pages 3 days?
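To illustrate the kind of behaviour I am after: if I understand the Injector docs correctly, the seed list can carry per-URL metadata such as nutch.fetchInterval (tab-separated, value in seconds), which would let me inject the index pages with their own interval. Something like this (the URLs are made up, and I am not sure this metadata is honoured in 2.1 with the MySQL backend):

```
http://news.example.com/index.php	nutch.fetchInterval=3600
http://news.example.com/archive/2012/story-1.html	nutch.fetchInterval=259200
```

Is this the right mechanism, or is there a better way to do it for pages discovered during the crawl rather than injected as seeds?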

I use Nutch 2.1 with a MySQL backend on Ubuntu 12.04.

Thanks in advance,

J
