Recrawl and crawl-urlfilter.txt

Joshua J Pavel Fri, 12 Mar 2010 12:10:00 -0800


I'm having several problems recrawling with Nutch 0.9.  Here are two
questions.  :-)


Right now, using the script I found here (
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
), I think I'm close to a workable solution, but the recrawl doesn't
respect crawl-urlfilter.txt.  Is there a way to make the recrawl use
this filter configuration?

Our final implementation will be a single-site crawl with
close-to-realtime search results (ideally, we'll recrawl every 30
minutes to 1 hour).  In that regard, is there any way to have Nutch
honor HTTP cache-validation responses (304 Not Modified) instead of
relying only on the fetch interval set in the configuration file?
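
To be clear about the behaviour I mean: below is a minimal Python sketch
of a conditional GET, only to illustrate the HTTP exchange.  It is not
Nutch code, and the function names (conditional_request, needs_refetch,
fetch_if_modified) are made up for this example.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_request(url, last_fetch_epoch):
    """Build a GET that lets the server answer 304 if the page is unchanged."""
    req = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT.
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return req


def needs_refetch(status_code):
    """A 304 means the stored copy is still current; anything else is re-processed."""
    return status_code != 304


def fetch_if_modified(url, last_fetch_epoch):
    """Return (status, body); body is None when the server replies 304."""
    req = conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError, so handle it explicitly.
        if err.code == 304:
            return 304, None
        raise
```

What I'd like is for the recrawl to skip re-parsing and re-indexing any
page for which the server answers 304 in this way.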

Thanks!
-Josh Pavel
