Hello Eddie,

I think the way to do this is to delete that URL (page) from the current crawldb and then recrawl, but as far as I can see there is still no way to delete a URL directly from the crawldb (I think prunedbtool is not implemented yet!). If you don't find a way to delete the URL from the crawldb directly, you can try PruneIndexTool to delete it from the segments and the index (http://wiki.apache.org/nutch/bin/nutch_prune), then delete the crawldb folder and use updatedb to regenerate the crawldb from the segments, as in the sketch below. I don't know if it will work this way, but you can give it a try :)
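Something along these lines (a rough sketch only: the crawl/ paths are just placeholders for your own crawl directory, prune-queries.txt is a file you would write yourself, and the PruneIndexTool options are the ones I remember from the wiki page, so double-check them there):

    # 1) prune the unwanted URL from the index; -dryrun only reports what
    #    would be deleted, prune-queries.txt holds the query matching the URL
    bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/index -dryrun -queries prune-queries.txt

    # 2) throw away the old crawldb
    rm -r crawl/crawldb

    # 3) rebuild the crawldb from the existing segments
    bin/nutch updatedb crawl/crawldb -dir crawl/segments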
Also take a look here: http://netlikon.de/docs/javadoc-nutch/trunk/org/apache/nutch/tools/FreeGenerator.html (a rough sketch of how it might be used follows below the quoted message).

Regards,
Ahmad

________________________________
From: Eddie Drapkin <[email protected]>
To: [email protected]
Sent: Thu, July 15, 2010 9:06:55 PM
Subject: Force recrawl of exactly one URL?

Hello,

I'm using Nutch to crawl a mailing list index that we have here internally. I'd like to be able to force Nutch to recrawl just the index page, so it can find the mailing list posts that are new since the last crawl. Will re-injecting the URL into the crawldb accomplish this, or is there some other way to do it? I'd like to set the max recrawl age high enough that the pages would, in theory, never get re-crawled (there's no point, because it's an email archive that's never going to change), but I can't do that until I'm sure that I can force a recrawl of this one specific page. Thanks!

Thanks,
Eddie
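P.S. A rough sketch of the FreeGenerator idea, assuming a Nutch 1.x layout; the crawl/ paths, the recrawl-urls directory and the URL itself are only placeholders for your own setup:

    # put the one URL you want refetched into a plain text file
    mkdir recrawl-urls
    echo 'http://example.com/mailinglist/index.html' > recrawl-urls/seed.txt

    # generate a new segment containing just that URL, ignoring its fetch schedule
    bin/nutch org.apache.nutch.tools.FreeGenerator recrawl-urls crawl/segments

    # fetch and parse the new segment, then fold it back into the crawldb
    SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"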

