Hello Eddie,

I think the way to do this is to delete that URL (page) from the current
crawldb and then recrawl, but as far as I can see there is still no way to
delete a URL directly from the crawldb (I think PruneDBTool is not
implemented yet!).

If you can't find a way to delete the indexed URL directly, you can try
PruneIndexTool to delete it from the segments and index
(http://wiki.apache.org/nutch/bin/nutch_prune), then delete the crawldb
folder and use updatedb to regenerate the crawldb from the segments.
I don't know if it will work this way, but you can give it a try :)
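
If you go that route, the crawldb rebuild would look roughly like this (the
paths are just examples from my own setup, and I'm writing the commands from
memory, so please double-check them against your Nutch version):

  # first prune the page from the index (see the wiki page above), then:
  rm -r crawl/crawldb                                   # throw away the old crawldb
  bin/nutch updatedb crawl/crawldb -dir crawl/segments  # rebuild it from the existing segments
  bin/nutch readdb crawl/crawldb -stats                 # sanity check the new crawldb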


Also take a look here:
http://netlikon.de/docs/javadoc-nutch/trunk/org/apache/nutch/tools/FreeGenerator.html
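
FreeGenerator builds a fetch segment directly from a plain list of URLs,
without going through the crawldb at all, so you could use it to force a
fetch of just that one index page. Something like this (again from memory,
and the URL/paths below are only placeholders, so adjust them to your setup):

  mkdir -p urls
  echo "http://lists.example.com/archive/index.html" > urls/index-page.txt
  bin/nutch org.apache.nutch.tools.FreeGenerator urls crawl/segments
  bin/nutch fetch crawl/segments/<the new segment>      # run bin/nutch parse on it too if your fetcher doesn't parse
  bin/nutch updatedb crawl/crawldb crawl/segments/<the new segment>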
 



Regards,
Ahmad




________________________________
From: Eddie Drapkin <[email protected]>
To: [email protected]
Sent: Thu, July 15, 2010 9:06:55 PM
Subject: Force recrawl of exactly one URL?

Hello,

I'm using Nutch to crawl a mailing list index that we have here
internally. I'd like to be able to force Nutch to recrawl just the
index page - so it can find the mailing list posts that are new since
the last crawl. Will re-injecting the URL into the crawldb accomplish
this or is there some other way to do it? I'd like to set the max
recrawl age high enough that the pages would, theoretically, never get
re-crawled (there's no point, because it's an email archive that's never
going to change) but I can't do that until I'm sure that I can force a
recrawl on this one specific page. Thanks!

Thanks,
Eddie



      
