I have Nutch set up to crawl my local filesystem and have it linked to Solr.

Everything works fine, except that when I delete a previously indexed
document and then recrawl using the ./nutch crawl command, the document
doesn't seem to be registered with status DB_GONE.  After the recrawl I run
the "./nutch readdb <crawldb> -stats" command, and the deleted documents
are marked as unfetched.

The weird thing is that if I add 404 purging to my nutch-site.xml file, the
links for the deleted documents are removed, so it seems that a document
may be marked as DB_GONE during the crawl but is no longer marked that way
at the end of the crawl.

If you need to know any of my configuration settings, you can check the
posts on my blog, which are written as set-up guides:

http://amac4.blogspot.co.uk/

Thanks
Allan



