I am having trouble removing deleted file: URLs on subsequent crawls.
A deleted URL stays in the crawldb with a status of db_unfetched and never
moves to the 404 (db_gone) status.  This means that I can't run solrclean
to get rid of the old file: URLs.

I poked around in the protocol-file code and made some changes to the
ProtocolOutput class to force a 404 if a file URL has been deleted.  That
didn't seem to make a difference when the URL was fetched, however.

Any ideas on how to get rid of deleted file: URLs?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Deleting-file-urls-from-crawldb-that-give-404-status-tp3990391.html
Sent from the Nutch - User mailing list archive at Nabble.com.
