I am having an issue removing deleted file: URLs on subsequent crawls. They stay at a status of db_unfetched and never seem to move to the 404 (db_gone) status. This means that I can't run solrclean to get rid of the old file: URLs.
I poked around in the protocol-file code and made some changes to the ProtocolOutput class to force a 404 if a file: URL has been deleted. It didn't seem to make a difference when the URL was fetched, however. Any ideas on how to get rid of deleted file: URLs?