I am performing a crawl of a subset of pages from about 200 sites. The subset of pages of interest is controlled by regex-urlfilter and a custom filter. The pages are indexed into Solr.
When a page is removed from some of these sites a re-crawl will return a 404 and then SolrClean or SolrIndex with -deleteGone will remove the document from Solr. Perfect.

However, some of these sites will return a 302 or 304 redirect to a custom "not found" page that does not match the patterns permitted by the regex-urlfilter. I then need to remove these documents from Solr.

What is the best way to do this?

1. Use some option or configuration parameter that I have not yet discovered?
2. Modify SolrClean to also delete documents with a status of db_redir_temp? This would seem to be of general utility and worth adding as an option to Nutch.
3. Modify my custom filter to force a db_gone status on a document if it is outside the bounds that are of interest (yuck! - I hope this isn't the answer)
4. Others?

Thanks,
Iain

