I am crawling a subset of pages from about 200 sites.  The subset of pages
of interest is controlled by regex-urlfilter and a custom filter.  The pages
are indexed into Solr.

 

When a page is removed from one of these sites, a re-crawl returns a 404,
and SolrClean or SolrIndex with -deleteGone then removes the document from
Solr.  Perfect.  However, some of these sites instead return a 302 or 304
redirect to a custom "not found" page that does not match the patterns
permitted by the regex-urlfilter.  The original documents therefore stay in
Solr, and I need to remove them.  What is the best way to do this?

 

1. Use some option or configuration parameter that I have not yet
   discovered?

2. Modify SolrClean to also delete documents with a status of
   db_redir_temp?  This would seem to be of general utility and worth
   adding as an option to Nutch.

3. Modify my custom filter to force a db_gone status on a document if it
   is outside the bounds that are of interest (yuck! - I hope this isn't
   the answer)

4. Others?
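
To make option 2 concrete, here is a rough sketch of the rule I have in
mind.  It is standalone Java, not actual Nutch or SolrClean code: the class
DeletionRule, its methods, and the example URL pattern are all hypothetical,
and only the status names db_gone and db_redir_temp come from Nutch.  It is
also a slightly narrower variant of option 2, since it deletes a
db_redir_temp document only when the redirect target falls outside the
permitted patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Nutch: decides whether a document that
// is already in Solr should be deleted after a re-crawl.
final class DeletionRule {
    // Patterns for URLs of interest (assumed to mirror regex-urlfilter).
    private final List<Pattern> allowed = new ArrayList<>();

    DeletionRule(List<String> allowedRegexes) {
        for (String regex : allowedRegexes) {
            allowed.add(Pattern.compile(regex));
        }
    }

    // True if the URL matches at least one permitted pattern.
    boolean isAllowed(String url) {
        for (Pattern p : allowed) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    // status: crawl status name as a string, e.g. "db_gone".
    // redirectTarget: where the page redirected to, or null if unknown.
    boolean shouldDelete(String status, String redirectTarget) {
        if ("db_gone".equals(status)) {
            return true; // what -deleteGone already handles
        }
        if ("db_redir_temp".equals(status)) {
            // Delete only when we know the target is outside the
            // permitted patterns; keep the document if the target is unknown.
            return redirectTarget != null && !isAllowed(redirectTarget);
        }
        return false;
    }
}
```

With a single permitted pattern such as "^https?://example\\.com/products/",
a db_redir_temp document redirecting to http://example.com/notfound.html
would be deleted, while one redirecting to another product page would be
kept.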

 

Thanks,

Iain
