Hi Iain,

which Nutch version are you using?

> However, some of these sites will return a 302 or 304
> redirect to a custom "not found" page that does not match the patterns
> permitted by the regex-urlfilter. I then need to remove these documents
> from Solr. What is the best way to do this?

In 1.8, the index job with the option -deleteGone should also delete
redirects from the index.

> 2. Modify SolrClean to also delete documents with a status of
> db_redir_temp? This would seem to be of general utility and worth adding as
> an option to Nutch.

It would be a good idea to make clean behave the same as index -deleteGone.
Feel free to open an issue for that.

Thanks,
Sebastian

On 04/26/2014 02:15 PM, Iain Lopata wrote:
> I am performing a crawl of a subset of pages from about 200 sites. The
> subset of pages that are of interest are controlled by regex-urlfilter and a
> custom filter. The pages are indexed into Solr.
>
> When a page is removed from some of these sites a re-crawl will return a 404
> and then SolrClean or SolrIndex with -deleteGone will remove the document
> from Solr. Perfect. However, some of these sites will return a 302 or 304
> redirect to a custom "not found" page that does not match the patterns
> permitted by the regex-urlfilter. I then need to remove these documents
> from Solr. What is the best way to do this?
>
> 1. Use some option or configuration parameter that I have not yet
> discovered?
>
> 2. Modify SolrClean to also delete documents with a status of
> db_redir_temp? This would seem to be of general utility and worth adding as
> an option to Nutch.
>
> 3. Modify my custom filter to force a db_gone status on a document if
> it is outside the bounds that are of interest (yuck! - I hope this isn't the
> answer)
>
> 4. Others?
>
> Thanks,
>
> Iain

