I am using 1.6. I will open an issue.

Thanks

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Saturday, April 26, 2014 5:24 PM
To: [email protected]
Subject: Re: SolrClean for db_redir_temp?

Hi Iain,

which Nutch version is used?

> However, some of these sites will return a 302 or 304 redirect to a
> custom "not found" page that does not match the patterns permitted by
> the regex-urlfilter. I then need to remove these documents from Solr.
> What is the best way to do this?

In 1.8 the index job with option -deleteGone should also delete
redirects from the index.

> 2. Modify SolrClean to also delete documents with a status of
> db_redir_temp? This would seem to be of general utility and worth
> adding as an option to Nutch.

It would be a good idea to make clean behave the same as index -deleteGone.
Feel free to open an issue for that.

Thanks,
Sebastian

On 04/26/2014 02:15 PM, Iain Lopata wrote:
> I am performing a crawl of a subset of pages from about 200 sites.
> The subset of pages that are of interest are controlled by
> regex-urlfilter and a custom filter. The pages are indexed into Solr.
>
> When a page is removed from some of these sites a re-crawl will return
> a 404 and then SolrClean or SolrIndex with -deleteGone will remove the
> document from Solr. Perfect. However, some of these sites will
> return a 302 or 304 redirect to a custom "not found" page that does
> not match the patterns permitted by the regex-urlfilter. I then need
> to remove these documents from Solr. What is the best way to do this?
>
> 1. Use some option or configuration parameter that I have not yet
> discovered?
>
> 2. Modify SolrClean to also delete documents with a status of
> db_redir_temp? This would seem to be of general utility and worth
> adding as an option to Nutch.
>
> 3. Modify my custom filter to force a db_gone status on a document if
> it is outside the bounds that are of interest (yuck! - I hope this
> isn't the answer)
>
> 4. Others?
>
> Thanks,
>
> Iain
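
For anyone on an older version who wants a stopgap outside Nutch, here is a rough sketch of the idea behind option 2: collect the URLs whose CrawlDb status is db_gone or db_redir_temp and build a Solr delete-by-id request body from them. This is not Nutch code. The (url, status) pairs, the function name build_delete_payload, and the assumption that the Solr document id equals the page URL are all illustrative; how the pairs are obtained from the CrawlDb is left open. The JSON body would be POSTed to the Solr update handler.

```python
import json

# Statuses to purge from the index. Treating db_redir_temp like db_gone
# is the assumption discussed in this thread, not existing Nutch behaviour.
STALE_STATUSES = {"db_gone", "db_redir_temp"}


def build_delete_payload(records, stale_statuses=STALE_STATUSES):
    """Return a Solr JSON delete-by-id body for records with a stale status.

    records: iterable of (url, status) pairs (hypothetical input format).
    Assumes the Solr document id is the page URL.
    """
    ids = [url for url, status in records if status in stale_statuses]
    return json.dumps({"delete": ids})


# Hypothetical sample data for illustration only.
records = [
    ("http://example.com/a", "db_fetched"),
    ("http://example.com/b", "db_redir_temp"),
    ("http://example.com/c", "db_gone"),
]
payload = build_delete_payload(records)
print(payload)
```

Only the redirected and gone URLs end up in the delete list; fetched pages are left alone. A commit is still needed afterwards for the deletions to become visible.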

