I am using 1.6.  I will open an issue.  Thanks

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]] 
Sent: Saturday, April 26, 2014 5:24 PM
To: [email protected]
Subject: Re: SolrClean for db_redir_temp?

Hi Iain,

Which Nutch version are you using?

> However, some of these sites will return a 302 or 304 redirect to a 
> custom "not found" page that does not match the patterns permitted by 
> the regex-urlfilter.  I then need to remove these documents from Solr.  
> What is the best way to do this?

In 1.8, the index job with the option -deleteGone should also delete
redirects from the index.

> 2.       Modify SolrClean to also delete documents with a status of
> db_redir_temp?  This would seem to be of general utility and worth 
> adding as an option to Nutch.

It would be a good idea to make clean behave the same as index -deleteGone.
Feel free to open an issue for that.


Thanks,
Sebastian


On 04/26/2014 02:15 PM, Iain Lopata wrote:
> I am performing a crawl of a subset of pages from about 200 sites.  
> The subset of pages that are of interest are controlled by 
> regex-urlfilter and a custom filter.  The pages are indexed into Solr.
> 
>  
> 
> When a page is removed from some of these sites a re-crawl will return 
> a 404 and then SolrClean or SolrIndex with -deleteGone will remove the 
> document from Solr.  Perfect.  However, some of these sites will 
> return a 302 or 304 redirect to a custom "not found" page that does 
> not match the patterns permitted by the regex-urlfilter.  I then need 
> to remove these documents from Solr.  What is the best way to do this?
> 
>  
> 
> 1.       Use some option or configuration parameter that I have not yet
> discovered?
> 
> 2.       Modify SolrClean to also delete documents with a status of
> db_redir_temp?  This would seem to be of general utility and worth 
> adding as an option to Nutch.
> 
> 3.       Modify my custom filter to force a db_gone status on a document if
> it is outside the bounds that are of interest (yuck! - I hope this 
> isn't the answer)
> 
> 4.       Others?
> 
>  
> 
> Thanks,
> 
>  
> 
> Iain
> 
> 

