Hi Iain,

Which Nutch version are you using?

> However, some of these sites will return a 302 or 304
> redirect to a custom "not found" page that does not match the patterns
> permitted by the regex-urlfilter.  I then need to remove these documents
> from Solr.  What is the best way to do this?

In 1.8, the index job with the -deleteGone option
should also delete redirects from the index.

> 2. Modify SolrClean to also delete documents with a status of
> db_redir_temp?  This would seem to be of general utility and worth adding as
> an option to Nutch.

It would be a good idea to make clean behave the same as index -deleteGone.
Feel free to open an issue for that.
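
To make the intended behaviour concrete, here is a minimal sketch of the
selection rule discussed above: delete a document when its CrawlDb status is
db_gone, and optionally also when it is a redirect. This is not Nutch code.
The status names follow Nutch's CrawlDb naming, but the function
urls_to_delete, the delete_redirects flag, and the dict-based "crawldb" are
hypothetical and only illustrate the idea.

```python
# Hypothetical illustration only: Nutch itself is Java and reads the CrawlDb
# from Hadoop files. Here a plain dict maps URL -> CrawlDb status name.

GONE_STATUSES = {"db_gone"}
REDIRECT_STATUSES = {"db_redir_temp", "db_redir_perm"}


def urls_to_delete(crawldb, delete_redirects=True):
    """Return the sorted URLs whose documents should be removed from the index.

    crawldb: dict mapping URL to a status name such as "db_fetched".
    delete_redirects: assumed option; when True, redirects are treated
    like gone pages, when False only db_gone entries are selected.
    """
    statuses = set(GONE_STATUSES)
    if delete_redirects:
        statuses |= REDIRECT_STATUSES
    return sorted(url for url, status in crawldb.items() if status in statuses)


if __name__ == "__main__":
    sample = {
        "http://example.com/ok": "db_fetched",
        "http://example.com/removed": "db_gone",
        "http://example.com/moved": "db_redir_temp",
    }
    for url in urls_to_delete(sample):
        print("delete", url)
```

With delete_redirects=False the sketch selects only db_gone entries, which
is how you describe the current clean job behaving.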


Thanks,
Sebastian


On 04/26/2014 02:15 PM, Iain Lopata wrote:
> I am performing a crawl of a subset of pages from about 200 sites.  The
> subset of pages that are of interest are controlled by regex-urlfilter and a
> custom filter.  The pages are indexed into Solr.
>
> When a page is removed from some of these sites a re-crawl will return a 404
> and then SolrClean or SolrIndex with -deleteGone will remove the document
> from Solr.  Perfect.  However, some of these sites will return a 302 or 304
> redirect to a custom "not found" page that does not match the patterns
> permitted by the regex-urlfilter.  I then need to remove these documents
> from Solr.  What is the best way to do this?
>
> 1. Use some option or configuration parameter that I have not yet
>    discovered?
>
> 2. Modify SolrClean to also delete documents with a status of
>    db_redir_temp?  This would seem to be of general utility and worth
>    adding as an option to Nutch.
>
> 3. Modify my custom filter to force a db_gone status on a document if
>    it is outside the bounds that are of interest (yuck! - I hope this
>    isn't the answer)
>
> 4. Others?
>
> Thanks,
>
> Iain
