Ah, soft 404s, this is very ugly indeed. You cannot get rid of them easily. 
But just index with -deleteGone and they should be removed from Solr. You should 
use db.update.purge.404 only during maintenance, not before indexing.
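
If you do go the maintenance-script route described in the quoted message, here is a minimal sketch of the filtering step, not a tested tool. It assumes you have a plain-text crawldb dump (for example from bin/nutch readdb with -dump) and that each record starts with the URL, a tab and "Version: N", followed by a line like "Status: 5 (db_redir_perm)". Check that layout against your own dump first. The function name permanent_redirects is made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumed layout of one record in a text crawldb dump:
#   http://example.com/page<TAB>Version: 7
#   Status: 5 (db_redir_perm)
RECORD_START = re.compile(r"^(\S+)\tVersion: \d+")
STATUS_LINE = re.compile(r"^Status: \d+ \((\w+)\)")


def permanent_redirects(dump_lines, hosts):
    """Return URLs on the given hosts whose status is db_redir_perm."""
    wanted_hosts = {h.lower() for h in hosts}
    current_url = None
    matches = []
    for line in dump_lines:
        start = RECORD_START.match(line)
        if start:
            # A new record begins; remember its URL until we see the status.
            current_url = start.group(1)
            continue
        status = STATUS_LINE.match(line)
        if status and current_url is not None:
            host = (urlsplit(current_url).hostname or "").lower()
            if status.group(1) == "db_redir_perm" and host in wanted_hosts:
                matches.append(current_url)
            current_url = None
    return matches
```

Restricting the match to an explicit host list limits the damage, but it does not tell a soft 404 apart from a legitimate redirect on the same host, so the concern about removing too much still applies.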
 
-----Original message-----
> From:Arthur Yarwood <[email protected]>
> Sent: Saturday 5th March 2016 23:33
> To: [email protected]
> Subject: Best tactic: Sites reporting a redirect instead of 404 gone.
> 
> I've noticed that a number of sites I'm crawling and indexing, which happen 
> to have fairly transient content I wish to index (lifespan of a few 
> weeks), are reporting a 301 permanent redirect rather than a 404. The 
> redirect just goes to a generic "content no longer here" page, to be more 
> helpful to normal web users. Not ideal at all, and not within my control 
> at all.
> 
> What tactics and strategies can help mitigate this scenario?
> In particular:
> 1) Removing these URLs from the crawl DB (as would happen if they were 404s 
> and db.update.purge.404 = true).
> 2) Removing them from the Solr DB I'm indexing into.
> 
> I'm leaning towards the idea of writing an additional maintenance script 
> that queries the crawldb for URLs with db_redir_perm status on 
> given hosts and removes them from Solr. I just fear it may be 
> overzealous in removing content from the index, in cases of a 
> legitimate redirect...
> 
> Thanks!
> 
> -- 
> Arthur Yarwood
> 
> 
