Ah, soft 404s. This is very ugly indeed, and you cannot get rid of them easily.
But just index with -deleteGone and they should be removed from Solr. You should
use db.update.purge.404 only during maintenance, not before indexing.
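
If you still want the maintenance script you describe below, here is a minimal
sketch of it in Python. It rests on assumptions you should verify first: that a
text dump of the crawldb (for example from readdb with -dump) starts each record
with "<url><TAB>Version: N" followed by a "Status: <code> (<name>)" line, and
that your Solr documents use the URL as their id. The function names
redirected_urls and solr_delete_payload are hypothetical, made up for this
example. It only builds the JSON delete body and does not contact Solr.

```python
import json
import re
from urllib.parse import urlsplit

# Assumed dump layout (check against your own dump before relying on it):
#   http://host/path<TAB>Version: 7
#   Status: 5 (db_redir_perm)
RECORD_START = re.compile(r"^(\S+)\tVersion: \d+$")
STATUS_LINE = re.compile(r"^Status: \d+ \((\w+)\)$")


def redirected_urls(dump_text, hosts, status="db_redir_perm"):
    """Return URLs on the given hosts whose crawldb status equals `status`."""
    wanted_hosts = {h.lower() for h in hosts}
    urls = []
    current = None  # URL of the record currently being read
    for raw in dump_text.splitlines():
        line = raw.rstrip()
        start = RECORD_START.match(line)
        if start:
            current = start.group(1)
            continue
        found = STATUS_LINE.match(line)
        if found and current is not None:
            host = (urlsplit(current).hostname or "").lower()
            if found.group(1) == status and host in wanted_hosts:
                urls.append(current)
            current = None  # only the first status line of a record counts
    return urls


def solr_delete_payload(urls):
    """Build a JSON delete-by-id body, assuming the Solr id is the URL."""
    return json.dumps({"delete": list(urls)})
```

Limiting the match to hosts you know serve soft 404s narrows the risk of
dropping legitimate redirects, but does not remove it. It is worth reviewing
the URL list before sending the payload to your Solr update handler.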
-----Original message-----
> From:Arthur Yarwood <[email protected]>
> Sent: Saturday 5th March 2016 23:33
> To: [email protected]
> Subject: Best tactic: Sites reporting a redirect instead of 404 gone.
>
> I've noticed a number of sites I'm crawling and indexing, which happen
> to have fairly transient content I wish to index (lifespan of ~few
> weeks), are reporting a 301 permanent redirect, rather than a 404. The
> redirect just goes to a generic "content no longer here" page to be more
> helpful to normal web users. Not ideal at all, and not within my control
> at all.
>
> What tactics and strategies can help mitigate this scenario?
> In particular:
> 1) Removing these URLs from crawl DB (as they would if 404s and
> db.update.purge.404 = true).
> 2) Removing these from my Solr DB I'm indexing into.
>
> I'm leaning towards the idea of writing an additional maintenance script
> that manually queries the crawldb for db_redir_perm status on urls from
> given hosts and manually removing these from Solr. I just fear it may be
> overzealous in removing content from the index, in cases of a
> legitimate redirect...
>
> Thanks!
>
> --
> Arthur Yarwood
>
>