Re: OOM spreads to other replica's/HA when OOM

Toke Eskildsen Tue, 19 Dec 2017 07:39:24 -0800

On Mon, 2017-12-18 at 15:56 -0500, Susheel Kumar wrote:
> Technically I agree with you, Shawn, on fixing the OOME cause. In fact
> it is not an issue any more, but I was testing for HA when planning for
> any failures.
> At the same time, it's hard to convince business folks that HA wouldn't
> be there in case of an OOME.


Let's say we change Solr so that it does not re-issue queries that
caused nodes to fail. Unfortunately, that does not solve your problem, as
the user will do what users do on an internal server error: press
reload.

So for a mechanism to work, it would require the Solr cloud to maintain
a blacklist of queries that cause nodes to fail. But if the problem is
paging-related, the user might press "next" instead, and then the query
will differ from the previous one but still cause an OOM. So maybe
a mechanism for detecting multiple OOM-triggering queries from the same
user and then blacklisting the user? But what if the query is a link
shared on a forum? And so forth.
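
To make the paging problem concrete, here is a minimal sketch of an
exact-match query blacklist. Nothing like this exists in Solr as far as
this thread goes; the names blacklist, query_key and is_blocked are
hypothetical, and the request values are made up for illustration. It
only shows that a reload is caught while "next" slips through.

```python
from urllib.parse import urlencode

# Hypothetical blacklist keyed on the exact, normalized query string.
blacklist = set()


def query_key(params):
    """Normalize request parameters into a stable key (sorted by name)."""
    return urlencode(sorted(params.items()))


def is_blocked(params):
    """True if this exact request was previously blacklisted."""
    return query_key(params) in blacklist


# Assume this deep-paging request took a node down and was blacklisted.
failed = {"q": "*:*", "rows": 100, "start": 5000000}
blacklist.add(query_key(failed))

# Pressing "reload" repeats the same request, so the blacklist catches it.
reload_request = dict(failed)

# Pressing "next" changes start, so the key no longer matches.
next_request = dict(failed, start=failed["start"] + failed["rows"])
```

Here is_blocked(reload_request) is true, but is_blocked(next_request) is
false, even though the second request is just as expensive as the first.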

Hardening by blacklisting is a game that is hard to win. So to
paraphrase Shawn: Make sure your users cannot issue OOMing queries.
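
One way to read that advice is to validate requests in the application
layer before they reach Solr. The following is only a sketch under
assumptions: rows and start are the standard Solr paging parameters, but
the limits MAX_ROWS and MAX_START, the RejectedQuery exception and the
sanitize function are hypothetical and would need tuning for a real
index.

```python
MAX_ROWS = 1000      # assumed upper bound on page size
MAX_START = 100000   # assumed upper bound on paging depth


class RejectedQuery(ValueError):
    """Raised when a request falls outside the allowed paging window."""


def sanitize(params):
    """Return a copy of params that is safe to forward, or raise."""
    rows = int(params.get("rows", 10))
    start = int(params.get("start", 0))
    if rows < 0 or start < 0:
        raise RejectedQuery("rows and start must be non-negative")
    if start > MAX_START:
        raise RejectedQuery("paging too deep: start=%d" % start)
    safe = dict(params)
    safe["rows"] = min(rows, MAX_ROWS)  # clamp oversized pages
    safe["start"] = start
    return safe
```

With this in front of Solr, sanitize({"q": "*:*", "rows": 50000}) would
forward rows=1000, while a request with start=5000000 is rejected
before any node has to build the result.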

- Toke Eskildsen, Royal Danish Library - Aarhus
