On Mon, 2017-12-18 at 15:56 -0500, Susheel Kumar wrote:
> Technically I agree Shawn with you on fixing OOME cause, Infact it is
> not an issue any more but I was testing for HA when planing for any
> failures.
> Same time it's hard to convince Business folks that HA wouldn't be
> there in case of OOME.

Let's say we change Solr so that it does not re-issue queries that caused nodes to fail. Unfortunately that does not solve your problem, as the user will do what users do on an internal server error: press reload.

So for a mechanism to work, it would require the Solr cloud to maintain a blacklist of queries that cause nodes to fail. But if it is paging related, the user might try pressing "next" instead, and then the query will be different from the previous one but still cause an OOM.

So maybe a mechanism for detecting multiple OOM-triggering queries from the same user and then blacklisting the user? But what if the query is a link shared on a forum? And so forth.

Hardening by blacklisting is a game that is hard to win. So to paraphrase Shawn: make sure your users cannot issue OOMing queries.

- Toke Eskildsen, Royal Danish Library - Aarhus
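
PS: One possible reading of "make sure your users cannot issue OOMing queries" is to cap the paging parameters in the application layer before a request ever reaches Solr. The following is only a minimal sketch under stated assumptions: `rows` and `start` are Solr's standard paging parameters, but the function name `sanitize_params` and the limits `MAX_ROWS` and `MAX_START` are hypothetical and would have to be tuned to the actual heap size and index.

```python
# Hypothetical limits; the right values depend on heap size and index.
MAX_ROWS = 100       # assumed cap on page size
MAX_START = 10_000   # assumed cap on paging depth


def sanitize_params(params):
    """Return a copy of the query parameters that is safe to forward.

    Clamps `rows` to MAX_ROWS and rejects requests that page deeper
    than MAX_START, instead of letting them reach Solr.
    """
    safe = dict(params)
    rows = int(safe.get("rows", 10))   # Solr's default page size is 10
    start = int(safe.get("start", 0))
    if rows < 0 or start < 0:
        raise ValueError("rows and start must be non-negative")
    if start > MAX_START:
        raise ValueError("paging depth exceeds the configured limit")
    safe["rows"] = min(rows, MAX_ROWS)
    safe["start"] = start
    return safe
```

With this in front of Solr, pressing reload or "next" on an oversized request gets the same clamped or rejected result each time, so no blacklist is needed for that class of query. It does not cover other memory-hungry features such as large facets or grouping.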