Re: OOM spreads to other replica's/HA when OOM

Emir Arnautović Tue, 19 Dec 2017 03:14:14 -0800

Hi Susheel,
If a single query can cause a node to fail, and if retries cause the replicas
to be affected (still to be confirmed), then preventing retry logic on the Solr
side can only partially solve the issue: retry logic can also exist on the
client side, and it will result in the replicas' OOM. Again, I am not sure
whether Solr retries (SolrJ does, and I would expect the same code base to be
used within Solr as well) or under what conditions, but maybe using a shorter
timeAllowed would help you in some cases. Also, maybe using preferLocalShards
would result in the aggregating node being the one to OOM, but that could still
result in a client retry.
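
To make the timeAllowed / preferLocalShards suggestion concrete, here is a
minimal sketch of how a client might attach both parameters to every query.
The function name and the 2000 ms default are my assumptions for illustration,
not recommended values; only the two Solr parameter names come from the
discussion above. It builds the query string only and does not send anything.

```python
from urllib.parse import urlencode


def build_select_params(query, time_allowed_ms=2000, prefer_local_shards=True):
    """Return a URL-encoded query string with a time budget applied.

    time_allowed_ms and its default are hypothetical; tune them per workload.
    """
    if time_allowed_ms <= 0:
        raise ValueError("time_allowed_ms must be positive")
    params = {
        "q": query,
        # Ask Solr to stop searching after this many milliseconds.
        "timeAllowed": time_allowed_ms,
        # Prefer replicas on the node that receives the request.
        "preferLocalShards": "true" if prefer_local_shards else "false",
    }
    return urlencode(params)
```

Note that timeAllowed only bounds how long a search may run; it is not a memory
limit, so it helps in some cases rather than all of them.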


I agree with Shawn that the only true solution is to protect Solr from OOM,
e.g. by capping the maximum start and rows values.
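
As a rough illustration of capping start and rows, the sketch below clamps
both values in a thin layer in front of Solr before the request is forwarded.
The limits (10,000 and 500) and the function name are assumptions chosen for
the example; pick values that match your heap size and use case.

```python
MAX_START = 10_000  # hypothetical ceiling for the paging offset
MAX_ROWS = 500      # hypothetical ceiling for the page size


def clamp_paging(params, max_start=MAX_START, max_rows=MAX_ROWS):
    """Return a copy of params with start and rows forced into safe bounds."""
    safe = dict(params)
    # Solr defaults to start=0 and rows=10 when the parameters are absent.
    start = int(safe.get("start", 0))
    rows = int(safe.get("rows", 10))
    safe["start"] = min(max(start, 0), max_start)
    safe["rows"] = min(max(rows, 0), max_rows)
    return safe
```

For example, a request asking for rows=1000000 would be forwarded with
rows=500, so a single careless or malicious query cannot ask a node to build a
huge result set.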

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Dec 2017, at 21:56, Susheel Kumar <susheel2...@gmail.com> wrote:
> 
> Technically I agree with you, Shawn, on fixing the cause of the OOME. In fact
> it is not an issue any more, but I was testing HA while planning for any
> failures. At the same time, it's hard to convince business folks that HA
> wouldn't be there in case of OOME.
> 
> I think the best option is to enable timeAllowed for now.
> 
> Thanks,
> Susheel
> 
> On Mon, Dec 18, 2017 at 11:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 12/18/2017 9:01 AM, Susheel Kumar wrote:
>>> Any thoughts on how one can provide HA in these situations?
>> 
>> As I have said already a couple of times today on other threads, there
>> are *exactly* two ways to deal with OOME.  No other solution is possible.
>> 
>> 1) Configure the system to allow the process to access more of the
>> resource that it's running out of.  This is typically the solution that
>> people will utilize.  In your case, you would need to make the heap larger.
>> 
>> 2) Change the configuration or the environment so fewer resources are
>> required.
>> 
>> OOME is special.  It is a problem that all the high availability steps
>> in the world cannot protect you from, for precisely the reasons that
>> Emir and I have described.  You must ensure that Solr is set up so there
>> are enough resources that OOME cannot occur.
>> 
>> I can see a general argument for making it possible to configure or
>> disable any retry mechanism in SolrCloud, but that is not the solution
>> here.  It would most likely only *delay* the problem to a later query.
>> The OOME itself must be fixed, using one of the two solutions already
>> outlined.
>> 
>> Thanks,
>> Shawn
>> 
>>
