We are using a JRE, not the full JDK, hence we are not able to take a heap dump.

On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa <jji...@gmail.com> wrote:
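[Editor's note: the heap-dump-on-OOM flags suggested below are implemented by the HotSpot JVM itself, so a plain JRE without the JDK tools (jmap/jcmd) can still produce the dump. A minimal sketch, assuming the flags are added via the stock cassandra-env.sh and that the dump path is an assumption with enough free disk:]

```shell
# Sketch: enable automatic heap dumps on OOM (works on a JRE; no JDK tools needed).
# File location and JVM_OPTS variable assume the stock cassandra-env.sh;
# adjust to your install.
JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
# Assumed path -- pick a filesystem with at least heap-size free space (31GB+ here).
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/var/lib/cassandra/heapdump"
```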
> Set the JVM flags to heap dump on OOM.
>
> Open up the result in a heap inspector of your preference (like YourKit
> or similar). Find a view that counts objects by total retained size.
> Take a screenshot. Send that.
>
> On Apr 5, 2020, at 6:51 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
>
>> I just checked: we have set the heap size to 31GB, not 32GB, in DC2.
>>
>> I checked the CPU and RAM; both are the same on all the nodes in DC1
>> and DC2. What specific parameter should I check on the OS? We are
>> using CentOS release 6.10.
>>
>> Currently disk_access_mode is not set, hence it is "auto" in our
>> environment. Would setting disk_access_mode to mmap_index_only help?
>>
>> Thanks
>> Surbhi
>>
>> On Sun, 5 Apr 2020 at 01:31, Alex Ott <alex...@gmail.com> wrote:
>>
>>> Have you set -Xmx32g? In that case you may get significantly less
>>> available memory because of the switch to 64-bit references. See
>>> http://java-performance.info/over-32g-heap-java/ for details, and set
>>> slightly less than 32GB.
>>>
>>> Reid Pinchback at "Sun, 5 Apr 2020 00:50:43 +0000" wrote:
>>>
>>> RP> Surbhi:
>>>
>>> RP> If you aren't seeing connection activity in DC2, I'd check whether
>>> RP> the operations hitting DC1 are QUORUM ops instead of LOCAL_QUORUM.
>>> RP> That still wouldn't explain DC2 nodes going down, but it would at
>>> RP> least explain them doing more work than might be on your radar
>>> RP> right now.
>>>
>>> RP> The hint replay being slow sounds to me like you could be fighting
>>> RP> GC.
>>>
>>> RP> You mentioned bumping the DC2 nodes to 32GB. You might have
>>> RP> already been doing this, but if not, be sure to stay under 32GB,
>>> RP> e.g. 31GB. Otherwise you're using larger object pointers and could
>>> RP> actually have less effective ability to allocate memory.
>>>
>>> RP> As the problem is only happening in DC2, there has to be something
>>> RP> that is true in DC2 that isn't true in DC1.
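[Editor's note on the 31GB-vs-32GB point above: HotSpot can only use 32-bit compressed object pointers ("compressed oops") while the maximum heap stays under roughly 32GB, so -Xmx32g can leave you less usable memory than -Xmx31g. A quick check, sketched under the assumption that a HotSpot `java` is on the PATH:]

```shell
# Print whether compressed oops are in effect at the given max heap size.
# At -Xmx31g HotSpot typically reports "true"; at -Xmx32g it reports "false".
java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null \
  | awk '/UseCompressedOops/ {print $4}' | head -n1
```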
>>> RP> A difference in hardware, a difference in OS version, a
>>> RP> difference in networking config or physical infrastructure, a
>>> RP> difference in client-triggered activity, or a difference in how
>>> RP> repairs are handled. Somewhere, there is a difference. I'd start
>>> RP> by focusing on that.
>>>
>>> RP> From: Erick Ramirez <erick.rami...@datastax.com>
>>> RP> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> RP> Date: Saturday, April 4, 2020 at 8:28 PM
>>> RP> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> RP> Subject: Re: OOM only on one datacenter nodes
>>>
>>> RP> With no heap dump for you to analyse, my hypothesis is that your
>>> RP> DC2 nodes are taking on traffic (from some client somewhere) but
>>> RP> you're just not aware of it. The hints replay is just a
>>> RP> side-effect of the nodes getting overloaded.
>>>
>>> RP> To rule out my hypothesis in the first instance, my recommendation
>>> RP> is to monitor the incoming connections to the nodes in DC2. If you
>>> RP> don't have monitoring in place, you could simply run netstat at
>>> RP> regular intervals and go from there. Cheers!
>>>
>>> --
>>> With best wishes,        Alex Ott
>>> Principal Architect, DataStax
>>> http://datastax.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
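[Editor's note: the netstat suggestion above can be sketched as a small sampling helper. This is hypothetical glue code, assuming the default native transport port 9042 and `netstat -tn`-style output; adjust the port to your cluster's setting:]

```shell
# Count ESTABLISHED client connections to the native transport port
# (9042 assumed) from netstat -tn output supplied on stdin.
count_cql_connections() {
  awk '$6 == "ESTABLISHED" && $4 ~ /:9042$/ { n++ } END { print n + 0 }'
}

# Sample at regular intervals, e.g.:
#   while true; do date; netstat -tn | count_cql_connections; sleep 60; done
```

Comparing the counts across DC1 and DC2 nodes over time would confirm or rule out the hidden-client hypothesis.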