Re: OOM only on one datacenter nodes

2020-04-06 Thread Reid Pinchback
CentOS 6.10 is a bit aged as a production server O/S platform, and I recall 
some odd-ball interactions with hardware variations, particularly around 
high-priority memory and network cards.  How good is your O/S-level metric 
monitoring?  It isn't beyond the realm of possibility that your memory issues 
are outside of the JVM.  It isn't easy to tell you specifically what to look 
for, but I would begin with metrics around memory and swap.  If you don't see 
consistently high memory use outside of the JVM, that saves wasting time 
chasing down details that are unlikely to matter.  You need to be used to 
seeing what those metrics normally look like, though, so you aren't chasing 
phantoms.
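As a sketch of that baseline check (the commands are illustrative, not prescriptive), a quick look at memory and swap pressure outside the JVM on a CentOS node might be:

```shell
# Snapshot system memory and swap from /proc/meminfo (values are in KiB).
# Heavy swap use alongside a healthy JVM heap points at pressure outside
# the JVM.  (MemAvailable needs kernel >= 3.14, so on CentOS 6 we stick to
# MemFree, which understates what is reclaimable but is always present.)
mem_total=$(awk '/^MemTotal:/   {print $2}' /proc/meminfo)
mem_free=$(awk '/^MemFree:/     {print $2}' /proc/meminfo)
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/   {print $2}' /proc/meminfo)
echo "mem  total/free KiB: $mem_total / $mem_free"
echo "swap total/free KiB: $swap_total / $swap_free"
```

Capturing these on an interval and graphing them is one way to build the "normal" picture described above, so anomalies stand out.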

I second Jeff’s feedback.  You need the information you need, and it seems 
counterproductive not to configure these nodes so they can give it to you.  A 
fundamental value of C* is the ability to bring nodes up and down without 
risking availability.  When your existing technology approach is part of why 
you can’t gather the data you need, it helps to give yourself permission to 
improve what you have so you don’t stay in that situation.



Re: OOM only on one datacenter nodes

2020-04-06 Thread Jeff Jirsa
Nobody is going to do any better than guessing without a heap histogram.

I’ve got pretty good intuition with Cassandra in real prod environments and can 
think of 8-9 different possible causes, but none of them really stands out 
as likely enough to describe in detail (maybe the memtable deadlock on 
flush, or maybe repair coordination in that DC), but we really need a heap.

If it’s causing you enough pain to email a list, it seems worth making 
the changes you need to make to get a heap and debug properly.




Re: OOM only on one datacenter nodes

2020-04-05 Thread Surbhi Gupta
We are using the JRE and not the JDK, hence we are not able to take a heap dump.



Re: OOM only on one datacenter nodes

2020-04-05 Thread Jeff Jirsa

Set the JVM flags to heap dump on OOM.

Open up the result in a heap inspector of your preference (like YourKit or 
similar).

Find a view that counts objects by total retained size. Take a screenshot. Send 
that.
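For reference, the first step usually means something like the following in cassandra-env.sh (a sketch: the dump path is illustrative, and these are standard HotSpot flags that work with a JRE as well as a JDK):

```shell
# Append to cassandra-env.sh: dump the heap automatically on OutOfMemoryError.
# Point HeapDumpPath at a disk with room for a heap-sized .hprof file.
JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/var/lib/cassandra/dumps"
```

The dump is written by the JVM itself at OOM time, so no JDK tooling (jmap etc.) is needed on the box to produce it.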





Re: OOM only on one datacenter nodes

2020-04-05 Thread Surbhi Gupta
I just checked: we have set the heap size to 31GB, not 32GB, in DC2.

I checked, and the CPU and RAM are the same on all the nodes in DC1 and DC2.
What specific parameters should I check on the OS?
We are using CentOS release 6.10.

Currently disk_access_mode is not set, hence it is auto in our env. Would
setting disk_access_mode to mmap_index_only help?

Thanks
Surbhi



Re: OOM only on one datacenter nodes

2020-04-05 Thread Alex Ott
Have you set -Xmx32g? If so, you may get significantly less usable memory
because of the switch to uncompressed 64-bit object references.  See
http://java-performance.info/over-32g-heap-java/ for details, and set the
heap slightly below 32GB.
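In jvm.options terms that means pinning the heap just under the threshold (a sketch; Cassandra 3.11 reads conf/jvm.options, and the exact compressed-oops cutoff varies slightly between JVM builds):

```
## conf/jvm.options -- stay below the ~32GB compressed-oops cutoff
-Xms31g
-Xmx31g
```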




-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: OOM only on one datacenter nodes

2020-04-04 Thread Reid Pinchback
Surbi:

If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That still 
wouldn’t explain DC2 nodes going down, but would at least explain them doing 
more work than might be on your radar right now.

The hint replay being slow to me sounds like you could be fighting GC.

You mentioned bumping the DC2 nodes to 32gb.  You might have already been doing 
this, but if not, be sure to be under 32gb, like 31gb.  Otherwise you’re using 
larger object pointers and could actually have less effective ability to 
allocate memory.

As the problem is only happening in DC2, then there has to be a thing that is 
true in DC2 that isn’t true in DC1.  A difference in hardware, a difference in 
O/S version, a difference in networking config or physical infrastructure, a 
difference in client-triggered activity, or a difference in how repairs are 
handled. Somewhere, there is a difference.  I’d start with focusing on that.





Re: OOM only on one datacenter nodes

2020-04-04 Thread Erick Ramirez
Without a heap dump to analyse, my hypothesis is that your DC2
nodes are taking on traffic (from some client somewhere) but you're just
not aware of it. The hints replay is just a side-effect of the nodes
getting overloaded.

To rule out my hypothesis in the first instance, my recommendation is to
monitor the incoming connections to the nodes in DC2. If you don't have
monitoring in place, you could simply run netstat at regular intervals and
go from there. Cheers!
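A minimal version of that check might look like this (a sketch: it assumes the native transport is on its default port 9042, and falls back to ss on boxes without net-tools):

```shell
# Count established client connections to the native transport port, then
# print a timestamped line; run at intervals (e.g. from cron) on a DC2 node.
# A persistently non-zero count means some client really is talking to this DC.
port=9042
count=$( { netstat -tn 2>/dev/null || ss -tn 2>/dev/null; } \
         | grep -c ":${port}.*ESTAB" )
echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') established:${count:-0}"
```

Comparing the series from DC1 and DC2 nodes shows quickly whether DC2 is as idle as assumed.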

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax
have answers! Share your expertise on https://community.datastax.com/.


OOM only on one datacenter nodes

2020-04-04 Thread Surbhi Gupta
Hi,

We have two datacenters with 5 nodes each and a replication factor of 3.
We have traffic on DC1; DC2 is just for disaster recovery and takes no
direct traffic.
We are using machines with 24 CPUs and 128GB RAM.
For DC1, where we have live traffic, we don't see any issue; however, on
DC2, where we don't have live traffic, we see lots of OOM (out of memory)
errors and nodes go down (only DC2 nodes).

We were using a 16GB heap with G1GC in both DC1 and DC2.
As DC2 nodes were OOMing, we increased the heap from 16GB to 24GB and then
to 32GB, but DC2 nodes still go down with OOM, though obviously not as
frequently as when the heap was 16GB.
DC1 nodes are still on a 16GB heap and none of them goes down.

We are on open-source Cassandra 3.11.0.
We are using materialized views.
We see lots of hints pending on DC2 nodes, and hint replay is very slow on
DC2 nodes compared to DC1 nodes.

Other than the heap sizes mentioned above, all the configs are the same on
all nodes in the cluster.
We are using the JRE and can't collect a heap dump.

Any idea what the cause could be?

Currently disk_access_mode is not set, hence it is auto in our env. Would
setting disk_access_mode to mmap_index_only help?
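For context, the option in question would look like this in cassandra.yaml (a sketch; disk_access_mode is a long-standing but undocumented setting, so any change is worth testing rather than assuming):

```yaml
# cassandra.yaml -- 'auto' (the default when unset) mmaps both data and index
# files on 64-bit JVMs; mmap_index_only restricts mmap to index files, which
# reduces the address space and page cache the process touches.
disk_access_mode: mmap_index_only
```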

My question is: "*Why do DC2 nodes OOM while DC1 nodes don't?*"

Thanks
Surbhi