Nobody is going to do any better than guessing without a heap histogram

I’ve got pretty good intuition with Cassandra in real prod environments and can 
think of 8-9 different possible causes, but none of them stands out as likely 
enough to describe in detail (maybe the memtable deadlock on flush, or maybe 
repair coordination in that DC), but we really need a heap dump.

If it’s causing you enough pain to email the list, it seems worth making 
whatever changes you need to capture a heap dump and debug this properly.
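
For reference, a minimal sketch of the flags mentioned downthread (they work 
with a plain JRE, no JDK tools needed; the path is just a placeholder):

    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/var/lib/cassandra/heapdump

Depending on your Cassandra version these go in cassandra-env.sh (appended to 
JVM_OPTS) or jvm.options. Note the dump will be roughly the size of the heap, 
so the target directory needs ~31GB free.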


> On Apr 5, 2020, at 9:44 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
> 
> 
> We are using the JRE and not the JDK, hence we are not able to take a heap 
> dump.
> 
>> On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa <jji...@gmail.com> wrote:
>> 
>> Set the JVM flags to heap dump on OOM.
>> 
>> Open up the result in a heap inspector of your preference (YourKit or 
>> similar).
>> 
>> Find a view that counts objects by total retained size. Take a screenshot. 
>> Send that. 
>> 
>> 
>> 
>>>> On Apr 5, 2020, at 6:51 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
>>>> 
>>> 
>>> I just checked; we have set up the heap size to be 31GB, not 32GB, in DC2.
>>> 
>>> I checked; the CPU and RAM are the same on all the nodes in DC1 and DC2.
>>> What specific parameters should I check on the OS?
>>> We are using CentOS release 6.10.
>>> 
>>> Currently disk_access_mode is not set, hence it is auto in our env. Would 
>>> setting disk_access_mode to mmap_index_only help?
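>>> 
>>> (If we do try it, I assume it would just be this one line in cassandra.yaml,
>>> followed by a restart of the node:
>>> 
>>>     disk_access_mode: mmap_index_only
>>> 
>>> Please correct me if that is wrong.)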
>>> 
>>> Thanks
>>> Surbhi
>>> 
>>>> On Sun, 5 Apr 2020 at 01:31, Alex Ott <alex...@gmail.com> wrote:
>>>> Have you set -Xmx32g? In that case you may get significantly less
>>>> available memory because of the switch to 64-bit references. See
>>>> http://java-performance.info/over-32g-heap-java/ for details, and set it
>>>> slightly below 32GB.
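>>>> 
>>>> A quick way to confirm whether compressed oops are still in effect (the
>>>> output format varies a bit between JVMs):
>>>> 
>>>>     java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
>>>> 
>>>> At -Xmx31g this should report UseCompressedOops = true; at -Xmx32g it
>>>> typically flips to false.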
>>>> 
>>>> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +0000" wrote:
>>>>  RP> Surbi:
>>>> 
>>>>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if 
>>>> the operations hitting DC1 are quorum ops instead of local quorum.  That
>>>>  RP> still wouldn’t explain DC2 nodes going down, but would at least 
>>>> explain them doing more work than might be on your radar right now.
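>>>> 
>>>>  RP> For example, in cqlsh the first command below shows the session's 
>>>> current consistency level and the second pins it to local quorum; the 
>>>> client drivers have an equivalent per-session/per-statement setting that 
>>>> is worth auditing:
>>>> 
>>>>     CONSISTENCY
>>>>     CONSISTENCY LOCAL_QUORUM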
>>>> 
>>>>  RP> The hint replay being slow sounds to me like you could be fighting GC.
>>>> 
>>>>  RP> You mentioned bumping the DC2 nodes to 32GB.  You might have already 
>>>> done this, but if not, be sure to be under 32GB, like 31GB. 
>>>>  RP> Otherwise you’re using larger object pointers and could actually have 
>>>> less effective ability to allocate memory.
>>>> 
>>>>  RP> As the problem is only happening in DC2, then there has to be a thing 
>>>> that is true in DC2 that isn’t true in DC1.  A difference in hardware, a
>>>>  RP> difference in O/S version, a difference in networking config or 
>>>> physical infrastructure, a difference in client-triggered activity, or a
>>>>  RP> difference in how repairs are handled. Somewhere, there is a 
>>>> difference.  I’d start with focusing on that.
>>>> 
>>>>  RP> From: Erick Ramirez <erick.rami...@datastax.com>
>>>>  RP> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>>>>  RP> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>  RP> Subject: Re: OOM only on one datacenter nodes
>>>> 
>>>>  RP> In the absence of a heap dump for you to analyse, my hypothesis is 
>>>> that your DC2 nodes are taking on traffic (from some client somewhere) but 
>>>> you're just not aware of it. The hints replay is just a side-effect of the 
>>>> nodes getting overloaded.
>>>> 
>>>>  RP> To rule out my hypothesis in the first instance, my recommendation is 
>>>> to monitor the incoming connections to the nodes in DC2. If you don't
>>>>  RP> have monitoring in place, you could simply run netstat at regular 
>>>> intervals and go from there. Cheers!
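>>>> 
>>>>  RP> Something along these lines run at regular intervals on a DC2 node 
>>>> would be enough (9042 is the default native transport port; adjust if 
>>>> yours differs):
>>>> 
>>>>     netstat -tn | grep ':9042' | grep ESTABLISHED | wc -l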
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> With best wishes,                    Alex Ott
>>>> Principal Architect, DataStax
>>>> http://datastax.com/
>>>> 
>>>> 
