Re: OOM only on one datacenter nodes

2020-04-05 Thread Surbhi Gupta
We are using the JRE and not the JDK, hence we are not able to take a heap dump.

On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa  wrote:

>
> Set the JVM flags to heap dump on OOM.
>
> Open up the result in a heap inspector of your preference (YourKit or
> similar).
>
> Find a view that counts objects by total retained size. Take a screenshot.
> Send that.
>
>
>
> On Apr 5, 2020, at 6:51 PM, Surbhi Gupta  wrote:
>
> 
> I just checked: we have set the heap size to 31GB, not 32GB, in DC2.
>
> I checked the CPU and RAM; both are the same on all the nodes in DC1 and DC2.
> Which specific OS parameters should I check?
> We are using CentOS release 6.10.
>
> Currently disk_access_mode is not set, hence it is auto in our environment. Would
> setting disk_access_mode to mmap_index_only help?
>
> Thanks
> Surbhi
>
> On Sun, 5 Apr 2020 at 01:31, Alex Ott  wrote:
>
>> Have you set -Xmx32g? In that case you may get significantly less
>> available memory because of the switch to 64-bit references.  See
>> http://java-performance.info/over-32g-heap-java/ for details, and set
>> it to slightly less than 32GB.
>>
>> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
>>  RP> Surbi:
>>
>>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if
>> the operations hitting DC1 are quorum ops instead of local quorum.  That
>>  RP> still wouldn’t explain DC2 nodes going down, but would at least
>> explain them doing more work than might be on your radar right now.
>>
>>  RP> The hint replay being slow to me sounds like you could be fighting
>> GC.
>>
>>  RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already
>> been doing this, but if not, be sure to be under 32gb, like 31gb.
>>  RP> Otherwise you’re using larger object pointers and could actually
>> have less effective ability to allocate memory.
>>
>>  RP> As the problem is only happening in DC2, then there has to be a
>> thing that is true in DC2 that isn’t true in DC1.  A difference in
>> hardware, a
>>  RP> difference in O/S version, a difference in networking config or
>> physical infrastructure, a difference in client-triggered activity, or a
>>  RP> difference in how repairs are handled. Somewhere, there is a
>> difference.  I’d start with focusing on that.
>>
>>  RP> From: Erick Ramirez 
>>  RP> Reply-To: "user@cassandra.apache.org" 
>>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>>  RP> To: "user@cassandra.apache.org" 
>>  RP> Subject: Re: OOM only on one datacenter nodes
>>
>>
>>  RP> With a lack of heapdump for you to analyse, my hypothesis is that
>> your DC2 nodes are taking on traffic (from some client somewhere) but you're
>>  RP> just not aware of it. The hints replay is just a side-effect of the
>> nodes getting overloaded.
>>
>>  RP> To rule out my hypothesis in the first instance, my recommendation
>> is to monitor the incoming connections to the nodes in DC2. If you don't
>>  RP> have monitoring in place, you could simply run netstat at regular
>> intervals and go from there. Cheers!
>>
>>  RP> GOT QUESTIONS? Apache Cassandra experts from the community and
>> DataStax have answers! Share your expertise on
>> https://community.datastax.com/.
>>
>>
>>
>> --
>> With best wishes, Alex Ott
>> Principal Architect, DataStax
>> http://datastax.com/
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>


Re: OOM only on one datacenter nodes

2020-04-05 Thread Jeff Jirsa

Set the JVM flags to heap dump on OOM.

Open up the result in a heap inspector of your preference (YourKit or
similar).

Find a view that counts objects by total retained size. Take a screenshot. Send
that.
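
For reference, a minimal sketch of the relevant flags (assuming the options are
added via cassandra-env.sh, and that the dump path exists and is writable by the
user Cassandra runs as):

  JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
  JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/var/lib/cassandra/heapdump"

On OOM the JVM then writes an .hprof file to that path, which can be loaded into
YourKit, Eclipse MAT or similar.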



> On Apr 5, 2020, at 6:51 PM, Surbhi Gupta  wrote:
> 
> 
> I just checked: we have set the heap size to 31GB, not 32GB, in DC2.
>
> I checked the CPU and RAM; both are the same on all the nodes in DC1 and DC2.
> Which specific OS parameters should I check?
> We are using CentOS release 6.10.
>
> Currently disk_access_mode is not set, hence it is auto in our environment. Would
> setting disk_access_mode to mmap_index_only help?
> 
> Thanks
> Surbhi
> 
>> On Sun, 5 Apr 2020 at 01:31, Alex Ott  wrote:
>> Have you set -Xmx32g? In that case you may get significantly less
>> available memory because of the switch to 64-bit references.  See
>> http://java-performance.info/over-32g-heap-java/ for details, and set
>> it to slightly less than 32GB.
>> 
>> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
>>  RP> Surbi:
>> 
>>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if 
>> the operations hitting DC1 are quorum ops instead of local quorum.  That
>>  RP> still wouldn’t explain DC2 nodes going down, but would at least explain 
>> them doing more work than might be on your radar right now.
>> 
>>  RP> The hint replay being slow to me sounds like you could be fighting GC.
>> 
>>  RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already 
>> been doing this, but if not, be sure to be under 32gb, like 31gb. 
>>  RP> Otherwise you’re using larger object pointers and could actually have 
>> less effective ability to allocate memory.
>> 
>>  RP> As the problem is only happening in DC2, then there has to be a thing 
>> that is true in DC2 that isn’t true in DC1.  A difference in hardware, a
>>  RP> difference in O/S version, a difference in networking config or 
>> physical infrastructure, a difference in client-triggered activity, or a
>>  RP> difference in how repairs are handled. Somewhere, there is a 
>> difference.  I’d start with focusing on that.
>> 
>>  RP> From: Erick Ramirez 
>>  RP> Reply-To: "user@cassandra.apache.org" 
>>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>>  RP> To: "user@cassandra.apache.org" 
>>  RP> Subject: Re: OOM only on one datacenter nodes
>> 
>> 
>>  RP> With a lack of heapdump for you to analyse, my hypothesis is that your 
>> DC2 nodes are taking on traffic (from some client somewhere) but you're
>>  RP> just not aware of it. The hints replay is just a side-effect of the 
>> nodes getting overloaded.
>> 
>>  RP> To rule out my hypothesis in the first instance, my recommendation is 
>> to monitor the incoming connections to the nodes in DC2. If you don't
>>  RP> have monitoring in place, you could simply run netstat at regular 
>> intervals and go from there. Cheers!
>> 
>>  RP> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax 
>> have answers! Share your expertise on https://community.datastax.com/.
>> 
>> 
>> 
>> -- 
>> With best wishes, Alex Ott
>> Principal Architect, DataStax
>> http://datastax.com/
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>> 


Re: OOM only on one datacenter nodes

2020-04-05 Thread Surbhi Gupta
I just checked: we have set the heap size to 31GB, not 32GB, in DC2.

I checked the CPU and RAM; both are the same on all the nodes in DC1 and DC2.
Which specific OS parameters should I check?
We are using CentOS release 6.10.
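
As a starting point, a minimal sketch of OS-level settings worth comparing between
a DC1 node and a DC2 node (the paths below are the usual ones for CentOS 6 and may
differ on other kernels):

  uname -r                                                # kernel version
  free -m; swapon -s                                      # memory and swap configuration
  sysctl vm.max_map_count vm.swappiness                   # kernel tunables Cassandra is sensitive to
  cat /sys/kernel/mm/redhat_transparent_hugepage/enabled  # THP setting on RHEL/CentOS 6
  ulimit -a                                               # resource limits (run as the user Cassandra runs as)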

Currently disk_access_mode is not set, hence it is auto in our environment. Would
setting disk_access_mode to mmap_index_only help?
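
If you do try it, a minimal sketch of the change (assuming the default packaged
config location; the yaml path may differ per install, and a node restart is needed
for it to take effect):

  # /etc/cassandra/conf/cassandra.yaml
  disk_access_mode: mmap_index_only   # unset/auto mmaps both data and index files on 64-bit JVMs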

Thanks
Surbhi

On Sun, 5 Apr 2020 at 01:31, Alex Ott  wrote:

> Have you set -Xmx32g? In that case you may get significantly less
> available memory because of the switch to 64-bit references.  See
> http://java-performance.info/over-32g-heap-java/ for details, and set
> it to slightly less than 32GB.
>
> Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
>  RP> Surbi:
>
>  RP> If you aren’t seeing connection activity in DC2, I’d check to see if
> the operations hitting DC1 are quorum ops instead of local quorum.  That
>  RP> still wouldn’t explain DC2 nodes going down, but would at least
> explain them doing more work than might be on your radar right now.
>
>  RP> The hint replay being slow to me sounds like you could be fighting GC.
>
>  RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already
> been doing this, but if not, be sure to be under 32gb, like 31gb.
>  RP> Otherwise you’re using larger object pointers and could actually have
> less effective ability to allocate memory.
>
>  RP> As the problem is only happening in DC2, then there has to be a thing
> that is true in DC2 that isn’t true in DC1.  A difference in hardware, a
>  RP> difference in O/S version, a difference in networking config or
> physical infrastructure, a difference in client-triggered activity, or a
>  RP> difference in how repairs are handled. Somewhere, there is a
> difference.  I’d start with focusing on that.
>
>  RP> From: Erick Ramirez 
>  RP> Reply-To: "user@cassandra.apache.org" 
>  RP> Date: Saturday, April 4, 2020 at 8:28 PM
>  RP> To: "user@cassandra.apache.org" 
>  RP> Subject: Re: OOM only on one datacenter nodes
>
>
>  RP> With a lack of heapdump for you to analyse, my hypothesis is that
> your DC2 nodes are taking on traffic (from some client somewhere) but you're
>  RP> just not aware of it. The hints replay is just a side-effect of the
> nodes getting overloaded.
>
>  RP> To rule out my hypothesis in the first instance, my recommendation is
> to monitor the incoming connections to the nodes in DC2. If you don't
>  RP> have monitoring in place, you could simply run netstat at regular
> intervals and go from there. Cheers!
>
>  RP> GOT QUESTIONS? Apache Cassandra experts from the community and
> DataStax have answers! Share your expertise on
> https://community.datastax.com/.
>
>
>
> --
> With best wishes, Alex Ott
> Principal Architect, DataStax
> http://datastax.com/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: unconfigured table logtabl

2020-04-05 Thread Jeff Jirsa
Be really cautious here; this can be deceptive.

There are races in some versions of Cassandra that can leave you with different
combinations of cfid:

The cfid is on disk for schema
It’s in memory for schema
It’s used for the table path on disk

Those three have to match for things to work properly, and various races can
make them mismatch.

If you have one ID for the path and a different one in the schema table on disk and
you bounce, the database throws away the old directory and you start fresh with
a new empty directory.

So resetlocalschema may “fix” this, but it may fix it by throwing away one copy
of the data.

If only one host is out of sync, you may want to pretend it died and replace it.
Alternatively, a ton of inspection can tell you which variation of mismatch you
have and you can correct it properly, but it’s more work than I’m prepared to
type today.

Sorry for the unpleasant reality.
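
As a starting point for that inspection, a minimal sketch of how to compare two of
the three copies of the id (the keyspace, table name and data path below are taken
from this thread and the default install layout; adjust for your environment):

  cqlsh -e "SELECT id FROM system_schema.tables WHERE keyspace_name='oapi_dev' AND table_name='logtabl';"
  ls -d /var/lib/cassandra/data/oapi_dev/logtabl-*

The hex suffix of the data directory (the id with dashes stripped) should match the
id returned by the schema query; if it doesn't, you are in one of the mismatch cases
described above.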



> On Apr 5, 2020, at 3:38 PM, Erick Ramirez  wrote:
> 
> 
>> Another suggestion before resetlocalschema: try a rolling restart of all the
>> nodes in the cluster and see if it fixes the problem. After the restart, all
>> the nodes will use the same schema for the table.
> 
> That's a little bit heavy-handed.  Resetting a node's schema is a simple, 
> online operation that doesn't involve a cluster-wide restart. Cheers!


Re: unconfigured table logtabl

2020-04-05 Thread Erick Ramirez
>
> Another suggestion before resetlocalschema: try a rolling restart of all the
> nodes in the cluster and see if it fixes the problem. After the restart, all
> the nodes will use the same schema for the table.
>

That's a little bit heavy-handed.  Resetting a node's schema is a simple,
online operation that doesn't involve a cluster-wide restart. Cheers!
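
For reference, the reset itself is a single command run on the out-of-sync node
(a sketch, assuming default nodetool/JMX settings):

  nodetool resetlocalschema

The node drops its local schema and pulls a fresh copy from the other nodes, with no
restart required. Note Jeff's caveat elsewhere in the thread about what a schema
reset can mean when the table id on disk and in the schema tables disagree.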


Re: unconfigured table logtabl

2020-04-05 Thread Jai Bheemsen Rao Dhanwada
Another suggestion before resetlocalschema: try a rolling restart of all the
nodes in the cluster and see if it fixes the problem. After the restart, all
the nodes will use the same schema for the table.
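
For reference, a minimal sketch of the per-node step (assuming a packaged install on
CentOS 6 with the standard init script; wait for the node to show as UN in
nodetool status before moving on to the next one):

  nodetool drain && sudo service cassandra restart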

On Sunday, April 5, 2020, David Ni  wrote:

> Hi Erick
> Thank you very much for your friendly note.
> ERROR [AntiEntropyStage:1] 2020-04-04 13:57:09,614
> RepairMessageVerbHandler.java:177 - Table with id 21a3fa90-74c7-11ea-978a-
> b556b0c3a5ea was dropped during prepare phase of repair
> cassandra@cqlsh:system_schema> select keyspace_name,table_name,id from
> tables where keyspace_name='oapi_dev' and table_name='logtabl';
>  keyspace_name | table_name | id
> ---------------+------------+--------------------------------------
>       oapi_dev |    logtabl | 830028a0-7584-11ea-a277-bdf3d1289bdd
> The table id does not match the id from system_schema.tables.
> How can I fix it?
>
>
>
>
> At 2020-04-04 14:44:16, "Erick Ramirez" 
> wrote:
>
> Is it possible someone else dropped then recreated the logtabl table?
> Also, did you confirm that the missing table ID matches the ID of logtabl?
>
> On a friendly note, there are a number of users here like me who respond
> to questions on the go. I personally find it difficult to read screenshots
> on my phone so if it isn't too much trouble, it would be preferable if you
> pasted the text here instead. Cheers!


Re:Re:Re: Re: Re: unconfigured table logtabl

2020-04-05 Thread David Ni
Hi Erick

Thank you very much for your friendly note.
ERROR [AntiEntropyStage:1] 2020-04-04 13:57:09,614 
RepairMessageVerbHandler.java:177 - Table with id 
21a3fa90-74c7-11ea-978a-b556b0c3a5ea was dropped during prepare phase of repair
cassandra@cqlsh:system_schema> select keyspace_name,table_name,id from tables 
where keyspace_name='oapi_dev' and table_name='logtabl';
 keyspace_name | table_name | id
---------------+------------+--------------------------------------
      oapi_dev |    logtabl | 830028a0-7584-11ea-a277-bdf3d1289bdd

The table id does not match the id from system_schema.tables.
How can I fix it?

At 2020-04-04 14:44:16, "Erick Ramirez"  wrote:

Is it possible someone else dropped then recreated the logtabl table? Also, did 
you confirm that the missing table ID matches the ID of logtabl?


On a friendly note, there are a number of users here like me who respond to 
questions on the go. I personally find it difficult to read screenshots on my 
phone so if it isn't too much trouble, it would be preferable if you pasted the 
text here instead. Cheers!


Re: OOM only on one datacenter nodes

2020-04-05 Thread Alex Ott
Have you set -Xmx32g? In that case you may get significantly less
available memory because of the switch to 64-bit references.  See
http://java-performance.info/over-32g-heap-java/ for details, and set
it to slightly less than 32GB.
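
A quick way to confirm whether compressed object pointers are in effect for a given
heap size (a sketch; these are standard HotSpot flags, and your java binary path may
differ):

  java -Xmx31g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops   # expect true
  java -Xmx32g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops   # expect false

Once the flag reports false, every object reference costs 8 bytes instead of 4, so
the extra gigabyte of heap can be more than eaten up by pointer overhead.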

Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
 RP> Surbi:

 RP> If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That
 RP> still wouldn’t explain DC2 nodes going down, but would at least explain 
them doing more work than might be on your radar right now.
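
To illustrate the distinction, a sketch in cqlsh (client drivers have their own
per-session or per-query consistency setting):

  CONSISTENCY;               -- show the current level
  CONSISTENCY LOCAL_QUORUM;  -- quorum among replicas in the local DC only
  CONSISTENCY QUORUM;        -- quorum across all replicas, so DC2 replicas serve DC1 clients too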

 RP> The hint replay being slow to me sounds like you could be fighting GC.

 RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already been 
doing this, but if not, be sure to be under 32gb, like 31gb. 
 RP> Otherwise you’re using larger object pointers and could actually have less 
effective ability to allocate memory.

 RP> As the problem is only happening in DC2, then there has to be a thing that 
is true in DC2 that isn’t true in DC1.  A difference in hardware, a
 RP> difference in O/S version, a difference in networking config or physical 
infrastructure, a difference in client-triggered activity, or a
 RP> difference in how repairs are handled. Somewhere, there is a difference.  
I’d start with focusing on that.

 RP> From: Erick Ramirez 
 RP> Reply-To: "user@cassandra.apache.org" 
 RP> Date: Saturday, April 4, 2020 at 8:28 PM
 RP> To: "user@cassandra.apache.org" 
 RP> Subject: Re: OOM only on one datacenter nodes


 RP> With a lack of heapdump for you to analyse, my hypothesis is that your DC2 
nodes are taking on traffic (from some client somewhere) but you're
 RP> just not aware of it. The hints replay is just a side-effect of the nodes 
getting overloaded.

 RP> To rule out my hypothesis in the first instance, my recommendation is to 
monitor the incoming connections to the nodes in DC2. If you don't
 RP> have monitoring in place, you could simply run netstat at regular 
intervals and go from there. Cheers!
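
For example, a minimal sketch of what that could look like (9042 is the default
native transport port; adjust if your cluster uses a different one):

  netstat -tn | awk '$6 == "ESTABLISHED" && $4 ~ /:9042$/ {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn

Run at intervals, this counts established client connections per remote IP, which
quickly shows whether anything is actually connecting to the DC2 nodes.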

 RP> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax 
have answers! Share your expertise on https://community.datastax.com/.



-- 
With best wishes, Alex Ott
Principal Architect, DataStax
http://datastax.com/

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org