Re: Cassandra is not showing a node up hours after restart

2019-11-25 Thread Shalom Sagges
Sorry, disregard the schema ID. It's too early in the morning here ;)


Re: Cassandra is not showing a node up hours after restart

2019-11-25 Thread Shalom Sagges
Hi Paul,

From the gossipinfo output, it looks like the node's IP address and
rpc_address are different.
/192.168.*187*.121 vs RPC_ADDRESS:192.168.*185*.121
You can also see that there's a schema disagreement between nodes, e.g.
schema_id on node001 is fd2dcb4b-ca62-30df-b8f2-d3fd774f2801 and on node002
it is fd2dcb4b-ca62-30df-b8f2-d3fd774f2801.
You can run nodetool describecluster to see it as well.
So I suggest changing the rpc_address to the node's IP address, or setting
it to 0.0.0.0; that should resolve the issue.
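A minimal sketch of the suggested change, using a stand-in cassandra.yaml in the current directory (the real file usually lives in the Cassandra conf directory, e.g. /etc/cassandra/); the addresses are the ones from this thread:

```shell
# Stand-in config file reproducing the mismatch described above.
cfg=./cassandra.yaml
printf 'listen_address: 192.168.187.121\nrpc_address: 192.168.185.121\n' > "$cfg"

# Point rpc_address at the node's own IP (or 0.0.0.0 to bind all interfaces).
sed -i 's/^rpc_address:.*/rpc_address: 192.168.187.121/' "$cfg"

# Both addresses should now agree.
grep -E '^(listen_address|rpc_address):' "$cfg"
```

If you do use 0.0.0.0, note that cassandra.yaml also requires broadcast_rpc_address to be set in that case, and either way the node must be restarted to pick up the change.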

Hope this helps!



Re: Cassandra is not showing a node up hours after restart

2019-11-25 Thread Inquistive allen
Hello,

Check and compare the following parameters:

1. The Java version should ideally match across all nodes in the cluster.
2. Check whether port 7000 is open between the nodes (use telnet or nc).
3. Look for clues in the system logs about why gossip is failing.

Do confirm the above.
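A rough sketch of checks 1 and 2, assuming Bash on Linux; the node00X hostnames are placeholders for your own nodes:

```shell
#!/usr/bin/env bash
# Print the local Java version line (compare it across nodes, e.g. via ansible).
command -v java >/dev/null && java -version 2>&1 | head -n1 \
  || echo "java not found (illustration only)"

# check_port <host> <port>: "open" if a TCP connection succeeds, else "closed".
check_port() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null && echo open || echo closed
}

# Probe the gossip port (7000) on the other nodes; hostnames are hypothetical.
for host in node002 node003 node004; do
  echo "$host:7000 $(check_port "$host" 7000)"
done
```

The /dev/tcp idiom is a Bash built-in path, so this works even where nc/telnet are not installed.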

Thanks



RE: Cassandra is not showing a node up hours after restart

2019-11-25 Thread Paul Mena
NTP was restarted on the Cassandra nodes, but unfortunately I’m still getting 
the same result: the restarted node does not appear to be rejoining the cluster.

Here’s another data point: “nodetool gossipinfo”, when run from the restarted 
node (“node001”) shows a status of “normal”:

user@node001=> nodetool -u gossipinfo
/192.168.187.121
  generation:1574364410
  heartbeat:209150
  NET_VERSION:8
  RACK:rack1
  STATUS:NORMAL,-104847506331695918
  RELEASE_VERSION:2.1.9
  SEVERITY:0.0
  LOAD:5.78684155614E11
  HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
  SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
  DC:datacenter1
  RPC_ADDRESS:192.168.185.121

When run from one of the other nodes, however, node001’s status is shown as 
“shutdown”:

user@node002=> nodetool gossipinfo
/192.168.187.121
  generation:1491825076
  heartbeat:2147483647
  STATUS:shutdown,true
  RACK:rack1
  NET_VERSION:8
  LOAD:5.78679987693E11
  RELEASE_VERSION:2.1.9
  DC:datacenter1
  SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
  HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
  RPC_ADDRESS:192.168.185.121
  SEVERITY:0.0
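One way to compare what each node's gossip reports is to filter the STATUS line per endpoint. This sketch runs the filter against a copy of the node002 output above; in practice you would pipe `nodetool gossipinfo` from each node into the same awk filter:

```shell
# Pull the STATUS line per endpoint out of gossipinfo-style output.
status=$(awk '/^\//{ip=$1} /STATUS:/{print ip, $1}' <<'EOF'
/192.168.187.121
  generation:1491825076
  heartbeat:2147483647
  STATUS:shutdown,true
EOF
)
echo "$status"   # -> /192.168.187.121 STATUS:shutdown,true
```

Running this on every node quickly shows which members still hold the stale "shutdown" view of the restarted node.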


Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539

Seeing tons of DigestMismatchException exceptions after upgrading from 2.2.13 to 3.11.4

2019-11-25 Thread Colleen Velo
Hello,

As part of the final stages of our 2.2 --> 3.11 upgrades, one of our
clusters (on AWS / 18 nodes / m4.2xlarge) produced some post-upgrade fits. We
started getting spikes of Cassandra read and write timeouts despite the fact
that the overall metrics volumes were unchanged. As part of the upgrade
process, there is a TWCS table for which we used a facade implementation to
change the namespace of the compaction class, but that table has very low
query volume.

The DigestMismatchException error messages (based on sampling the hash
keys and finding which tables have partitions for those hash keys) seem to
be occurring on the heaviest-volume table (approximately 4,000 reads and
1,600 writes per second per node), and that table has semi-medium row widths
with about 10-40 column keys (or at least the digest-mismatch partitions
have that kind of width). The keyspace is RF3 using NetworkTopologyStrategy;
the CL is QUORUM for both reads and writes.

We have experienced the DigestMismatchException errors on all 3 of the
Production clusters that we have upgraded (all of them single-DC, in the
us-east-1/eu-west-1/ap-northeast-2 AWS regions), and in all three cases
those DigestMismatchException errors were not present in either the 2.1.x
or 2.2.x versions of Cassandra.
Does anyone know of changes from 2.2 to 3.11 that would produce additional
timeout problems, such as heavier blocking read-repair logic?
We ran repairs (via reaper v1.4.8) (much nicer in 3.11 than 2.1) on all of
the tables and across all of the nodes, and our timeouts seemed to have
disappeared, but we continue to see a rapid stream of the digest-mismatch
exceptions, so much so that our Cassandra debug logs are rolling over every
15 minutes. There is a mailing-list post from 2018 that
indicates that some DigestMismatchException error messages are natural if
you are reading while writing, but the sheer volume that we are getting is
very concerning:
 - https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html

Is that level of DigestMismatchException unusual? Or can that volume of
mismatches appear when semi-wide rows simply require a lot of resolution,
because flurries of quorum reads/writes (RF3) on recent partitions have a
decent chance of replica reads not yet having fully synced data? Does the
digest mismatch get debug-logged on every chance of read repair?
Also, why are these DigestMismatchExceptions only occurring once the
upgrade to 3.11 has occurred?

~

Sample DigestMismatchException error message:
DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key
DecoratedKey(-6492169518344121155,
66306139353831322d323064382d313037322d663965632d636565663165326563303965)
(be2c0feaa60d99c388f9d273fdc360f7 vs 09eaded2d69cf2dd49718076edf56b36)
    at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.4.jar:3.11.4]
    at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.4.jar:3.11.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.4.jar:3.11.4]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
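A quick way to gauge the mismatch volume and spot hot partitions is to tally the DecoratedKey tokens in the debug log. This sketch writes one sample line (copied from the error above) to a stand-in file; in practice you would point the grep at your real debug.log path:

```shell
# Stand-in log file with one digest-mismatch line from the report above.
printf '%s\n' \
  'DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448 ReadCallback.java:242 - Digest mismatch: org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-6492169518344121155, 6630...)' \
  > sample_debug.log

# Tally mismatching partition tokens, highest count first.
grep -o 'DecoratedKey([-0-9]*' sample_debug.log | sort | uniq -c | sort -rn | head
```

If a handful of tokens dominate the tally, the mismatches are concentrated on a few hot partitions rather than spread across the table.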

Cluster(s) setup:
* AWS region: eu-west-1:
— Nodes: 18
— single DC
— keyspace: RF3 using NetworkTopology

* AWS region: us-east-1:
— Nodes: 20
— single DC
— keyspace: RF3 using NetworkTopology

* AWS region: ap-northeast-2:
— Nodes: 30
— single DC
— keyspace: RF3 using NetworkTopology

Thanks for any insight into this issue.

-- 

Colleen Velo (email: cmv...@gmail.com)


RE: Cassandra is not showing a node up hours after restart

2019-11-25 Thread Paul Mena
I’ve just discovered that NTP is not running on any of these Cassandra nodes, 
and that the timestamps are all over the map. Could this be causing my issue?

user@remote=> ansible pre-prod-cassandra -a date
node001.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 13:58:17 UTC 2019

node004.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 14:07:20 UTC 2019

node003.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 13:57:06 UTC 2019

node001.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 14:07:22 UTC 2019
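The spread in those timestamps can be quantified with GNU date, using the earliest and latest values copied from the output above:

```shell
# Earliest and latest clock readings reported by the nodes, as epoch seconds.
earliest=$(date -u -d 'Mon Nov 25 13:57:06 UTC 2019' +%s)
latest=$(date -u -d 'Mon Nov 25 14:07:22 UTC 2019' +%s)
echo "max skew: $((latest - earliest)) seconds"   # -> max skew: 616 seconds
```

Roughly ten minutes of skew between nodes; since Cassandra resolves write conflicts by timestamp, that is worth fixing regardless of the gossip issue.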

Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539

From: Inquistive allen 
Sent: Monday, November 25, 2019 2:46 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra is not showing a node up hours after restart

Hello team,

Just to add on to the discussion, one may run,
nodetool disablebinary, followed by nodetool disablethrift, followed by
nodetool drain.
nodetool drain also does the work of nodetool flush, plus declares to the
cluster that the node is down and not accepting traffic.

Thanks


On Mon, 25 Nov, 2019, 12:55 AM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
Before shutting down Cassandra, nodetool drain should be executed first. As
soon as you run nodetool drain, the other nodes will see this node as down and
no new traffic will come to it.
I generally leave a 10-second gap between nodetool drain and the Cassandra stop.
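The sequence suggested in the last two messages can be sketched as a Bash function. The nodetool subcommands are real; the "service cassandra" name and the 10-second pause follow the advice above, and may differ on your init system:

```shell
#!/usr/bin/env bash
# Graceful restart order: stop client traffic, drain, pause, then restart.
graceful_restart() {
  nodetool disablebinary   # stop accepting native-protocol (CQL) clients
  nodetool disablethrift   # stop accepting Thrift clients
  nodetool drain           # flush memtables; gossip announces the shutdown
  sleep 10                 # the suggested gap before stopping the process
  service cassandra stop
  service cassandra start
}
# graceful_restart        # uncomment to run on a node where this is appropriate
```

Leaving the call commented out keeps the sketch safe to source; run it only on a node you actually intend to restart.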

On Sun, Nov 24, 2019 at 9:52 AM Paul Mena <pm...@whoi.edu> wrote:

Thank you for the replies. I had made no changes to the config before the 
rolling restart.



I can try another restart but was wondering if I should do it differently. I
had simply done "service cassandra stop" followed by "service cassandra start".
Since then I've seen some suggestions to precede the shutdown with "nodetool
disablegossip" and/or "nodetool drain". Are these commands advisable? Are any
other commands recommended either before the shutdown or after the startup?



Thanks again!



Paul


From: Naman Gupta <naman.gu...@girnarsoft.com>
Sent: Sunday, November 24, 2019 11:18:14 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra is not showing a node up hours after restart

Did you change the name of datacenter or any other config changes before the 
rolling restart?

On Sun, Nov 24, 2019 at 8:49 PM Paul Mena <pm...@whoi.edu> wrote:
I am in the process of doing a rolling restart on a 4-node cluster running 
Cassandra 2.1.9. I stopped and started Cassandra on node 1 via "service 
cassandra stop/start", and noted nothing unusual in either system.log or 
cassandra.log. Doing a "nodetool status" from node 1 shows all four nodes up:

user@node001=> nodetool status
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns  Host ID                               Rack
UN  192.168.187.121  538.95 GB  256     ?     c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?     bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?     273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.05 GB  256     ?     b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
But the same command from any of the other 3 nodes shows node 1 still down:


user@node002=> nodetool status
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns  Host ID                               Rack
DN  192.168.187.121  538.94 GB  256     ?     c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?     bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?     273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.04 GB  256     ?     b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
Is there something I can do to remedy the current situation, so that I can
continue with the rolling restart?