As we are still without a functional Cassandra cluster in our development
environment, I thought I’d try restarting the same node (one of 4 in the
cluster) with the following command:
ip=$(cat /etc/hostname)
nodetool disablethrift && nodetool disablebinary && sleep 5 && \
  nodetool disablegossip && nodetool drain && sleep 10 && \
  sudo service cassandra restart && \
  until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh $ip > /dev/null 2>&1; do
    echo "Node $ip is still DOWN"; sleep 10
  done && echo "Node $ip is now UP"
The above command returned “Node is now UP” after about 40 seconds, confirmed
on “node001” via “nodetool status”:
user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns  Host ID                               Rack
UN  192.168.187.121  539.43 GB  256     ?     c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  633.92 GB  256     ?     bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  576.31 GB  256     ?     273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  628.5 GB   256     ?     b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
As was the case before, running “nodetool status” on any of the other nodes
shows that “node001” is still down:
user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns  Host ID                               Rack
DN  192.168.187.121  538.94 GB  256     ?     c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  634.04 GB  256     ?     bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  576.42 GB  256     ?     273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  628.56 GB  256     ?     b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
Is it inadvisable to continue with the rolling restart?
Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539
From: Shalom Sagges <[email protected]>
Sent: Tuesday, November 26, 2019 12:59 AM
To: [email protected]
Subject: Re: Cassandra is not showing a node up hours after restart
Hi Paul,
From the gossipinfo output, it looks like the node's IP address and rpc_address
are different.
/192.168.187.121 vs RPC_ADDRESS:192.168.185.121
You can also see that there's a schema disagreement between nodes, e.g.
schema_id on node001 is fd2dcb4b-ca62-30df-b8f2-d3fd774f2801 and on node002 it
is fd2dcb4b-ca62-30df-b8f2-d3fd774f2801.
You can run nodetool describecluster to see it as well.
So I suggest changing the rpc_address to the node's IP address, or setting it to 0.0.0.0; that should resolve the issue.
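For example, a rough sketch of the change (the cassandra.yaml path and addresses are assumptions to adapt to your environment):

# /etc/cassandra/cassandra.yaml on node001
rpc_address: 192.168.187.121          # match the node's own IP
# or, to listen on all interfaces:
# rpc_address: 0.0.0.0
# broadcast_rpc_address: 192.168.187.121   # required if rpc_address is 0.0.0.0

After restarting Cassandra on the node, nodetool describecluster should show a single schema version across all nodes.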
Hope this helps!
On Tue, Nov 26, 2019 at 4:05 AM Inquistive allen
<[email protected]> wrote:
Hello,
Check and compare the following parameters:
1. The Java version should ideally match across all nodes in the cluster.
2. Check that port 7000 is open between the nodes. Use the telnet or nc commands.
3. You should see some clues in the system logs about why gossip is failing.
Please confirm the above; example commands are sketched below.
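For example, a rough sketch of those checks (host names, IPs, and the package-default log path are assumptions):

java -version                                    # run on every node and compare
nc -zv 192.168.187.121 7000                      # is the storage/gossip port reachable?
grep -i gossip /var/log/cassandra/system.log | tail -n 50    # recent gossip-related log lines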
Thanks
On Tue, 26 Nov, 2019, 2:50 AM Paul Mena,
<[email protected]> wrote:
NTP was restarted on the Cassandra nodes, but unfortunately I’m still getting
the same result: the restarted node does not appear to be rejoining the cluster.
Here’s another data point: “nodetool gossipinfo”, when run from the restarted
node (“node001”) shows a status of “normal”:
user@node001=> nodetool -u gossipinfo
/192.168.187.121
generation:1574364410
heartbeat:209150
NET_VERSION:8
RACK:rack1
STATUS:NORMAL,-104847506331695918
RELEASE_VERSION:2.1.9
SEVERITY:0.0
LOAD:5.78684155614E11
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
DC:datacenter1
RPC_ADDRESS:192.168.185.121
When run from one of the other nodes, however, node001’s status is shown as
“shutdown”:
user@node002=> nodetool gossipinfo
/192.168.187.121
generation:1491825076
heartbeat:2147483647
STATUS:shutdown,true
RACK:rack1
NET_VERSION:8
LOAD:5.78679987693E11
RELEASE_VERSION:2.1.9
DC:datacenter1
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
RPC_ADDRESS:192.168.185.121
SEVERITY:0.0
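One quick way to compare how every node currently sees node001's gossip state (a rough sketch; passwordless ssh and these host names are assumptions):

for h in node001 node002 node003 node004; do
  printf '%s: ' "$h"
  ssh "$h" nodetool gossipinfo | grep -A 10 '^/192.168.187.121' | grep -m 1 'STATUS'
done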
Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539
From: Paul Mena
Sent: Monday, November 25, 2019 9:29 AM
To: [email protected]
Subject: RE: Cassandra is not showing a node up hours after restart
I’ve just discovered that NTP is not running on any of these Cassandra nodes,
and that the timestamps are all over the map. Could this be causing my issue?
user@remote=> ansible pre-prod-cassandra -a date
node001.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 13:58:17 UTC 2019
node004.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 14:07:20 UTC 2019
node003.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 13:57:06 UTC 2019
node001.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 14:07:22 UTC 2019
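If the clock drift turns out to be the problem, something like this should get the clocks back in sync (a rough sketch; the inventory group and the "ntp" service name are assumptions):

ansible pre-prod-cassandra -b -m service -a "name=ntp state=restarted"
ansible pre-prod-cassandra -a "ntpq -p"    # verify each node is syncing with a peer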
Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539
From: Inquistive allen <[email protected]>
Sent: Monday, November 25, 2019 2:46 AM
To: [email protected]
Subject: Re: Cassandra is not showing a node up hours after restart
Hello team,
Just to add to the discussion: one may run nodetool disablebinary, followed by nodetool disablethrift, followed by nodetool drain.
nodetool drain also does the work of nodetool flush, plus it announces to the cluster that the node is down and not accepting traffic.
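As a rough sketch, on the node about to be restarted:

nodetool disablebinary    # stop accepting native-protocol (CQL) clients
nodetool disablethrift    # stop accepting Thrift clients
nodetool drain            # flush memtables and announce the shutdown via gossip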
Thanks
On Mon, 25 Nov, 2019, 12:55 AM Surbhi Gupta,
<[email protected]> wrote:
Before a Cassandra shutdown, nodetool drain should be executed first. As soon as you run nodetool drain, the other nodes will see this node as down and no new traffic will come to it.
I generally give a 10-second gap between nodetool drain and the Cassandra stop.
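For example, a rough sketch (sudo and the init-script name are assumptions):

nodetool drain && sleep 10 && sudo service cassandra stop
# ... perform maintenance ...
sudo service cassandra start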
On Sun, Nov 24, 2019 at 9:52 AM Paul Mena
<[email protected]> wrote:
Thank you for the replies. I had made no changes to the config before the
rolling restart.
I can try another restart but was wondering if I should do it differently. I had simply done "service cassandra stop" followed by "service cassandra start".
Since then I've seen some suggestions to precede the shutdown with "nodetool disablegossip" and/or "nodetool drain". Are these commands advisable? Are any other commands recommended, either before the shutdown or after the startup?
Thanks again!
Paul
________________________________
From: Naman Gupta
<[email protected]>
Sent: Sunday, November 24, 2019 11:18:14 AM
To: [email protected]
Subject: Re: Cassandra is not showing a node up hours after restart
Did you change the name of the datacenter or make any other config changes before the rolling restart?
On Sun, Nov 24, 2019 at 8:49 PM Paul Mena
<[email protected]> wrote:
I am in the process of doing a rolling restart on a 4-node cluster running
Cassandra 2.1.9. I stopped and started Cassandra on node 1 via "service
cassandra stop/start", and noted nothing unusual in either system.log or
cassandra.log. Doing a "nodetool status" from node 1 shows all four nodes up:
user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns  Host ID                               Rack
UN  192.168.187.121  538.95 GB  256     ?     c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?     bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?     273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.05 GB  256     ?     b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
But running the same command from any of the other 3 nodes shows node 1 as still down.
user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns  Host ID                               Rack
DN  192.168.187.121  538.94 GB  256     ?     c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?     bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?     273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.04 GB  256     ?     b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1
Is there something I can do to remedy this current situation - so that I can
continue with the rolling restart?