We have a large cluster running 2.1.19, with 3 datacenters:
- xxx: 220 nodes
- yyy: 220 nodes
- zzz: 500 nodes
The ping times between the datacenters are:
- xxx to yyy: 50 ms
- xxx to zzz: 240 ms
- yyy to zzz: 200 ms

There are some added complications:
- our team manages xxx and yyy only, and has no ssh/logging access to zzz.
- the link to zzz can be slow/unstable at times.

In a maintenance window a few days ago, we did a rolling restart, bouncing one node at a time in both xxx and yyy concurrently, with the command `nodetool drain && sudo service cassandra restart && sleep 30`. In the process, the node xxx-167 was somehow marked as "shutdown" by all other nodes, and remained that way indefinitely. We only noticed this about 8 hours later.

Then we checked the gossipinfo for xxx-167:

- from the node itself:

    xxx-167.domain/1.2.3.4
      generation:1509301942
      heartbeat:47368
      STATUS:14:NORMAL,<redacted>
      LOAD:47293:<redacted>
      SCHEMA:10:<redacted>
      DC:29:xxx
      RACK:31:a
      RELEASE_VERSION:4:2.1.19
      RPC_ADDRESS:3:1.2.3.4
      SEVERITY:47367:0.0
      NET_VERSION:1:8
      HOST_ID:2:<redacted>
      TOKENS:13:<hidden>

- from yyy-001:

    /1.2.3.4
      generation:1509301942
      heartbeat:2147483647
      STATUS:608083:shutdown,true
      LOAD:22:<redacted>
      SCHEMA:10:<redacted>
      DC:29:xxx
      RACK:31:a
      RELEASE_VERSION:4:2.1.19
      RPC_ADDRESS:3:1.2.3.4
      SEVERITY:176:0.0
      NET_VERSION:1:8
      HOST_ID:2:<redacted>
      TOKENS:13:<hidden>
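Side note: the disagreement above (NORMAL on the node itself vs shutdown,true elsewhere) is easy to script a check for. A rough sketch of what that could look like; hosts.txt (one reachable node per line) and the 1.2.3.4 address are placeholders, and this is not something we actually ran during the incident:

    # Sketch: ask every node how it currently sees xxx-167's gossip STATUS.
    # hosts.txt (one reachable hostname per line) and /1.2.3.4 are placeholders.
    for h in $(cat hosts.txt); do
        st=$(ssh "$h" nodetool gossipinfo | awk '
            /\/1\.2\.3\.4/ {hit=1; next}
            hit && /STATUS/ {print; exit}')
        echo "$h sees: $st"
    done | grep -v NORMAL    # anything left over (e.g. shutdown) deserves a look

Anything this prints (i.e. a non-NORMAL STATUS) would have flagged xxx-167 hours earlier.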
Another 2 hours later, we decided to just restart cassandra on xxx-167. That fixed the problem.

Here are some log lines from nodes in xxx and yyy that could be relevant to the incident:

    2017-10-29 18:31:55,324 [yyy-081] StorageService.java:1138 - DRAINING: starting drain process
    2017-10-29 18:31:57,897 [yyy-081] StorageService.java:1138 - DRAINED
    2017-10-29 18:32:06,116 [xxx-167] StorageService.java:1138 - DRAINING: starting drain process
    2017-10-29 18:32:06,357 [group-a] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    ... repeat ...
    2017-10-29 18:32:06,552 [group-a] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    2017-10-29 18:32:08,534 [xxx-167] StorageService.java:1138 - DRAINED
    2017-10-29 18:32:10,337 [yyy-081] CassandraDaemon.java:155 - Hostname: yyy-081.domain
    2017-10-29 18:32:12,939 [yyy-081] StorageService.java:807 - Starting up server gossip
    2017-10-29 18:32:13,484 [yyy-081] OutboundTcpConnection.java:496 - Handshaking version with yyy-173
    2017-10-29 18:32:13,486 [yyy-173] OutboundTcpConnection.java:496 - Handshaking version with yyy-081
    2017-10-29 18:32:13,493 [yyy-081] Gossiper.java:1029 - Node xxx-167 has restarted, now UP
    2017-10-29 18:32:13,495 [yyy-081] StorageService.java:1715 - Node xxx-167 state jump to shutdown
    2017-10-29 18:32:13,499 [yyy-081] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    2017-10-29 18:32:20,182 [xxx-167] CassandraDaemon.java:155 - Hostname: xxx-167.domain
    2017-10-29 18:32:22,304 [xxx-167] StorageService.java:807 - Starting up server gossip
    2017-10-29 18:32:22,463 [yyy-081] OutboundTcpConnection.java:496 - Handshaking version with xxx-167
    2017-10-29 18:32:22,780 [xxx-167] OutboundTcpConnection.java:496 - Handshaking version with yyy-173
    2017-10-29 18:32:23,857 [xxx-167] OutboundTcpConnection.java:496 - Handshaking version with yyy-081
    2017-10-29 18:32:25,797 [yyy-173] Gossiper.java:1029 - Node xxx-167 has restarted, now UP
    2017-10-29 18:32:25,847 [yyy-173] OutboundTcpConnection.java:496 - Handshaking version with xxx-167
    2017-10-29 18:32:27,113 [yyy-081] Gossiper.java:1029 - Node xxx-167 has restarted, now UP
    2017-10-29 18:33:13,443 [yyy-173] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    2017-10-29 18:33:13,628 [group-b] StorageService.java:1715 - Node xxx-167 state jump to shutdown
    ... repeat ...
    2017-10-29 18:33:28,809 [group-b] StorageService.java:1715 - Node xxx-167 state jump to shutdown

Notes:
- "group-a" includes 437 nodes in xxx and yyy, i.e. all of them except:
  - xxx-167: the draining node
  - yyy-081: still starting up
  - yyy-173: somehow didn't notice that xxx-167 was down
- "group-b" includes 438 nodes in xxx and yyy, i.e. all of them except:
  - xxx-167: the node wrongly marked as shutdown
  - yyy-173: the "patient zero" that suddenly decided xxx-167 was down

Obviously, we would like to know how this can happen, and whether there is any way to prevent it from happening again. Some other questions:

1. Should I create a JIRA issue?
2. For a rolling restart, is it better to do one datacenter at a time?
3. For a graceful shutdown, is the command `nodetool drain && sudo service cassandra restart` sufficient? The internet suggests running `nodetool disablebinary && nodetool disablethrift && nodetool disablegossip` before `nodetool drain`. However, looking at the 2.1 source code, `nodetool drain` already does the same things before draining. (A concrete sketch of the sequence we mean is in the P.S. below.)

Thanks
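P.S. For concreteness, this is the kind of per-node sequence we mean in question 3, plus a post-restart check from another node so that a node the rest of the ring still sees as "shutdown" gets noticed immediately rather than hours later. This is only a sketch: the explicit disable* calls are the belt-and-braces variant (apparently redundant given what drain does in 2.1, but harmless), and "some-other-node", the 1.2.3.4 address and the ~10-minute timeout are placeholders we have not tested:

    # On the node being bounced: explicitly stop client traffic and gossip,
    # then drain and restart.
    nodetool disablebinary && nodetool disablethrift && nodetool disablegossip
    nodetool drain
    sudo service cassandra restart

    # From a different node ("some-other-node" is a placeholder): before moving
    # on to the next node, wait until the bounced node (1.2.3.4) shows up as UN
    # again, or give up after ~10 minutes and investigate.
    for i in $(seq 1 60); do
        ssh some-other-node "nodetool status" | grep -q '^UN.*1\.2\.3\.4' && break
        sleep 10
    done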