We have a large cluster running 2.1.19, with 3 datacenters:
- xxx: 220 nodes
- yyy: 220 nodes
- zzz: 500 nodes
The ping times between the datacenters are:
- xxx to yyy: 50 ms
- xxx to zzz: 240 ms
- yyy to zzz: 200 ms

There are some added complications:
- our team manages xxx and yyy only, and has no ssh/logging access to zzz.
- the link to zzz can be slow/unstable at times.

In a maintenance window a few days ago, we did a rolling restart, bouncing one node at a time in both xxx and yyy concurrently, with the command `nodetool drain && sudo service cassandra restart && sleep 30`. In the process, the node xxx-167 was somehow marked as "shutdown" by all other nodes, and remained that way indefinitely. We only noticed this about 8 hours later.

Then we checked the gossipinfo for xxx-167:

- from the node itself:

    xxx-167.domain/1.2.3.4
      generation:1509301942
      heartbeat:47368
      STATUS:14:NORMAL,<redacted>
      LOAD:47293:<redacted>
      SCHEMA:10:<redacted>
      DC:29:xxx
      RACK:31:a
      RELEASE_VERSION:4:2.1.19
      RPC_ADDRESS:3:1.2.3.4
      SEVERITY:47367:0.0
      NET_VERSION:1:8
      HOST_ID:2:<redacted>
      TOKENS:13:<hidden>

- from yyy-001:

    /1.2.3.4
      generation:1509301942
      heartbeat:2147483647
      STATUS:608083:shutdown,true
      LOAD:22:<redacted>
      SCHEMA:10:<redacted>
      DC:29:xxx
      RACK:31:a
      RELEASE_VERSION:4:2.1.19
      RPC_ADDRESS:3:1.2.3.4
      SEVERITY:176:0.0
      NET_VERSION:1:8
      HOST_ID:2:<redacted>
      TOKENS:13:<hidden>
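Side note: the disagreement above (NORMAL on the node itself vs shutdown,true elsewhere) is easy to script a check for. A rough sketch of what that could look like; hosts.txt (one reachable node per line) and the 1.2.3.4 address are placeholders, and this is not something we actually ran during the incident:

    # Sketch: ask every node how it currently sees xxx-167's gossip STATUS.
    # hosts.txt (one reachable hostname per line) and /1.2.3.4 are placeholders.
    for h in $(cat hosts.txt); do
        st=$(ssh "$h" nodetool gossipinfo | awk '
            /\/1\.2\.3\.4/ {hit=1; next}
            hit && /STATUS/ {print; exit}')
        echo "$h sees: $st"
    done | grep -v NORMAL    # anything left over (e.g. shutdown) deserves a look

Anything this prints (i.e. a non-NORMAL STATUS) would have flagged xxx-167 hours earlier.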
Another 2 hours later, we decided to just restart cassandra on xxx-167. That fixed the problem.

Here are some log lines from nodes in xxx and yyy that could be relevant to the incident:

    2017-10-29 18:31:55,324 [yyy-081] StorageService.java:1138 - DRAINING: starting drain process
    2017-10-29 18:31:57,897 [yyy-081] StorageService.java:1138 - DRAINED
    2017-10-29 18:32:06,116 [xxx-167] StorageService.java:1138 - DRAINING: starting drain process
    2017-10-29 18:32:06,357 [group-a] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    ... repeat ...
    2017-10-29 18:32:06,552 [group-a] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    2017-10-29 18:32:08,534 [xxx-167] StorageService.java:1138 - DRAINED
    2017-10-29 18:32:10,337 [yyy-081] CassandraDaemon.java:155 - Hostname: yyy-081.domain
    2017-10-29 18:32:12,939 [yyy-081] StorageService.java:807 - Starting up server gossip
    2017-10-29 18:32:13,484 [yyy-081] OutboundTcpConnection.java:496 - Handshaking version with yyy-173
    2017-10-29 18:32:13,486 [yyy-173] OutboundTcpConnection.java:496 - Handshaking version with yyy-081
    2017-10-29 18:32:13,493 [yyy-081] Gossiper.java:1029 - Node xxx-167 has restarted, now UP
    2017-10-29 18:32:13,495 [yyy-081] StorageService.java:1715 - Node xxx-167 state jump to shutdown
    2017-10-29 18:32:13,499 [yyy-081] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    2017-10-29 18:32:20,182 [xxx-167] CassandraDaemon.java:155 - Hostname: xxx-167.domain
    2017-10-29 18:32:22,304 [xxx-167] StorageService.java:807 - Starting up server gossip
    2017-10-29 18:32:22,463 [yyy-081] OutboundTcpConnection.java:496 - Handshaking version with xxx-167
    2017-10-29 18:32:22,780 [xxx-167] OutboundTcpConnection.java:496 - Handshaking version with yyy-173
    2017-10-29 18:32:23,857 [xxx-167] OutboundTcpConnection.java:496 - Handshaking version with yyy-081
    2017-10-29 18:32:25,797 [yyy-173] Gossiper.java:1029 - Node xxx-167 has restarted, now UP
    2017-10-29 18:32:25,847 [yyy-173] OutboundTcpConnection.java:496 - Handshaking version with xxx-167
    2017-10-29 18:32:27,113 [yyy-081] Gossiper.java:1029 - Node xxx-167 has restarted, now UP
    2017-10-29 18:33:13,443 [yyy-173] Gossiper.java:1010 - InetAddress xxx-167 is now DOWN
    2017-10-29 18:33:13,628 [group-b] StorageService.java:1715 - Node xxx-167 state jump to shutdown
    ... repeat ...
    2017-10-29 18:33:28,809 [group-b] StorageService.java:1715 - Node xxx-167 state jump to shutdown

Notes:
- "group-a" includes 437 nodes in xxx and yyy, i.e. all of them except:
  - xxx-167: the draining node
  - yyy-081: still starting up
  - yyy-173: somehow didn't notice that xxx-167 was down
- "group-b" includes 438 nodes in xxx and yyy, i.e. all of them except:
  - xxx-167: the node wrongly marked as shutdown
  - yyy-173: the "patient zero" that suddenly decided xxx-167 was down

Obviously, we would like to know how this can happen, and whether there is any way to prevent it from happening again. Some other questions:

1. Should I create a JIRA issue?
2. For a rolling restart, is it better to do one datacenter at a time?
3. For a graceful shutdown, is the command `nodetool drain && sudo service cassandra restart` sufficient? The internet suggests running `nodetool disablebinary && nodetool disablethrift && nodetool disablegossip` before `nodetool drain`. However, looking at the 2.1 source code, `nodetool drain` already does the same things before draining. (A concrete sketch of the sequence we mean is in the P.S. below.)

Thanks
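P.S. For concreteness, this is the kind of per-node sequence we mean in question 3, plus a post-restart check from another node so that a node the rest of the ring still sees as "shutdown" gets noticed immediately rather than hours later. This is only a sketch: the explicit disable* calls are the belt-and-braces variant (apparently redundant given what drain does in 2.1, but harmless), and "some-other-node", the 1.2.3.4 address and the ~10-minute timeout are placeholders we have not tested:

    # On the node being bounced: explicitly stop client traffic and gossip,
    # then drain and restart.
    nodetool disablebinary && nodetool disablethrift && nodetool disablegossip
    nodetool drain
    sudo service cassandra restart

    # From a different node ("some-other-node" is a placeholder): before moving
    # on to the next node, wait until the bounced node (1.2.3.4) shows up as UN
    # again, or give up after ~10 minutes and investigate.
    for i in $(seq 1 60); do
        ssh some-other-node "nodetool status" | grep -q '^UN.*1\.2\.3\.4' && break
        sleep 10
    done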