In Cassandra versions 2.1.11 - 2.1.16, after we decommission a node or
datacenter, the decommissioned nodes are still reported as DOWN (listed as
UNREACHABLE) in the output of "nodetool describecluster". However, the
nodes do not show up in the output of "nodetool status".
The decommissioned nodes also do not show up in the "system.peers" table
on the remaining nodes.
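
To make the mismatch easier to spot on a large cluster, the two command
outputs can be compared with a small script. The following is only a
sketch: it assumes the usual text layout of "nodetool describecluster" (a
line of the form "UNREACHABLE: [addr, addr]") and of "nodetool status"
(rows starting with a two-letter state such as UN or DN followed by the
address). The function names are made up for illustration, and the output
text has to be captured separately and passed in as strings.

```python
import re


def unreachable_from_describecluster(text):
    """Return the addresses on the UNREACHABLE line, or an empty set."""
    # Assumed layout: "UNREACHABLE: [10.0.0.3, 10.0.0.4]"
    match = re.search(r"UNREACHABLE:\s*\[([^\]]*)\]", text)
    if not match:
        return set()
    return {a.strip() for a in match.group(1).split(",") if a.strip()}


def addresses_from_status(text):
    """Return the addresses listed in the node rows of 'nodetool status'."""
    found = set()
    for line in text.splitlines():
        # Assumed layout: node rows start with Up/Down + state, e.g. "UN" or "DN".
        row = re.match(r"^[UD][NLJM]\s+(\S+)", line)
        if row:
            found.add(row.group(1))
    return found


def ghost_nodes(describecluster_text, status_text):
    """Addresses that are UNREACHABLE in describecluster but absent from status."""
    return (unreachable_from_describecluster(describecluster_text)
            - addresses_from_status(status_text))
```

A non-empty result from ghost_nodes() on a node would correspond to the
symptom described above: an address that describecluster still reports but
that status no longer knows about.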

The workaround we follow is a rolling restart of the cluster, which removes
the decommissioned nodes from the UNREACHABLE list and shows the actual
state of the cluster. This workaround is tedious for large clusters.

We also verified the decommission process with the CCM tool, and observed
the same issue for clusters running versions 2.1.12 to 2.1.16. The issue
was not observed in versions earlier or later than those mentioned above.


Has anybody in the community observed a similar issue? We have also raised
a JIRA issue regarding this:
https://issues.apache.org/jira/browse/CASSANDRA-13144


Below are the observed logs from a version without the bug and from a
version with the bug. The ones highlighted in yellow show the expected
logs. The ones highlighted in red are those where the node is recognized
as down and shows as UNREACHABLE.



Cassandra 2.1.1 Logs showing the decommissioned node :  (Without the bug)

2017-01-19 20:18:56,415 [GossipStage:1] DEBUG ArrivalWindow Ignoring
interval time of 2049943233 for /X.X.X.X
2017-01-19 20:18:56,416 [GossipStage:1] DEBUG StorageService Node /X.X.X.X
state left, tokens [ 59353109817657926242901533144729725259,
60254520910109313597677907197875221475,
75698727618038614819889933974570742305,
84508739091270910297310401957975430578]
2017-01-19 20:18:56,416 [GossipStage:1] DEBUG Gossiper adding expire time
for endpoint : /X.X.X.X (1485116334088)
2017-01-19 20:18:56,417 [GossipStage:1] INFO StorageService Removing
tokens [100434964734820719895982857900842892337,
114144647582686041354301802358217767299,
132090888860517964702932350041942412177,
138409460913927199437556572481804704749] for /X.X.X.X
2017-01-19 20:18:56,418 [HintedHandoff:3] INFO HintedHandOffManager
Deleting any stored hints for /X.X.X.X
2017-01-19 20:18:56,424 [GossipStage:1] DEBUG MessagingService Resetting
version for /X.X.X.X
2017-01-19 20:18:56,424 [GossipStage:1] DEBUG Gossiper removing endpoint
/X.X.X.X
2017-01-19 20:18:56,437 [GossipStage:1] DEBUG StorageService Ignoring state
change for dead or unknown endpoint: /X.X.X.X
2017-01-19 20:19:02,022 [WRITE-/X.X.X.X] DEBUG OutboundTcpConnection
attempting to connect to /X.X.X.X
2017-01-19 20:19:02,023 [HANDSHAKE-/X.X.X.X] INFO OutboundTcpConnection
Handshaking version with /X.X.X.X
2017-01-19 20:19:02,023 [WRITE-/X.X.X.X] DEBUG MessagingService Setting
version 7 for /X.X.X.X
2017-01-19 20:19:08,096 [GossipStage:1] DEBUG ArrivalWindow Ignoring
interval time of 2074454222 for /X.X.X.X
2017-01-19 20:19:54,407 [GossipStage:1] DEBUG ArrivalWindow Ignoring
interval time of 4302985797 for /X.X.X.X
2017-01-19 20:19:57,405 [GossipTasks:1] DEBUG Gossiper 60000 elapsed,
/X.X.X.X gossip quarantine over
2017-01-19 20:19:57,455 [GossipStage:1] DEBUG ArrivalWindow Ignoring
interval time of 3047826501 for /X.X.X.X
2017-01-19 20:19:57,455 [GossipStage:1] DEBUG StorageService Ignoring state
change for dead or unknown endpoint: /X.X.X.X


Cassandra 2.1.16 Logs showing the decommissioned node :   (The logs in
2.1.16 show the same as 2.1.1 up to "DEBUG Gossiper 60000 elapsed, /X.X.X.X
gossip quarantine over", which is then followed by "NODE is now DOWN")

2017-01-19 19:52:23,687 [GossipStage:1] DEBUG StorageService.java:1883 -
Node /X.X.X.X state left, tokens [-1112888759032625467,
-228773855963737699, -311455042375
4381391, -4848625944949064281, -6920961603460018610, -8566729719076824066,
1611098831406674636, 7278843689020594771, 7565410054791352413, 9166885764,
8654747784805453046]
2017-01-19 19:52:23,688 [GossipStage:1] DEBUG Gossiper.java:1520 - adding
expire time for endpoint : /X.X.X.X (1485114743567)
2017-01-19 19:52:23,688 [GossipStage:1] INFO StorageService.java:1965 -
Removing tokens [-1112888759032625467, -228773855963737699,
-3114550423754381391, -48486259449
49064281, -6920961603460018610, 5690722015779071557, 6202373691525063547,
7191120402564284381, 7278843689020594771, 7565410054791352413,
8524200089166885764, 865474778
4805453046] for /X.X.X.X
2017-01-19 19:52:23,689 [HintedHandoffManager:1] INFO
HintedHandOffManager.java:230 - Deleting any stored hints for /X.X.X.X
2017-01-19 19:52:23,689 [GossipStage:1] DEBUG MessagingService.java:840 -
Resetting version for /X.X.X.X
2017-01-19 19:52:23,690 [GossipStage:1] DEBUG Gossiper.java:417 - removing
endpoint /X.X.X.X
2017-01-19 19:52:23,691 [GossipStage:1] DEBUG StorageService.java:1552 -
Ignoring state change for dead or unknown endpoint: /X.X.X.X
2017-01-19 19:52:31,617 [MessagingService-Outgoing-/X.X.X.X] DEBUG
OutboundTcpConnection.java:372 - attempting to connect to /X.X.X.X
2017-01-19 19:52:31,618 [HANDSHAKE-/X.X.X.X] INFO
OutboundTcpConnection.java:488 - Handshaking version with /X.X.X.X
2017-01-19 19:52:31,619 [MessagingService-Outgoing-/X.X.X.X] DEBUG
MessagingService.java:826 - Setting version 8 for /X.X.X.X
2017-01-19 19:53:19,914 [GossipStage:1] DEBUG FailureDetector.java:423 -
Ignoring interval time of 6004119075 for /X.X.X.X
2017-01-19 19:53:23,702 [GossipTasks:1] DEBUG Gossiper.java:795 - 60000
elapsed, /X.X.X.X gossip quarantine over
2017-01-19 19:53:23,985 [GossipStage:1] DEBUG StorageService.java:1552 -
Ignoring state change for dead or unknown endpoint: /X.X.X.X
2017-01-19 19:53:26,223 [GossipStage:1] DEBUG FailureDetector.java:423 -
Ignoring interval time of 6309159352 for /X.X.X.X
2017-01-19 19:53:50,709 [GossipTasks:1] DEBUG Gossiper.java:336 -
Convicting /X.X.X.X with status LEFT - alive true
2017-01-19 19:53:50,709 [GossipTasks:1] INFO Gossiper.java:1008 -
InetAddress /X.X.X.X is now DOWN
2017-01-19 19:53:50,709 [GossipTasks:1] DEBUG MessagingService.java:429 -
Resetting pool for /X.X.X.X
2017-01-19 19:53:51,710 [GossipTasks:1] DEBUG Gossiper.java:336 -
Convicting /X.X.X.X with status LEFT - alive false
2017-01-19 19:53:53,711 [MessagingService-Outgoing-/X.X.X.X] DEBUG
OutboundTcpConnection.java:372 - attempting to connect to /X.X.X.X
2017-01-19 19:53:53,711 [GossipTasks:1] DEBUG Gossiper.java:336 -
Convicting /X.X.X.X with status LEFT - alive false
2017-01-19 19:53:54,711 [GossipTasks:1] DEBUG Gossiper.java:336 -
Convicting /X.X.X.X with status LEFT - alive false
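
The pattern in the 2.1.16 logs above (an endpoint first reported as "state
left" and later as "is now DOWN") can be searched for automatically. The
following is a rough sketch, not a tested tool: it assumes each log entry
sits on a single line in the actual log file (the entries above are wrapped
by the mail client), that DEBUG logging is enabled so the "state left"
message is present, and the function name is made up for illustration.

```python
import re

# Message fragments taken from the log excerpts above.
LEFT = re.compile(r"Node (/\S+) state left")
DOWN = re.compile(r"InetAddress (/\S+) is now DOWN")


def down_after_left(lines):
    """Return endpoints marked DOWN after they were already seen as LEFT."""
    left = set()
    flagged = []
    for line in lines:
        m = LEFT.search(line)
        if m:
            left.add(m.group(1))
            continue
        m = DOWN.search(line)
        if m and m.group(1) in left and m.group(1) not in flagged:
            flagged.append(m.group(1))
    return flagged
```

It can be fed an open log file directly, for example
down_after_left(open(path)) where path points at a node's log file. On the
2.1.1 excerpt it would return an empty list, and on the 2.1.16 excerpt it
would return the decommissioned endpoint.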



thanks

Sai
