Hi, We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL 7.2) all 3 nodes formed the cluster without any issues. However after reboot, we noticed rebooted node (7 out of 10 times) takes more time to join approximately 10 - 15min.
Cassandra - 3.9 version While investigating the issue further we noticed - Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and 3 as "DN"" only in "Node 1" - After 10 - 15min we noticed *"Connection Timeout"* exception in Node 2 and 3. being thrown from OutboundTcpConnection.java (line # 311) which triggers a state change event to "Node 1" and changes the state as "UN". > *if (logger.isTraceEnabled())** logger.trace("error > writing to {}", poolReference.endPoint(), e);* Please let us know what triggers "Connection TimeOut" exception in "Node 2 and 3" and ways to resolve this. Thanks and Regards A.SathishKumar On Fri, Dec 22, 2017 at 12:06 PM, sat <sathish.al...@gmail.com> wrote: > Hi Jeff, > > Thanks for your prompt response. Please find logs. Does Gossip has > dependency with NTP, because we find the node we reboot takes time to sync > with NTP. > > When will FailureDetector message kick in, we have this issue only when we > see Gossiper and Failure Detector log message. > > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 - > Average for /10.63.114.158 is 9.977908204545455E8 > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - / > 10.63.114.158 local generation 1513967155, remote generation 1513967155 > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 - > Updating heartbeat state version to 14666 from 14663 for /10.63.114.158 > ... > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending > a EchoMessage to /10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - / > 10.63.114.150 sending ECHO to 6794@/10.63.114.158 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My > heartbeat is now 2258 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip > Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154:14664 > /10.63.114.158:1513967155:14666 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending > a GossipDigestSyn to /10.63.114.158 ... > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - / > 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending > a GossipDigestSyn to /10.63.114.154 ... > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - / > 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 - > Performing status check ... > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - > PHI for /10.63.114.158 : 0.09877714244264273 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 - > PHI for /10.63.114.158 : 0.09877714244264273 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 - > mean for /10.63.114.158 : 9.977908204545455E8 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - > PHI for /10.63.114.154 : 0.27857029580273446 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 - > PHI for /10.63.114.154 : 0.27857029580273446 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 - > mean for /10.63.114.154 : 7.65247444583815E8 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:41 > - Received a GossipDigestAckMessage from /10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:52 > - Received ack with 1 digests and 0 states > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local > heartbeat version 2258 greater than 2257 for /10.63.114.150 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:84 > - Sending a GossipDigestAck2Message to /10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - / > 10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:41 > - Received a GossipDigestAckMessage from /10.63.114.154 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:52 > - Received ack with 2 digests and 1 states > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - / > 10.63.114.154 local generation 1513967154, remote generation 1513967154 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 - > Updating heartbeat state version to 14664 from 14664 for /10.63.114.154 > ... > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending > a EchoMessage to /10.63.114.154 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - / > 10.63.114.150 sending ECHO to 6798@/10.63.114.154 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local > heartbeat version 2258 greater than 2257 for /10.63.114.150 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local > heartbeat version 14666 greater than 14663 for /10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,859 Gossiper.java:904 - Adding > state SEVERITY: 0.0 > > Thanks and Regards > A.SathishKumar > > > > On Thu, Dec 21, 2017 at 9:40 PM, Jeff Jirsa <jji...@gmail.com> wrote: > >> Not received is ambiguous. >> >> Can you paste the logs somewhere like pastebin? Sanitize/anonymize as >> needed. >> >> >> >> -- >> Jeff Jirsa >> >> >> On Dec 21, 2017, at 9:15 PM, sat <sathish.al...@gmail.com> wrote: >> >> Hi, >> >> We observe gossip is not received by other nodes and we see "Convicted" >> message in debug.log even though cassandra process is running in all 3 >> nodes. >> >> Thanks and Regards >> A.SathisKumar >> >> On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa <jji...@gmail.com> wrote: >> >>> What’s it logging? >>> >>> -- >>> Jeff Jirsa >>> >>> >>> On Dec 21, 2017, at 8:56 PM, sat <sathish.al...@gmail.com> wrote: >>> >>> We have configured 3 node cassandra cluster in RHEL 7.2 version and we >>> are doing cluster testing. When we start cassandra in all 3 nodes they form >>> a cluster and they work fine. >>> >>> But when we bring one node down using "init 6" or "reboot" command, the >>> rebooted node takes more time to join the cluster, however if we manually >>> kill and start cassandra process the nodes join cluster immediately without >>> any issues. >>> >>> We have provided all 3 IPs as seed nodes and the cluster name is same >>> for all 3 nodes and their respective IP as listen address. >>> >>> Selinux is also disabled. >>> >>> Please help us in resolving this issue. >>> >>> Thanks >>> >>> >> >> >> -- >> A.SathishKumar >> 044-24735023 >> >> > > > -- > A.SathishKumar > 044-24735023 > -- A.SathishKumar 044-24735023