Re: Cassandra takes more time to join the cluster
Hi, This issue is related to https://issues.apache.org/jira/browse/CASSANDRA-9630 Thanks and Regards A.SathishKumar On Sun, Dec 24, 2017 at 2:59 AM, sat wrote: > Hi, > > We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL > 7.2) all 3 nodes formed the cluster without any issues. However after > reboot, we noticed rebooted node (7 out of 10 times) takes more time to > join approximately 10 - 15min. > > Cassandra - 3.9 version > > While investigating the issue further we noticed > >- Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both >the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and >3 as "DN"" only in "Node 1" >- After 10 - 15min we noticed *"Connection Timeout"* exception in >Node 2 and 3. being thrown from OutboundTcpConnection.java (line # 311) >which triggers a state change event to "Node 1" and changes the state as >"UN". > > > >> *if (logger.isTraceEnabled())**logger.trace("error >> writing to {}", poolReference.endPoint(), e);* > > > Please let us know what triggers "Connection TimeOut" exception in "Node 2 > and 3" and ways to resolve this. > > > Thanks and Regards > A.SathishKumar > > On Fri, Dec 22, 2017 at 12:06 PM, sat wrote: > >> Hi Jeff, >> >> Thanks for your prompt response. Please find logs. Does Gossip has >> dependency with NTP, because we find the node we reboot takes time to sync >> with NTP. >> >> When will FailureDetector message kick in, we have this issue only when >> we see Gossiper and Failure Detector log message. >> >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 - >> Average for /10.63.114.158 is 9.977908204545455E8 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - / >> 10.63.114.158 local generation 1513967155, remote generation 1513967155 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 - >> Updating heartbeat state version to 14666 from 14663 for /10.63.114.158 >> ... >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending >> a EchoMessage to /10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - >> /10.63.114.150 sending ECHO to 6794@/10.63.114.158 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My >> heartbeat is now 2258 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip >> Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154 >> :14664 /10.63.114.158:1513967155:14666 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending >> a GossipDigestSyn to /10.63.114.158 ... >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - >> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending >> a GossipDigestSyn to /10.63.114.154 ... >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - >> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 - >> Performing status check ... >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - >> PHI for /10.63.114.158 : 0.09877714244264273 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 - >> PHI for /10.63.114.158 : 0.09877714244264273 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 - >> mean for /10.63.114.158 : 9.977908204545455E8 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - >> PHI for /10.63.114.154 : 0.27857029580273446 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 - >> PHI for /10.63.114.154 : 0.27857029580273446 >> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 - >> mean for /10.63.114.154 : 7.65247444583815E8 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage >> from /10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0 >> states >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local >> heartbeat version 2258 greater than 2257 for /10.63.114.150 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:84 - Sending a GossipDigestAck2Message >> to /10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - >> /10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage >> from /10.63.114.154 >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 >> GossipDigestAckVerbHandler.java:52 - Received ack with 2 digests and 1 >> states >> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - / >> 10.63.114.154 loc
Re: Cassandra takes more time to join the cluster
Hi, We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL 7.2) all 3 nodes formed the cluster without any issues. However after reboot, we noticed rebooted node (7 out of 10 times) takes more time to join approximately 10 - 15min. Cassandra - 3.9 version While investigating the issue further we noticed - Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and 3 as "DN"" only in "Node 1" - After 10 - 15min we noticed *"Connection Timeout"* exception in Node 2 and 3. being thrown from OutboundTcpConnection.java (line # 311) which triggers a state change event to "Node 1" and changes the state as "UN". > *if (logger.isTraceEnabled())**logger.trace("error > writing to {}", poolReference.endPoint(), e);* Please let us know what triggers "Connection TimeOut" exception in "Node 2 and 3" and ways to resolve this. Thanks and Regards A.SathishKumar On Fri, Dec 22, 2017 at 12:06 PM, sat wrote: > Hi Jeff, > > Thanks for your prompt response. Please find logs. Does Gossip has > dependency with NTP, because we find the node we reboot takes time to sync > with NTP. > > When will FailureDetector message kick in, we have this issue only when we > see Gossiper and Failure Detector log message. > > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 - > Average for /10.63.114.158 is 9.977908204545455E8 > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - / > 10.63.114.158 local generation 1513967155, remote generation 1513967155 > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 - > Updating heartbeat state version to 14666 from 14663 for /10.63.114.158 > ... > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending > a EchoMessage to /10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - / > 10.63.114.150 sending ECHO to 6794@/10.63.114.158 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My > heartbeat is now 2258 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip > Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154:14664 > /10.63.114.158:1513967155:14666 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending > a GossipDigestSyn to /10.63.114.158 ... > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - / > 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending > a GossipDigestSyn to /10.63.114.154 ... > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - / > 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 - > Performing status check ... > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - > PHI for /10.63.114.158 : 0.09877714244264273 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 - > PHI for /10.63.114.158 : 0.09877714244264273 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 - > mean for /10.63.114.158 : 9.977908204545455E8 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - > PHI for /10.63.114.154 : 0.27857029580273446 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 - > PHI for /10.63.114.154 : 0.27857029580273446 > TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 - > mean for /10.63.114.154 : 7.65247444583815E8 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:41 > - Received a GossipDigestAckMessage from /10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:52 > - Received ack with 1 digests and 0 states > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local > heartbeat version 2258 greater than 2257 for /10.63.114.150 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:84 > - Sending a GossipDigestAck2Message to /10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - / > 10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:41 > - Received a GossipDigestAckMessage from /10.63.114.154 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 > GossipDigestAckVerbHandler.java:52 > - Received ack with 2 digests and 1 states > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - / > 10.63.114.154 local generation 1513967154, remote generation 1513967154 > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 - > Updating heartbeat state version to 14664 from 14664 for /10.63.114.154 > ... > TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending
Re: Cassandra takes more time to join the cluster
Hi Jeff, Thanks for your prompt response. Please find logs. Does Gossip has dependency with NTP, because we find the node we reboot takes time to sync with NTP. When will FailureDetector message kick in, we have this issue only when we see Gossiper and Failure Detector log message. TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 - Average for /10.63.114.158 is 9.977908204545455E8 TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - / 10.63.114.158 local generation 1513967155, remote generation 1513967155 TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 - Updating heartbeat state version to 14666 from 14663 for /10.63.114.158 ... TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending a EchoMessage to /10.63.114.158 TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - / 10.63.114.150 sending ECHO to 6794@/10.63.114.158 TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My heartbeat is now 2258 TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154:14664 /10.63.114.158:1513967155:14666 TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending a GossipDigestSyn to /10.63.114.158 ... TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - / 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158 TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending a GossipDigestSyn to /10.63.114.154 ... TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - / 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154 TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 - Performing status check ... TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - PHI for /10.63.114.158 : 0.09877714244264273 TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 - PHI for /10.63.114.158 : 0.09877714244264273 TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 - mean for /10.63.114.158 : 9.977908204545455E8 TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 - PHI for /10.63.114.154 : 0.27857029580273446 TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 - PHI for /10.63.114.154 : 0.27857029580273446 TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 - mean for /10.63.114.154 : 7.65247444583815E8 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.63.114.158 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0 states TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local heartbeat version 2258 greater than 2257 for /10.63.114.150 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 GossipDigestAckVerbHandler.java:84 - Sending a GossipDigestAck2Message to / 10.63.114.158 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - / 10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.63.114.154 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 GossipDigestAckVerbHandler.java:52 - Received ack with 2 digests and 1 states TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - / 10.63.114.154 local generation 1513967154, remote generation 1513967154 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 - Updating heartbeat state version to 14664 from 14664 for /10.63.114.154 ... TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending a EchoMessage to /10.63.114.154 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - / 10.63.114.150 sending ECHO to 6798@/10.63.114.154 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local heartbeat version 2258 greater than 2257 for /10.63.114.150 TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local heartbeat version 14666 greater than 14663 for /10.63.114.158 TRACE [GossipStage:1] 2017-12-22 12:46:26,859 Gossiper.java:904 - Adding state SEVERITY: 0.0 Thanks and Regards A.SathishKumar On Thu, Dec 21, 2017 at 9:40 PM, Jeff Jirsa wrote: > Not received is ambiguous. > > Can you paste the logs somewhere like pastebin? Sanitize/anonymize as > needed. > > > > -- > Jeff Jirsa > > > On Dec 21, 2017, at 9:15 PM, sat wrote: > > Hi, > > We observe gossip is not received by other nodes and we see "Convicted" > message in debug.log even though cassandra process is running in all 3 > nodes. > > Thanks and Regards > A.SathisKumar > > On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa wrote: > >> What’s it logging? >> >> -- >> Jeff Jirsa >> >> >> On Dec 21, 2017, at 8:56 PM, sat wrote: >> >> We have configured 3 node cassandra cluster in
Re: Cassandra takes more time to join the cluster
Not received is ambiguous. Can you paste the logs somewhere like pastebin? Sanitize/anonymize as needed. -- Jeff Jirsa > On Dec 21, 2017, at 9:15 PM, sat wrote: > > Hi, > > We observe gossip is not received by other nodes and we see "Convicted" > message in debug.log even though cassandra process is running in all 3 nodes. > > Thanks and Regards > A.SathisKumar > >> On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa wrote: >> What’s it logging? >> >> -- >> Jeff Jirsa >> >> >>> On Dec 21, 2017, at 8:56 PM, sat wrote: >>> >>> We have configured 3 node cassandra cluster in RHEL 7.2 version and we are >>> doing cluster testing. When we start cassandra in all 3 nodes they form a >>> cluster and they work fine. >>> >>> But when we bring one node down using "init 6" or "reboot" command, the >>> rebooted node takes more time to join the cluster, however if we manually >>> kill and start cassandra process the nodes join cluster immediately without >>> any issues. >>> >>> We have provided all 3 IPs as seed nodes and the cluster name is same for >>> all 3 nodes and their respective IP as listen address. >>> >>> Selinux is also disabled. >>> >>> Please help us in resolving this issue. >>> >>> Thanks >>> >>> > > > > -- > A.SathishKumar > 044-24735023
Re: Cassandra takes more time to join the cluster
Hi, We observe gossip is not received by other nodes and we see "Convicted" message in debug.log even though cassandra process is running in all 3 nodes. Thanks and Regards A.SathisKumar On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa wrote: > What’s it logging? > > -- > Jeff Jirsa > > > On Dec 21, 2017, at 8:56 PM, sat wrote: > > We have configured 3 node cassandra cluster in RHEL 7.2 version and we are > doing cluster testing. When we start cassandra in all 3 nodes they form a > cluster and they work fine. > > But when we bring one node down using "init 6" or "reboot" command, the > rebooted node takes more time to join the cluster, however if we manually > kill and start cassandra process the nodes join cluster immediately without > any issues. > > We have provided all 3 IPs as seed nodes and the cluster name is same for > all 3 nodes and their respective IP as listen address. > > Selinux is also disabled. > > Please help us in resolving this issue. > > Thanks > > -- A.SathishKumar 044-24735023
Re: Cassandra takes more time to join the cluster
What’s it logging? -- Jeff Jirsa > On Dec 21, 2017, at 8:56 PM, sat wrote: > > We have configured 3 node cassandra cluster in RHEL 7.2 version and we are > doing cluster testing. When we start cassandra in all 3 nodes they form a > cluster and they work fine. > > But when we bring one node down using "init 6" or "reboot" command, the > rebooted node takes more time to join the cluster, however if we manually > kill and start cassandra process the nodes join cluster immediately without > any issues. > > We have provided all 3 IPs as seed nodes and the cluster name is same for all > 3 nodes and their respective IP as listen address. > > Selinux is also disabled. > > Please help us in resolving this issue. > > Thanks > >