Re: Cassandra takes more time to join the cluster

sat Wed, 27 Dec 2017 19:53:18 -0800

Hi,

This issue is related to
https://issues.apache.org/jira/browse/CASSANDRA-9630


Thanks and Regards
A.SathishKumar

On Sun, Dec 24, 2017 at 2:59 AM, sat <sathish.al...@gmail.com> wrote:

> Hi,
>
> We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL
> 7.2) all 3 nodes formed the cluster without any issues. However after
> reboot, we noticed rebooted node (7 out of 10 times) takes more time to
> join approximately 10  - 15min.
>
> Cassandra - 3.9 version
>
> While investigating the issue further we noticed
>
>    - Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both
>    the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and
>    3 as "DN"" only in "Node 1"
>    - After 10 - 15min we noticed *"Connection Timeout"*  exception in
>    Node 2 and 3. being thrown from OutboundTcpConnection.java (line # 311)
>    which triggers a state change event to "Node 1" and changes the state as
>    "UN".
>
>
>
>> *if (logger.isTraceEnabled())**                    logger.trace("error
>> writing to {}", poolReference.endPoint(), e);*
>
>
> Please let us know what triggers "Connection TimeOut" exception in "Node 2
> and 3" and ways to resolve this.
>
>
> Thanks and Regards
> A.SathishKumar
>
> On Fri, Dec 22, 2017 at 12:06 PM, sat <sathish.al...@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> Thanks for your prompt  response. Please find logs. Does Gossip has
>> dependency with NTP, because we find the node we reboot takes time to sync
>> with NTP.
>>
>> When will FailureDetector message kick in, we have this issue only when
>> we see Gossiper and Failure Detector log message.
>>
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 -
>> Average for /10.63.114.158 is 9.977908204545455E8
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - /
>> 10.63.114.158 local generation 1513967155, remote generation 1513967155
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 -
>> Updating heartbeat state version to 14666 from 14663 for /10.63.114.158
>> ...
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending
>> a EchoMessage to /10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 -
>> /10.63.114.150 sending ECHO to 6794@/10.63.114.158
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My
>> heartbeat is now 2258
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip
>> Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154
>> :14664 /10.63.114.158:1513967155:14666
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
>> a GossipDigestSyn to /10.63.114.158 ...
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 -
>> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
>> a GossipDigestSyn to /10.63.114.154 ...
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 -
>> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 -
>> Performing status check ...
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
>> PHI for /10.63.114.158 : 0.09877714244264273
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 -
>> PHI for /10.63.114.158 : 0.09877714244264273
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 -
>> mean for /10.63.114.158 : 9.977908204545455E8
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
>> PHI for /10.63.114.154 : 0.27857029580273446
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 -
>> PHI for /10.63.114.154 : 0.27857029580273446
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 -
>> mean for /10.63.114.154 : 7.65247444583815E8
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage
>> from /10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0
>> states
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
>> heartbeat version 2258 greater than 2257 for /10.63.114.150
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:84 - Sending a GossipDigestAck2Message
>> to /10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 -
>> /10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage
>> from /10.63.114.154
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:52 - Received ack with 2 digests and 1
>> states
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - /
>> 10.63.114.154 local generation 1513967154, remote generation 1513967154
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 -
>> Updating heartbeat state version to 14664 from 14664 for /10.63.114.154
>> ...
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending
>> a EchoMessage to /10.63.114.154
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 -
>> /10.63.114.150 sending ECHO to 6798@/10.63.114.154
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
>> heartbeat version 2258 greater than 2257 for /10.63.114.150
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
>> heartbeat version 14666 greater than 14663 for /10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,859 Gossiper.java:904 - Adding
>> state SEVERITY: 0.0
>>
>> Thanks and Regards
>> A.SathishKumar
>>
>>
>>
>> On Thu, Dec 21, 2017 at 9:40 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Not received is ambiguous.
>>>
>>> Can you paste the logs somewhere like pastebin? Sanitize/anonymize as
>>> needed.
>>>
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Dec 21, 2017, at 9:15 PM, sat <sathish.al...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We observe gossip is not received by other nodes and we see "Convicted"
>>> message in debug.log even though cassandra process is running in all 3
>>> nodes.
>>>
>>> Thanks and Regards
>>> A.SathisKumar
>>>
>>> On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> What’s it logging?
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>>
>>>> On Dec 21, 2017, at 8:56 PM, sat <sathish.al...@gmail.com> wrote:
>>>>
>>>> We have configured 3 node cassandra cluster in RHEL 7.2 version and we
>>>> are doing cluster testing. When we start cassandra in all 3 nodes they form
>>>> a cluster and they work fine.
>>>>
>>>> But when we bring one node down using "init 6" or "reboot" command, the
>>>> rebooted node takes more time to join the cluster, however if we manually
>>>> kill and start cassandra process the nodes join cluster immediately without
>>>> any issues.
>>>>
>>>> We have provided all 3 IPs as seed nodes and the cluster name is same
>>>> for all 3 nodes and their respective IP as listen address.
>>>>
>>>> Selinux is also disabled.
>>>>
>>>> Please help us in resolving this issue.
>>>>
>>>> Thanks
>>>>
>>>>
>>>
>>>
>>> --
>>> A.SathishKumar
>>> 044-24735023
>>>
>>>
>>
>>
>> --
>> A.SathishKumar
>> 044-24735023
>>
>
>
>
> --
> A.SathishKumar
> 044-24735023
>



-- 
A.SathishKumar
044-24735023

Re: Cassandra takes more time to join the cluster

Reply via email to