Hi,

We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL
7.2) all 3 nodes formed the cluster without any issues. However after
reboot, we noticed rebooted node (7 out of 10 times) takes more time to
join approximately 10  - 15min.

Cassandra - 3.9 version

While investigating the issue further we noticed

   - Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both
   the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and
   3 as "DN"" only in "Node 1"
   - After 10 - 15min we noticed *"Connection Timeout"*  exception in Node
   2 and 3. being thrown from OutboundTcpConnection.java (line # 311)  which
   triggers a state change event to "Node 1" and changes the state as "UN".



> *if (logger.isTraceEnabled())**                    logger.trace("error
> writing to {}", poolReference.endPoint(), e);*


Please let us know what triggers "Connection TimeOut" exception in "Node 2
and 3" and ways to resolve this.


Thanks and Regards
A.SathishKumar

On Fri, Dec 22, 2017 at 12:06 PM, sat <sathish.al...@gmail.com> wrote:

> Hi Jeff,
>
> Thanks for your prompt  response. Please find logs. Does Gossip has
> dependency with NTP, because we find the node we reboot takes time to sync
> with NTP.
>
> When will FailureDetector message kick in, we have this issue only when we
> see Gossiper and Failure Detector log message.
>
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 -
> Average for /10.63.114.158 is 9.977908204545455E8
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - /
> 10.63.114.158 local generation 1513967155, remote generation 1513967155
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 -
> Updating heartbeat state version to 14666 from 14663 for /10.63.114.158
> ...
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending
> a EchoMessage to /10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - /
> 10.63.114.150 sending ECHO to 6794@/10.63.114.158
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My
> heartbeat is now 2258
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip
> Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154:14664
> /10.63.114.158:1513967155:14666
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
> a GossipDigestSyn to /10.63.114.158 ...
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - /
> 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
> a GossipDigestSyn to /10.63.114.154 ...
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - /
> 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 -
> Performing status check ...
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
> PHI for /10.63.114.158 : 0.09877714244264273
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 -
> PHI for /10.63.114.158 : 0.09877714244264273
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 -
> mean for /10.63.114.158 : 9.977908204545455E8
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
> PHI for /10.63.114.154 : 0.27857029580273446
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 -
> PHI for /10.63.114.154 : 0.27857029580273446
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 -
> mean for /10.63.114.154 : 7.65247444583815E8
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:41
> - Received a GossipDigestAckMessage from /10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:52
> - Received ack with 1 digests and 0 states
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
> heartbeat version 2258 greater than 2257 for /10.63.114.150
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:84
> - Sending a GossipDigestAck2Message to /10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - /
> 10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:41
> - Received a GossipDigestAckMessage from /10.63.114.154
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:52
> - Received ack with 2 digests and 1 states
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - /
> 10.63.114.154 local generation 1513967154, remote generation 1513967154
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 -
> Updating heartbeat state version to 14664 from 14664 for /10.63.114.154
> ...
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending
> a EchoMessage to /10.63.114.154
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - /
> 10.63.114.150 sending ECHO to 6798@/10.63.114.154
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
> heartbeat version 2258 greater than 2257 for /10.63.114.150
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
> heartbeat version 14666 greater than 14663 for /10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,859 Gossiper.java:904 - Adding
> state SEVERITY: 0.0
>
> Thanks and Regards
> A.SathishKumar
>
>
>
> On Thu, Dec 21, 2017 at 9:40 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Not received is ambiguous.
>>
>> Can you paste the logs somewhere like pastebin? Sanitize/anonymize as
>> needed.
>>
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Dec 21, 2017, at 9:15 PM, sat <sathish.al...@gmail.com> wrote:
>>
>> Hi,
>>
>> We observe gossip is not received by other nodes and we see "Convicted"
>> message in debug.log even though cassandra process is running in all 3
>> nodes.
>>
>> Thanks and Regards
>> A.SathisKumar
>>
>> On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> What’s it logging?
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Dec 21, 2017, at 8:56 PM, sat <sathish.al...@gmail.com> wrote:
>>>
>>> We have configured 3 node cassandra cluster in RHEL 7.2 version and we
>>> are doing cluster testing. When we start cassandra in all 3 nodes they form
>>> a cluster and they work fine.
>>>
>>> But when we bring one node down using "init 6" or "reboot" command, the
>>> rebooted node takes more time to join the cluster, however if we manually
>>> kill and start cassandra process the nodes join cluster immediately without
>>> any issues.
>>>
>>> We have provided all 3 IPs as seed nodes and the cluster name is same
>>> for all 3 nodes and their respective IP as listen address.
>>>
>>> Selinux is also disabled.
>>>
>>> Please help us in resolving this issue.
>>>
>>> Thanks
>>>
>>>
>>
>>
>> --
>> A.SathishKumar
>> 044-24735023
>>
>>
>
>
> --
> A.SathishKumar
> 044-24735023
>



-- 
A.SathishKumar
044-24735023

Reply via email to