Re: Cassandra takes more time to join the cluster

2017-12-27 Thread sat
Hi,

This issue is related to
https://issues.apache.org/jira/browse/CASSANDRA-9630

Thanks and Regards
A.SathishKumar

On Sun, Dec 24, 2017 at 2:59 AM, sat  wrote:

> Hi,
>
> We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL
> 7.2) all 3 nodes formed the cluster without any issues. However after
> reboot, we noticed rebooted node (7 out of 10 times) takes more time to
> join approximately 10  - 15min.
>
> Cassandra - 3.9 version
>
> While investigating the issue further we noticed
>
>- Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both
>the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and
>3 as "DN"" only in "Node 1"
>- After 10 - 15min we noticed *"Connection Timeout"*  exception in
>Node 2 and 3. being thrown from OutboundTcpConnection.java (line # 311)
>which triggers a state change event to "Node 1" and changes the state as
>"UN".
>
>
>
>> *if (logger.isTraceEnabled())**logger.trace("error
>> writing to {}", poolReference.endPoint(), e);*
>
>
> Please let us know what triggers "Connection TimeOut" exception in "Node 2
> and 3" and ways to resolve this.
>
>
> Thanks and Regards
> A.SathishKumar
>
> On Fri, Dec 22, 2017 at 12:06 PM, sat  wrote:
>
>> Hi Jeff,
>>
>> Thanks for your prompt  response. Please find logs. Does Gossip has
>> dependency with NTP, because we find the node we reboot takes time to sync
>> with NTP.
>>
>> When will FailureDetector message kick in, we have this issue only when
>> we see Gossiper and Failure Detector log message.
>>
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 -
>> Average for /10.63.114.158 is 9.977908204545455E8
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - /
>> 10.63.114.158 local generation 1513967155, remote generation 1513967155
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 -
>> Updating heartbeat state version to 14666 from 14663 for /10.63.114.158
>> ...
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending
>> a EchoMessage to /10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 -
>> /10.63.114.150 sending ECHO to 6794@/10.63.114.158
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My
>> heartbeat is now 2258
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip
>> Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154
>> :14664 /10.63.114.158:1513967155:14666
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
>> a GossipDigestSyn to /10.63.114.158 ...
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 -
>> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
>> a GossipDigestSyn to /10.63.114.154 ...
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 -
>> /10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 -
>> Performing status check ...
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
>> PHI for /10.63.114.158 : 0.09877714244264273
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 -
>> PHI for /10.63.114.158 : 0.09877714244264273
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 -
>> mean for /10.63.114.158 : 9.977908204545455E8
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
>> PHI for /10.63.114.154 : 0.27857029580273446
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 -
>> PHI for /10.63.114.154 : 0.27857029580273446
>> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 -
>> mean for /10.63.114.154 : 7.65247444583815E8
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage
>> from /10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0
>> states
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
>> heartbeat version 2258 greater than 2257 for /10.63.114.150
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:84 - Sending a GossipDigestAck2Message
>> to /10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 -
>> /10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage
>> from /10.63.114.154
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858
>> GossipDigestAckVerbHandler.java:52 - Received ack with 2 digests and 1
>> states
>> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - /
>> 10.63.114.154 loc

Re: Cassandra takes more time to join the cluster

2017-12-24 Thread sat
Hi,

We have 3 node cluster, before we reboot one of the VMWare VM/node (RHEL
7.2) all 3 nodes formed the cluster without any issues. However after
reboot, we noticed rebooted node (7 out of 10 times) takes more time to
join approximately 10  - 15min.

Cassandra - 3.9 version

While investigating the issue further we noticed

   - Node 1 (rebooted node) able to send "SYN", "ACK2" messages for both
   the nodes (Node 2, Node 3) even though nodetool status displays "Node 2 and
   3 as "DN"" only in "Node 1"
   - After 10 - 15min we noticed *"Connection Timeout"*  exception in Node
   2 and 3. being thrown from OutboundTcpConnection.java (line # 311)  which
   triggers a state change event to "Node 1" and changes the state as "UN".



> *if (logger.isTraceEnabled())**logger.trace("error
> writing to {}", poolReference.endPoint(), e);*


Please let us know what triggers "Connection TimeOut" exception in "Node 2
and 3" and ways to resolve this.


Thanks and Regards
A.SathishKumar

On Fri, Dec 22, 2017 at 12:06 PM, sat  wrote:

> Hi Jeff,
>
> Thanks for your prompt  response. Please find logs. Does Gossip has
> dependency with NTP, because we find the node we reboot takes time to sync
> with NTP.
>
> When will FailureDetector message kick in, we have this issue only when we
> see Gossiper and Failure Detector log message.
>
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 -
> Average for /10.63.114.158 is 9.977908204545455E8
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - /
> 10.63.114.158 local generation 1513967155, remote generation 1513967155
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 -
> Updating heartbeat state version to 14666 from 14663 for /10.63.114.158
> ...
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending
> a EchoMessage to /10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - /
> 10.63.114.150 sending ECHO to 6794@/10.63.114.158
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My
> heartbeat is now 2258
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip
> Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154:14664
> /10.63.114.158:1513967155:14666
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
> a GossipDigestSyn to /10.63.114.158 ...
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - /
> 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending
> a GossipDigestSyn to /10.63.114.154 ...
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - /
> 10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 -
> Performing status check ...
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
> PHI for /10.63.114.158 : 0.09877714244264273
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 -
> PHI for /10.63.114.158 : 0.09877714244264273
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 -
> mean for /10.63.114.158 : 9.977908204545455E8
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
> PHI for /10.63.114.154 : 0.27857029580273446
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 -
> PHI for /10.63.114.154 : 0.27857029580273446
> TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 -
> mean for /10.63.114.154 : 7.65247444583815E8
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:41
> - Received a GossipDigestAckMessage from /10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:52
> - Received ack with 1 digests and 0 states
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
> heartbeat version 2258 greater than 2257 for /10.63.114.150
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:84
> - Sending a GossipDigestAck2Message to /10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - /
> 10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:41
> - Received a GossipDigestAckMessage from /10.63.114.154
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 
> GossipDigestAckVerbHandler.java:52
> - Received ack with 2 digests and 1 states
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - /
> 10.63.114.154 local generation 1513967154, remote generation 1513967154
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 -
> Updating heartbeat state version to 14664 from 14664 for /10.63.114.154
> ...
> TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending

Re: Cassandra takes more time to join the cluster

2017-12-22 Thread sat
Hi Jeff,

Thanks for your prompt  response. Please find logs. Does Gossip has
dependency with NTP, because we find the node we reboot takes time to sync
with NTP.

When will FailureDetector message kick in, we have this issue only when we
see Gossiper and Failure Detector log message.

TRACE [GossipStage:1] 2017-12-22 12:46:26,759 FailureDetector.java:272 -
Average for /10.63.114.158 is 9.977908204545455E8
TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1140 - /
10.63.114.158 local generation 1513967155, remote generation 1513967155
TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:1193 - Updating
heartbeat state version to 14666 from 14663 for /10.63.114.158 ...
TRACE [GossipStage:1] 2017-12-22 12:46:26,759 Gossiper.java:986 - Sending a
EchoMessage to /10.63.114.158
TRACE [GossipStage:1] 2017-12-22 12:46:26,759 MessagingService.java:760 - /
10.63.114.150 sending ECHO to 6794@/10.63.114.158
TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:144 - My
heartbeat is now 2258
TRACE [GossipTasks:1] 2017-12-22 12:46:26,856 Gossiper.java:500 - Gossip
Digests are : /10.63.114.150:1513971247:2258 /10.63.114.154:1513967154:14664
/10.63.114.158:1513967155:14666
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending a
GossipDigestSyn to /10.63.114.158 ...
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - /
10.63.114.150 sending GOSSIP_DIGEST_SYN to 6795@/10.63.114.158
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:646 - Sending a
GossipDigestSyn to /10.63.114.154 ...
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 MessagingService.java:760 - /
10.63.114.150 sending GOSSIP_DIGEST_SYN to 6796@/10.63.114.154
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 Gossiper.java:757 -
Performing status check ...
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
PHI for /10.63.114.158 : 0.09877714244264273
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:315 -
PHI for /10.63.114.158 : 0.09877714244264273
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:316 -
mean for /10.63.114.158 : 9.977908204545455E8
TRACE [GossipTasks:1] 2017-12-22 12:46:26,857 FailureDetector.java:298 -
PHI for /10.63.114.154 : 0.27857029580273446
TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:315 -
PHI for /10.63.114.154 : 0.27857029580273446
TRACE [GossipTasks:1] 2017-12-22 12:46:26,858 FailureDetector.java:316 -
mean for /10.63.114.154 : 7.65247444583815E8
TRACE [GossipStage:1] 2017-12-22 12:46:26,858
GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from
/10.63.114.158
TRACE [GossipStage:1] 2017-12-22 12:46:26,858
GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0
states
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
heartbeat version 2258 greater than 2257 for /10.63.114.150
TRACE [GossipStage:1] 2017-12-22 12:46:26,858
GossipDigestAckVerbHandler.java:84 - Sending a GossipDigestAck2Message to /
10.63.114.158
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - /
10.63.114.150 sending GOSSIP_DIGEST_ACK2 to 6797@/10.63.114.158
TRACE [GossipStage:1] 2017-12-22 12:46:26,858
GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from
/10.63.114.154
TRACE [GossipStage:1] 2017-12-22 12:46:26,858
GossipDigestAckVerbHandler.java:52 - Received ack with 2 digests and 1
states
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1140 - /
10.63.114.154 local generation 1513967154, remote generation 1513967154
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:1193 - Updating
heartbeat state version to 14664 from 14664 for /10.63.114.154 ...
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:986 - Sending a
EchoMessage to /10.63.114.154
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 MessagingService.java:760 - /
10.63.114.150 sending ECHO to 6798@/10.63.114.154
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
heartbeat version 2258 greater than 2257 for /10.63.114.150
TRACE [GossipStage:1] 2017-12-22 12:46:26,858 Gossiper.java:889 - local
heartbeat version 14666 greater than 14663 for /10.63.114.158
TRACE [GossipStage:1] 2017-12-22 12:46:26,859 Gossiper.java:904 - Adding
state SEVERITY: 0.0

Thanks and Regards
A.SathishKumar



On Thu, Dec 21, 2017 at 9:40 PM, Jeff Jirsa  wrote:

> Not received is ambiguous.
>
> Can you paste the logs somewhere like pastebin? Sanitize/anonymize as
> needed.
>
>
>
> --
> Jeff Jirsa
>
>
> On Dec 21, 2017, at 9:15 PM, sat  wrote:
>
> Hi,
>
> We observe gossip is not received by other nodes and we see "Convicted"
> message in debug.log even though cassandra process is running in all 3
> nodes.
>
> Thanks and Regards
> A.SathisKumar
>
> On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa  wrote:
>
>> What’s it logging?
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Dec 21, 2017, at 8:56 PM, sat  wrote:
>>
>> We have configured 3 node cassandra cluster in 

Re: Cassandra takes more time to join the cluster

2017-12-21 Thread Jeff Jirsa
Not received is ambiguous. 

Can you paste the logs somewhere like pastebin? Sanitize/anonymize as needed.



-- 
Jeff Jirsa


> On Dec 21, 2017, at 9:15 PM, sat  wrote:
> 
> Hi,
> 
> We observe gossip is not received by other nodes and we see "Convicted" 
> message in debug.log even though cassandra process is running in all 3 nodes.
> 
> Thanks and Regards
> A.SathisKumar
> 
>> On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa  wrote:
>> What’s it logging? 
>> 
>> -- 
>> Jeff Jirsa
>> 
>> 
>>> On Dec 21, 2017, at 8:56 PM, sat  wrote:
>>> 
>>> We have configured 3 node cassandra cluster in RHEL 7.2 version and we are 
>>> doing cluster testing. When we start cassandra in all 3 nodes they form a 
>>> cluster and they work fine.
>>> 
>>> But when we bring one node down using "init 6" or "reboot" command, the 
>>> rebooted node takes more time to join the cluster, however if we manually 
>>> kill and start cassandra process the nodes join cluster immediately without 
>>> any issues.
>>> 
>>> We have provided all 3 IPs as seed nodes and the cluster name is same for 
>>> all 3 nodes and their respective IP as listen address.
>>> 
>>> Selinux is also disabled.
>>> 
>>> Please help us in resolving this issue.
>>> 
>>> Thanks
>>> 
>>> 
> 
> 
> 
> -- 
> A.SathishKumar
> 044-24735023


Re: Cassandra takes more time to join the cluster

2017-12-21 Thread sat
Hi,

We observe gossip is not received by other nodes and we see "Convicted"
message in debug.log even though cassandra process is running in all 3
nodes.

Thanks and Regards
A.SathisKumar

On Thu, Dec 21, 2017 at 9:12 PM, Jeff Jirsa  wrote:

> What’s it logging?
>
> --
> Jeff Jirsa
>
>
> On Dec 21, 2017, at 8:56 PM, sat  wrote:
>
> We have configured 3 node cassandra cluster in RHEL 7.2 version and we are
> doing cluster testing. When we start cassandra in all 3 nodes they form a
> cluster and they work fine.
>
> But when we bring one node down using "init 6" or "reboot" command, the
> rebooted node takes more time to join the cluster, however if we manually
> kill and start cassandra process the nodes join cluster immediately without
> any issues.
>
> We have provided all 3 IPs as seed nodes and the cluster name is same for
> all 3 nodes and their respective IP as listen address.
>
> Selinux is also disabled.
>
> Please help us in resolving this issue.
>
> Thanks
>
>


-- 
A.SathishKumar
044-24735023


Re: Cassandra takes more time to join the cluster

2017-12-21 Thread Jeff Jirsa
What’s it logging? 

-- 
Jeff Jirsa


> On Dec 21, 2017, at 8:56 PM, sat  wrote:
> 
> We have configured 3 node cassandra cluster in RHEL 7.2 version and we are 
> doing cluster testing. When we start cassandra in all 3 nodes they form a 
> cluster and they work fine.
> 
> But when we bring one node down using "init 6" or "reboot" command, the 
> rebooted node takes more time to join the cluster, however if we manually 
> kill and start cassandra process the nodes join cluster immediately without 
> any issues.
> 
> We have provided all 3 IPs as seed nodes and the cluster name is same for all 
> 3 nodes and their respective IP as listen address.
> 
> Selinux is also disabled.
> 
> Please help us in resolving this issue.
> 
> Thanks
> 
>