[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-08-23 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589562#comment-16589562
 ] 

Joseph Lynch edited comment on CASSANDRA-14358 at 8/23/18 4:49 PM:
---

[~khuizhang] yea, that looks similar to the half-open keepalive connections. 
Did you try the kernel workaround, and did it help?

[~aweisberg] [~jasobrown] I've got a mitigation patch so that Cassandra trunk 
at least heals the half-open partitions faster. Please take a look if you have 
bandwidth for review. While testing the reconnection behavior I ran into 
CASSANDRA-14503 because the retry future was just getting clobbered by another 
message, so I couldn't test that we don't keep retrying after connections 
are killed (as right now they just retry on every message).
||trunk||
|[patch|https://github.com/apache/cassandra/compare/trunk...jolynch:CASSANDRA-14358]|
|[unit tests|https://circleci.com/gh/jolynch/workflows/cassandra/tree/CASSANDRA-14358]|

This patch makes the internode connection timeout (2s) and the internode TCP 
user timeout (30s) configurable. The timeouts are settable via JMX (and 
nodetool).
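
For anyone following along, here's a minimal sketch (not the actual patch; the class and constant names are made up) of how those two knobs can be applied to an outbound internode channel with Netty's epoll transport, which is what the trunk messaging layer is built on. {{TCP_USER_TIMEOUT}} is Linux-specific, hence the epoll assumption:
{noformat}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollSocketChannel;
import io.netty.channel.socket.SocketChannel;

public class InternodeTimeoutSketch
{
    // Illustrative defaults mirroring the values mentioned above.
    static final int CONNECT_TIMEOUT_MS = 2_000;
    static final int TCP_USER_TIMEOUT_MS = 30_000;

    public static Bootstrap outboundBootstrap(EpollEventLoopGroup group)
    {
        return new Bootstrap()
            .group(group)
            .channel(EpollSocketChannel.class)
            // Fail connect() if the peer does not answer within 2s
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, CONNECT_TIMEOUT_MS)
            // Abort the connection after 30s of unacknowledged writes instead
            // of walking the kernel's full tcp_retries2 back-off
            .option(EpollChannelOption.TCP_USER_TIMEOUT, TCP_USER_TIMEOUT_MS)
            .handler(new ChannelInitializer<SocketChannel>()
            {
                @Override
                protected void initChannel(SocketChannel ch)
                {
                    // internode pipeline (handshake, framing, ...) goes here
                }
            });
    }
}
{noformat}
With {{TCP_USER_TIMEOUT}} set, the kernel tears the connection down once written data has gone unacknowledged for that long, so a half-open peer is noticed in tens of seconds rather than many minutes.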

I'm marking this as patch available, as I think the operating system workaround 
is probably ok for previous releases? If it's not, just let me know and I can 
try to figure out how to fix it for those releases as well.

I didn't add any new tests because the only way I'm aware of to reproduce this 
behavior is by using iptables to blackhole traffic. I've been testing with a 
ccm cluster by trying to block just the small message channel (if gossip is 
blocked then the failure detector convicts the node) and doing a read at ALL:
{noformat}
$ netstat -on | grep 7000 | grep 127.0.0.3:7000 | grep -v "0 127.0.0.3"
tcp        0      0 127.0.0.1:55604    127.0.0.3:7000     ESTABLISHED keepalive (4093.37/0/0)
tcp        0      0 127.0.0.1:55610    127.0.0.3:7000     ESTABLISHED keepalive (4093.53/0/0)
tcp        0      0 127.0.0.1:58080    127.0.0.3:7000     ESTABLISHED keepalive (6601.96/0/0)

# Try to drop just the small message channel
$ sudo iptables -A OUTPUT -p tcp -d 127.0.0.3 --dport 7000 --sport 58080 -j DROP
{noformat}
Then I just check that we properly reconnect faster than 15 minutes. On an 
unpatched trunk I can watch the small message channel just sit there probing 
for 15 retries (the default tcp_retries2 value; note that this netstat is from a 
different run than the previous one):
{noformat}
$ netstat -on | grep 7000 | grep 127.0.0.3:7000 | grep -v "0 127.0.0.3"
tcp        0      0 127.0.0.1:34532    127.0.0.3:7000     ESTABLISHED keepalive (6823.99/0/0)
tcp        0      0 127.0.0.1:34564    127.0.0.3:7000     ESTABLISHED keepalive (6840.36/0/0)
tcp        0    808 127.0.0.1:34544    127.0.0.3:7000     ESTABLISHED probe (108.32/0/10)
tcp        0      0 127.0.0.1:34554    127.0.0.3:7000     ESTABLISHED keepalive (6840.26/0/0)
tcp        0      0 127.0.0.1:34540    127.0.0.3:7000     ESTABLISHED keepalive (6831.27/0/0)
tcp        0      0 127.0.0.1:34534    127.0.0.3:7000     ESTABLISHED keepalive (6828.96/0/0)

{noformat}
And then with my patch it gets killed after about 5 retries instead of waiting 
for the full 15.
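
As a back-of-the-envelope check on why {{tcp_retries2 = 15}} works out to roughly 15 minutes, here's the usual Linux back-off arithmetic (assuming the 200ms minimum initial RTO and the 120s RTO cap; real timings depend on the measured RTT):
{noformat}
public class Retries2Backoff
{
    public static void main(String[] args)
    {
        final double RTO_MIN = 0.2;   // Linux TCP_RTO_MIN, seconds
        final double RTO_MAX = 120.0; // Linux TCP_RTO_MAX, seconds
        double total = 0;
        // tcp_retries2 = 15 retransmissions: the kernel's give-up boundary is
        // roughly the sum of the 16 exponentially backed-off RTO periods.
        for (int i = 0; i <= 15; i++)
            total += Math.min(RTO_MIN * Math.pow(2, i), RTO_MAX);
        System.out.printf("~%.1f seconds (~%.1f minutes)%n", total, total / 60); // ~924.6s, ~15.4 min
    }
}
{noformat}
which lines up with the ~15 minute hangs reported in this ticket.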

If you have ideas for how to unit test this, I'm of course open to suggestions.


was (Author: jolynch):
[~khuizhang] yea that looks similar with the half open keepalive connections. 
Did you try the kernel workaround and did it help?

[~aweisberg] [~jasobrown] I've got a mitigation patch so that Cassandra trunk 
at least heals the half open partitions faster. Please take a look if you have 
bandwidth for review. While testing the re-connection behavior I ran into 
CASSANDRA-14503 because the retry future was just getting clobbered by another 
message, so I couldn't test that we don't keep retrying after just connections 
are killed (as right now they just retry every message).
||trunk||
|[patch|https://github.com/apache/cassandra/compare/trunk...jolynch:CASSANDRA-14358]|
|[unit 
tests|https://circleci.com/gh/jolynch/workflows/cassandra/tree/CASSANDRA-14358]|

This patch makes the timeout configurable for internode connection (2s) and 
internode tcp user timeout (30s). The timeouts are settable via JMX (and 
nodetool).

I'm marking this as patch available as I think the operating system workaround 
is probably ok for previous releases? If it's not just let me know and I can 
try to figure out how to fix it for those ones as well.

I didn't have any new tests because the only way I'm aware to reproduce this 
behavior is by using iptables to blackhole traffic. I've been testing with a 
ccm cluster when i disable gossip (with 3 nodes the gossip is so frequent that 
we mark the node as DOWN very fast):
{noformat}
$ netstat -on | grep 7000 | grep 127.0.0.3:7000 | grep -v "

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-21 Thread Joseph Lynch (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482919#comment-16482919
 ] 

Joseph Lynch edited comment on CASSANDRA-14358 at 5/21/18 7:27 PM:
---

[~alienth] that is interesting, and thank you for digging so deeply! If I 
understand correctly, during a {{drain}} the other servers are responsible for 
noticing the change and closing their connections within the 
{{[shutdown_announce_in_ms|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L1497]}} 
period, in 
[response|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/GossipShutdownVerbHandler.java#L37] 
to the {{GOSSIP_SHUTDOWN}} gossip state, and then the 
{{[markAsShutdown|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L363-L373]}} 
method marks the node down and forcibly convicts it. I believe that the TCP 
connections get closed via the {{StorageService}}'s 
{{[onDead|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L2514]}} 
method, which calls 
{{[MessagingService::reset|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/MessagingService.java#L505]}}, 
which calls 
{{[OutboundTcpConnection::closeSocket|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L80]}}, 
which [enqueues a sentinel|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L210] 
into the backlog; the 
{{[OutboundTcpConnection::run|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L253]}} 
method is then actually supposed to close the socket. The {{drainedMessages}} 
queue is a local reference, though, so the drained batch could contain messages 
that were enqueued before the {{CLOSE_SENTINEL}} as well as after it. This seems 
very racy to me; in particular, the reconnection logic might race with the 
closing logic from what I can tell, since we have a 2 second window between when 
the clients start closing and when the server will actually stop accepting new 
connections (because it closes the listeners).
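
To make the race concrete, here's a deliberately simplified, self-contained model (not the real {{OutboundTcpConnection}} code) of a consumer that drains the backlog into a local batch: a message enqueued after the {{CLOSE_SENTINEL}} but before the drain lands in the same batch and triggers a reconnect right after the close:
{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class SentinelRaceSketch
{
    private static final String CLOSE_SENTINEL = "CLOSE_SENTINEL";
    private static final LinkedBlockingQueue<String> backlog = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException
    {
        // Producer side: a normal message, then the close request...
        backlog.put("MUTATION-1");
        backlog.put(CLOSE_SENTINEL);
        // ...and another message that sneaks in before the consumer drains.
        backlog.put("MUTATION-2");

        // Consumer side (the connection thread): drain everything into a local batch.
        List<String> drainedMessages = new ArrayList<>();
        backlog.drainTo(drainedMessages);

        boolean connected = true;
        for (String msg : drainedMessages)
        {
            if (CLOSE_SENTINEL.equals(msg))
            {
                connected = false;              // socket closed here
                System.out.println("closed socket");
                continue;
            }
            if (!connected)
            {
                connected = true;               // reconnect races with the close above
                System.out.println("reconnected to send " + msg);
            }
            System.out.println("sent " + msg);
        }
    }
}
{noformat}
In the real code the sentinel handling and reconnect are of course more involved; the sketch just shows why a single drained batch can mix pre- and post-close messages.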

A non-stateful network would surface the RST in the {{writeConnected}} method, 
but AWS is basically "yea, that machine isn't allowed to talk to that one" and 
just blackholes the RSTs... I wonder if I can reproduce this by increasing that 
window significantly and just sending lots of traffic.


was (Author: jolynch):
[~alienth] that is interesting and thank you for digging so deeply! If I 
understand correctly during a {{drain}} the other servers are responsible for 
noticing the change and closing their connections within the 
{{[shutdown_announce_in_ms|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L1497]}}
 period in 
[response|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/GossipShutdownVerbHandler.java#L37]
 to the {{GOSSIP_SHUTDOWN}} gossip state, and then the 
{{[markAsShutdown|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L363-L373]}}
 method marks it down and forcibly convicts it. I believe that the TCP 
connections get closed via the {{StorageService}}'s {{onDead}} method which 
calls 
{{[onDead|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L2514]}}
 which calls 
{{[MessagingService::reset|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/MessagingService.java#L505]}}
 which calls 
{{[OutboundTcpConnection::closeSocket|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L80],
 which [enqueues a 
sentinel|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L210]}}
 into the backlog and then the 
{{[OutboundTcpConnection::run|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L253]}}
 method is actually supposed to close it. The {{drainedMessages}} queue is a 
local reference though so backlog could get something that was enqueued before 
the {{CLOSE_SENTINEL}} a

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464622#comment-16464622
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 4:40 AM:
--

Why on earth didn't the failure detector on {{.40}} see anything for 20 
minutes? Possibly due to blocking on an Outbound forever?


was (Author: alienth):
Why on earth didn't the failure detector in {{.40}} see anything wrong for 20 
minutes? Possibly due to blocking on an Outbound forever?

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. The smallest cluster I've seen the problem on 
> is 12 nodes; it reproduces more reliably on 40+ node clusters, and 300 node 
> clusters consistently reproduce it on at least one node.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHOs. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have a fully handshaked SSL connection at 
> that point in the code.
> Current theory:
>  # "New" node restarts,  "Old" node calls 
> [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
>  # Old node starts [creating a 
> new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141]
>  SSL socket 
>  # SSLSocket calls 
> [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98],
>  which conveniently calls connect with a default timeout of "forever". We 
> could hang here forever until the OS kills us.
>  # If we continue, we get to 
> [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassa
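
To illustrate the quoted theory above (a connect with no timeout can block until the OS intervenes), here's a minimal JSSE sketch, not Cassandra's {{SSLFactory}}, contrasting the create-and-connect-immediately path with connecting an unbound socket under explicit timeouts first:
{noformat}
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class ConnectTimeoutSketch
{
    public static SSLSocket connectWithTimeouts(String host, int port,
                                                int connectTimeoutMs, int readTimeoutMs) throws Exception
    {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        // factory.createSocket(host, port) connects immediately with no timeout,
        // which is the "default timeout of forever" failure mode described above.
        // Connecting an unbound socket first lets us bound both connect() and reads.
        Socket raw = new Socket();
        raw.connect(new InetSocketAddress(host, port), connectTimeoutMs); // bounded connect
        raw.setSoTimeout(readTimeoutMs);                                  // bounded reads (SO_TIMEOUT)

        // Layer TLS over the already-connected socket; the handshake's reads
        // are now bounded by SO_TIMEOUT instead of blocking indefinitely.
        SSLSocket ssl = (SSLSocket) factory.createSocket(raw, host, port, true);
        ssl.startHandshake();
        return ssl;
    }
}
{noformat}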

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464596#comment-16464596
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 4:24 AM:
--

[~jasobrown] {{.88}} was down for about 10 minutes, and was shut down via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until the socket to 
{{.88}} is killed about 20 minutes later. {{.88}} was shut down at 1602 and 
came back up at 1612:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

{{INFO  [main] 2018-05-04 16:12:39,746 StorageService.java:1449 - JOINING: 
Finish joining ring}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


was (Author: alienth):
[~jasobrown] {{.88}} was down for about 10 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until the socket to 
{{.88}} is killed about 20 minutes later. {{.88}} was shut down at 1602 and 
came back up at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

{{INFO  [main] 2018-05-04 16:12:39,746 StorageService.java:1449 - JOINING: 
Finish joining ring}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464604#comment-16464604
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:47 AM:
--

Extra note: At the same second that {{.88}}'s gossiper announced shutdown, 
there was a newly established socket from {{.40}} to {{.88}}:

{{Fri May 4 16:02:15 PDT 2018}}
 {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May 4 16:02:20 PDT 2018}}
 {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}}


was (Author: alienth):
Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a 
newly established socket from {{.40}} to {{.88}}:

{{Fri May 4 16:02:15 PDT 2018}}
 {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May 4 16:02:20 PDT 2018}}
 {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}}


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464604#comment-16464604
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:37 AM:
--

Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a 
newly established socket from {{.40}} to {{.88}}:

{{Fri May 4 16:02:15 PDT 2018}}
 {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May 4 16:02:20 PDT 2018}}
 {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}}


was (Author: alienth):
Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a 
newly established socket from {{.40}} to {{.88}}:

{{Fri May 4 16:02:15 PDT 201}}
 {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May 4 16:02:20 PDT 2018}}
 {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}}


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464604#comment-16464604
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:36 AM:
--

Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a 
newly established socket from {{.40}} to {{.88}}:

{{Fri May 4 16:02:15 PDT 201}}
 {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May 4 16:02:20 PDT 2018}}
 {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}}


was (Author: alienth):
Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a 
newly established socket to {{.40}}:

{{Fri May  4 16:02:15 PDT 201}}
{{tcpESTAB  0  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min59sec,0) uid:115 
ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May  4 16:02:20 PDT 2018}}
{{tcpCLOSE-WAIT 1  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min54sec,0) uid:115 
ino:5913893 sk:216e -->}}


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464604#comment-16464604
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:34 AM:
--

Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a 
newly established socket to {{.40}}:

{{Fri May  4 16:02:15 PDT 201}}
{{tcpESTAB  0  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min59sec,0) uid:115 
ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May  4 16:02:20 PDT 2018}}
{{tcpCLOSE-WAIT 1  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min54sec,0) uid:115 
ino:5913893 sk:216e -->}}


was (Author: alienth):
Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a 
newly established socket to {{.40}}:

{{Fri May  4 16:02:15 PDT 2018 }}
{{tcpESTAB  0  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min59sec,0) uid:115 
ino:5913893 sk:216e <->}}

This is the same socket that would become CLOSE-WAIT only seconds later, and 
remain that way for the following 18 minutes:

{{Fri May  4 16:02:20 PDT 2018}}
{{tcpCLOSE-WAIT 1  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min54sec,0) uid:115 
ino:5913893 sk:216e -->}}


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464596#comment-16464596
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:27 AM:
--

[~jasobrown] {{.88}} was down for about 10 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until the socket to 
{{.88}} is killed about 20 minutes later. {{.88}} was shut down at 1602 and 
came back up at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

{{INFO  [main] 2018-05-04 16:12:39,746 StorageService.java:1449 - JOINING: 
Finish joining ring}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


was (Author: alienth):
[~jasobrown] {{.88}} was down for about 20 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until {{.88}} comes 
back 20 minutes later. {{.88}} was shut down at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464596#comment-16464596
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:26 AM:
--

[~jasobrown] {{.88}} was down for about 10 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until {{.88}} comes 
back 20 minutes later. {{.88}} was shut down at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


was (Author: alienth):
[~jasobrown] {{.88}} was down for about 10 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until {{.88}} comes 
back 20 minutes later. {{.88}} was shut down at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464596#comment-16464596
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:26 AM:
--

[~jasobrown] {{.88}} was down for about 20 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until {{.88}} comes 
back 20 minutes later. {{.88}} was shut down at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


was (Author: alienth):
[~jasobrown] {{.88}} was down for about 10 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until {{.88}} comes 
back 20 minutes later. {{.88}} was shut down at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464596#comment-16464596
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:25 AM:
--

[~jasobrown] {{.88}} was down for about 10 minutes, and was shutdown via 
`nodetool drain`, followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until {{.88}} comes 
back 20 minutes later. {{.88}} was shut down at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.


was (Author: alienth):
{{.88}} was down for about 10 minutes, and was shutdown via `nodetool drain`, 
followed by stopping the service gracefully.


{{.40}} has no gossiper logs showing {{.88}} going down, until {{.88}} comes 
back 20 minutes later. {{.88}} was shut down at 1602:

{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 
StorageService.java:1449 - DRAINING: starting drain process}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 
HintsService.java:220 - Paused hints dispatch}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 
ThriftServer.java:139 - Stop listening to thrift clients}}
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 
Server.java:176 - Stop listening for CQL clients
{{INFO  [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 
Gossiper.java:1540 - Announcing shutdown}}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down 
until ~20 minutes later:

{{INFO  [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - 
InetAddress /10.0.107.88 is now DOWN}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 
OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node 
/10.0.107.88 has restarted, now UP}}
{{INFO  [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node 
/10.0.107.88 state jump to NORMAL}}

That is.. bizarre.

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> c

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464566#comment-16464566
 ] 

Jason Brown edited comment on CASSANDRA-14358 at 5/5/18 2:06 AM:
-

[~alienth] nice detective work. It looks like {{10.0.161.40}} hasn't closed the 
previous sockets/connections. If you still have the logs of {{10.0.161.40}} 
handy, can you see if, during and after the bounce, there are log statements 
about {{10.0.107.88}} being down or alive (or restarted)? I'm not looking at the 
code to check the exact log messages, but typically they are emitted by 
{{Gossiper}}.

Also, how long did it take  {{10.0.107.88}} to bounce? If it went down and came 
back up fast enough, it's possible the failure detector on {{10.0.161.40}} 
didn't mark it as down.

How did you terminate {{10.0.107.88}}? A normal shutdown, or {{kill -9}}? 


was (Author: jasobrown):
[~alienth] nice detective work. it looks like {{10.0.161.40}} hasn't closed the 
previous sockets/connections. If you have the logs of {{10.0.161.40}} still 
handy, can you see if, during and after the bounce, there are log statements 
about  {{10.0.107.88}} being down or alive (or restarted). Not looking at the 
code to see the exact log messages, but typically they are emitted by 
{{Gossiper}}.

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHO's. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have a fully handshaked SSL connection at 
> that point in the code.
> Current theory:
>  # "New" node restarts,  "Old" node calls 
> [newSocket|https

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464495#comment-16464495
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 11:44 PM:
---

Captured the socket state for before, during, and after the restart, from the 
POV of a node which the restarted node sees as down:

Before, when nothing has been done:
 {{Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port}}
 {{tcp   ESTAB 0      0      10.0.161.40:59739  10.0.107.88:7000 
timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}}

After 10.0.107.88 has been restarted; note the {{1}} in the recv-q:
 {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 
timer:(keepalive,4min14sec,0) uid:115 ino:5913893 sk:216e -->}}

When 10.0.107.88 comes back up, 10.0.161.40 tries to respond to the EchoMessage 
using the previous socket, which has been sitting in CLOSE-WAIT. You can see the 
outbound messages piling up in the send-q:
 {{tcp CLOSE-WAIT 1 36527 10.0.161.40:59739 10.0.107.88:7000 
timer:(on,1.932ms,4) uid:115 ino:5913893 sk:216e -->}}


was (Author: alienth):
Captured the socket state for before, during, and after the restart, from the 
POV of a node which the restarted node sees as down:

Before, when nothing has been done:
{{Netid  State  Recv-Q Send-Q Local Address:Port   Peer 
Address:Port}}
{{tcpESTAB  0  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min59sec,0) uid:115 
ino:5913893 sk:216e <->}}

After 10.0.107.88 has been restarted. Note the {{1}} in in the recv-q:
{{tcpCLOSE-WAIT 1  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min14sec,0) uid:115 
ino:5913893 sk:216e -->}}

When 10.0.107.88 comes back up and 10.0.161.40 tries to respond to the 
EchoMessage, using the previous socket which has been in CLOSE-WAIT. You can 
see the outbounds piling up in the send-q:
{{tcpCLOSE-WAIT 1  36527  10.0.161.40:59739  
10.0.107.88:7000timer:(on,1.932ms,4) uid:115 ino:5913893 
sk:216e -->}}

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQU

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464495#comment-16464495
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 11:43 PM:
---

Captured the socket state for before, during, and after the restart, from the 
POV of a node which the restarted node sees as down:

Before, when nothing has been done:
{{Netid  State  Recv-Q Send-Q Local Address:Port   Peer 
Address:Port}}
{{tcpESTAB  0  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min59sec,0) uid:115 
ino:5913893 sk:216e <->}}

After 10.0.107.88 has been restarted; note the {{1}} in the recv-q:
{{tcpCLOSE-WAIT 1  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min14sec,0) uid:115 
ino:5913893 sk:216e -->}}

When 10.0.107.88 comes back up, 10.0.161.40 tries to respond to the EchoMessage 
using the previous socket, which has been sitting in CLOSE-WAIT. You can see the 
outbound messages piling up in the send-q:
{{tcpCLOSE-WAIT 1  36527  10.0.161.40:59739  
10.0.107.88:7000timer:(on,1.932ms,4) uid:115 ino:5913893 
sk:216e -->}}


was (Author: alienth):
Captured the socket state for before, during, and after the restart, from the 
POV of a node which the restarted node sees as down:

Before, when nothing has been done:
{{tcpESTAB  0  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min59sec,0) uid:115 
ino:5913893 sk:216e <->}}

After 10.0.107.88 has been restarted. Note the {{1}} in in the recv-q:
{{tcpCLOSE-WAIT 1  0  10.0.161.40:59739  
10.0.107.88:7000timer:(keepalive,4min14sec,0) uid:115 
ino:5913893 sk:216e -->}}

When 10.0.107.88 comes back up and 10.0.161.40 tries to respond to the 
EchoMessage, using the previous socket which has been in CLOSE-WAIT:
{{tcpCLOSE-WAIT 1  36527  10.0.161.40:59739  
10.0.107.88:7000timer:(on,1.932ms,4) uid:115 ino:5913893 
sk:216e -->}}

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPO

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464368#comment-16464368
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 8:53 PM:
--

Reproduced this behaviour several times now, and in all cases the socket that 
Cassandra is trying to send outbound on has been stuck in `CLOSE_WAIT`.

I think that suggests the more common case might be AWS ceasing to track the 
flow for a FIN'd connection, with an ACK being dropped somewhere, resulting in a 
node never fully closing the connection?

Note that I'm seeing this on Cassandra 3.11.2, which has the patches from 
CASSANDRA-9630.

{{tcpCLOSE-WAIT 1  322683 10.0.161.40:18254  
10.0.109.39:7000timer:(on,1min54sec,13) uid:115 ino:5837280 
sk:1a1 -->}}


was (Author: alienth):
Reproduced this behaviour several times now,and in all cases the socket that 
cassandra is trying to Outbound on has been in stuck in `CLOSE_WAIT`.

I think that suggests that the more common case might be where AWS stops 
tracking the flow for a FINd connection, and an ACK being dropped somewhere 
resulting in a node not fully closing the connection.

Note that I'm seeing this on cassandra 3.11.2, which has the patches from 
CASSANDRA-9630

{{tcpCLOSE-WAIT 1  322683 10.0.161.40:18254  
10.0.109.39:7000timer:(on,1min54sec,13) uid:115 ino:5837280 
sk:1a1 -->}}

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHO's. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464368#comment-16464368
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 8:52 PM:
--

Reproduced this behaviour several times now, and in all cases the socket that 
Cassandra is trying to send outbound on has been stuck in `CLOSE_WAIT`.

I think that suggests the more common case might be AWS ceasing to track the 
flow for a FIN'd connection, with an ACK being dropped somewhere, resulting in a 
node never fully closing the connection.

Note that I'm seeing this on Cassandra 3.11.2, which has the patches from 
CASSANDRA-9630.

{{tcpCLOSE-WAIT 1  322683 10.0.161.40:18254  
10.0.109.39:7000timer:(on,1min54sec,13) uid:115 ino:5837280 
sk:1a1 -->}}


was (Author: alienth):
Reproduced this behaviour several times now,and in all cases the socket that 
cassandra is trying to Outbound on has been in stuck in `CLOSE_WAIT`.

I think that suggests that the more common case might be where AWS stops 
tracking the flow for a FINd connection, and an ACK being dropped somewhere 
resulting in a node not fully closing the connection.


{{tcpCLOSE-WAIT 1  322683 10.0.161.40:18254  
10.0.109.39:7000timer:(on,1min54sec,13) uid:115 ino:5837280 
sk:1a1 -->}}

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHO's. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have a fully handshaked SSL connection at 
> that point in the code.
> Current theory:
>  # "N

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464368#comment-16464368
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 8:44 PM:
--

Reproduced this behaviour several times now, and in all cases the socket that 
Cassandra is trying to send outbound on has been stuck in `CLOSE_WAIT`.

I think that suggests the more common case might be AWS ceasing to track the 
flow for a FIN'd connection, with an ACK being dropped somewhere, resulting in a 
node never fully closing the connection.


{{tcpCLOSE-WAIT 1  322683 10.0.161.40:18254  
10.0.109.39:7000timer:(on,1min54sec,13) uid:115 ino:5837280 
sk:1a1 -->}}


was (Author: alienth):
Reproduced this behaviour several times now,and in all cases the socket that 
cassandra is trying to Outbound on has been in stuck in `CLOSE_WAIT`.

I think that suggests that the more common case might be where AWS stops 
tracking the flow for a FINd connection having a dropped ACK somewhere.


{{tcpCLOSE-WAIT 1  322683 10.0.161.40:18254  
10.0.109.39:7000timer:(on,1min54sec,13) uid:115 ino:5837280 
sk:1a1 -->}}

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHO's. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have a fully handshaked SSL connection at 
> that point in the code.
> Current theory:
>  # "New" node restarts,  "Old" node calls 
> [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apa

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-05-04 Thread Jason Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464368#comment-16464368
 ] 

Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 8:42 PM:
--

Reproduced this behaviour several times now, and in all cases the socket that 
Cassandra is trying to send outbound on has been stuck in `CLOSE_WAIT`.

I think that suggests the more common case might be AWS ceasing to track the 
flow for a FIN'd connection that had a dropped ACK somewhere.


{{tcpCLOSE-WAIT 1  322683 10.0.161.40:18254  
10.0.109.39:7000timer:(on,1min54sec,13) uid:115 ino:5837280 
sk:1a1 -->}}


was (Author: alienth):
Reproduced this behaviour several times now,and in all cases the socket that 
cassandra is trying to Outbound on has been in stuck in `CLOSE_WAIT`.

I think that suggests that the more common case might be where AWS stops 
tracking the flow for a FINd connection having a dropped ACK somewhere.

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHO's. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have a fully handshaked SSL connection at 
> that point in the code.
> Current theory:
>  # "New" node restarts,  "Old" node calls 
> [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
>  # Old node starts [creating a 
> new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTc

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-10 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433171#comment-16433171
 ] 

Ariel Weisberg edited comment on CASSANDRA-14358 at 4/10/18 11:20 PM:
--

30 seconds (effectively 60) and a hot prop sounds excellent.


was (Author: aweisberg):
30 seconds and a hot prop sounds excellent.

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHO's. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have a fully handshaked SSL connection at 
> that point in the code.
> Current theory:
>  # "New" node restarts,  "Old" node calls 
> [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
>  # Old node starts [creating a 
> new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141]
>  SSL socket 
>  # SSLSocket calls 
> [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98],
>  which conveniently calls connect with a default timeout of "forever". We 
> could hang here forever until the OS kills us.
>  # If we continue, we get to 
> [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263]
>  which eventually calls 
> [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-09 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431491#comment-16431491
 ] 

Ariel Weisberg edited comment on CASSANDRA-14358 at 4/9/18 11:27 PM:
-

Nice find! The numbers you picked seem pretty aggressive; I was expecting 
something in the 30-60 second range before giving up. If the network conditions 
are unreliable but working, you might drop connections and then not be able to 
recreate them quickly due to packet loss. Granted, things are probably mostly 
unavailable anyway in that scenario, but you don't want them looping too tightly 
trying to recreate connections.

The last question is whether this needs to be a hot property or not.


was (Author: aweisberg):
Nice find! Seems like something we should set this and make configurable in the 
YAML. The numbers you picked seem pretty aggressive. I was expecting in the 
30-60 second range before giving up. If the network conditions are unreliable 
but working you might drop connections and then not be able to recreate them 
quickly due to packet loss. Granted things are probably mostly unavailable 
anyways in that scenario, but you don't want them looping too tightly trying to 
recreate connections.

The last question is if this needs to be a hot property or not.

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new 
> node's ECHOs, but the new node is never getting the ECHO's. This appears, to 
> me, to indicate that the old node's {{OutboundTcpConnection}} thread is just 
> stuck and can't make any forward progress. By the time we could notice this 
> and slap TRACE logging on, the only thing we see is ~10 minutes later a 
> {{SocketException}} inside {{writeConnected}}'s flush and an immediate 
> recovery. It is interesting to me that the exception happens in 
> {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write 
> failure}} I believe that this can't be a connection reset), because my 
> understanding is that we should have a fully handshaked SSL connection at 
> that point in the code.
> Current theory:
>  # "New" node restarts,  "Old" node calls 
> [newSocket|https://githu

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-03 Thread Joseph Lynch (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424802#comment-16424802
 ] 

Joseph Lynch edited comment on CASSANDRA-14358 at 4/4/18 12:49 AM:
---

For what it's worth, we've opened an AWS ticket to find out if this is expected 
behaviour in VPCs (that the network blackholes packets on unknown flows rather 
than resetting them). I have a feeling the answer will be "yea, that's how 
stateful firewalls work, we can't keep flows forever", in which case I think 
Cassandra should probably be resilient to it.

[~djoshi3] I think you've got it now. I'm proposing that we tune Cassandra's 
socket options (opt-in, of course, like most socket options we set) so that 
TCP's parameters match Cassandra's high-availability requirements. For blocking 
sockets, the mechanisms I'm familiar with are connect timeouts 
([unix|http://man7.org/linux/man-pages/man2/connect.2.html], 
[java|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#connect-java.net.SocketAddress-int-]),
 receive timeouts ([SO_RCVTIMEO|https://linux.die.net/man/7/socket], 
[java|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-]),
 and send timeouts, which Java doesn't expose afaict 
([SO_SNDTIMEO|https://linux.die.net/man/7/socket]). For {{O_NONBLOCK}} sockets 
the timeout is typically handled at the event loop level (e.g. I think that's 
what {{WriteTimeoutHandler}} and {{ReadTimeoutHandler}} do in Netty).
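
For illustration, a minimal sketch of bounding both the connect and the 
blocking-read phases with plain {{java.net.Socket}} (not the actual Cassandra 
code path; the peer address and the 10 second values are just placeholders):
{noformat}
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BoundedSocketSketch
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder peer and timeouts, not values taken from Cassandra.
        InetSocketAddress peer = new InetSocketAddress("10.0.0.1", 7000);
        try (Socket socket = new Socket())
        {
            // Bound the connect() syscall: throws SocketTimeoutException after
            // 10s instead of blocking until the OS eventually gives up.
            socket.connect(peer, 10_000);

            // SO_TIMEOUT bounds each blocking read; without it a recv() on a
            // blackholed connection can block indefinitely.
            socket.setSoTimeout(10_000);

            InputStream in = socket.getInputStream();
            int first = in.read(); // SocketTimeoutException if nothing arrives in 10s
            System.out.println("first byte: " + first);
        }
    }
}
{noformat}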

I still need to test this thoroughly, but I believe asking users to set 
{{/proc/sys/net/ipv4/tcp_retries2 = 5}} is a reasonable workaround. I was able 
to easily simulate the exact error from the production logs by doing the 
following:
{noformat}
$ sudo lsof -p $(pgrep -f Cassandra) -n | grep OTHER_IP | grep STORAGE_PORT
java5277 cassandra  64u IPv4 113860  0t0TCP 
LOCAL_IP:38665->OTHER_IP:STORAGE_PORT (ESTABLISHED)
java5277 cassandra  69u IPv4 114797  0t0TCP 
LOCAL_IP:STORAGE_PORT->OTHER_IP:54875 (ESTABLISHED)

$ sudo iptables -A OUTPUT -p tcp -d OTHER_IP --sport 38665 --dport STORAGE_PORT 
-j DROP

$ cqlsh
cqlsh> CONSISTENCY ALL
Consistency level set to ALL.
cqlsh> select * from test_ks.test_cf WHERE key = 'key_that_lives_on_OTHER_IP';
OperationTimedOut: errors={}, last_host=
... timeouts happen for next ~15 minutes

$ tail -20 system.log 
TRACE [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:05,009 
OutboundTcpConnection.java:365 - Socket to /OTHER_IP closed
DEBUG [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:05,010 
OutboundTcpConnection.java:303 - error writing to /OTHER_IP
java.net.SocketException: Connection timed out (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) 
~[na:1.8.0_152]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) 
~[na:1.8.0_152]
at java.net.SocketOutputStream.write(SocketOutputStream.java:155) 
~[na:1.8.0_152]
at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431) 
~[na:1.8.0_152]
at sun.security.ssl.OutputRecord.write(OutputRecord.java:417) 
~[na:1.8.0_152]
at 
sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886) 
~[na:1.8.0_152]
at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857) 
~[na:1.8.0_152]
at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123) 
~[na:1.8.0_152]
at 
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
 ~[lz4-1.2.0.jar:na]
at 
net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:222) 
~[lz4-1.2.0.jar:na]
at 
org.apache.cassandra.io.util.DataOutputStreamPlus.flush(DataOutputStreamPlus.java:55)
 ~[cassandra-2.1.19.jar:2.1.19]
at 
org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:294)
 [cassandra-2.1.19.jar:2.1.19]
at 
org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:222)
 [cassandra-2.1.19.jar:2.1.19]
DEBUG [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:49,867 
OutboundTcpConnection.java:380 - attempting to connect to /OTHER_IP
INFO  [HANDSHAKE-/OTHER_IP] 2018-04-04 00:13:49,916 
OutboundTcpConnection.java:496 - Handshaking version with /OTHER_IP
TRACE [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:49,960 
OutboundTcpConnection.java:453 - Upgrading OutputStream to be compressed

$ cqlsh
cqlsh> CONSISTENCY ALL
Consistency level set to ALL.
cqlsh> select * from test_ks.test_cf WHERE key = 'key_that_lives_on_OTHER_IP';
... results come back and everything is happy

$ sudo lsof -p $(pgrep -f Cassandra) -n | grep OTHER_IP | grep STORAGE_PORT
java5277 cassandra  134u IPv4 113860  0t0TCP 
LOCAL_IP:33417->OTHER_IP:STORAGE_PORT (ESTABLI

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-03 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424033#comment-16424033
 ] 

Jason Brown edited comment on CASSANDRA-14358 at 4/3/18 1:42 PM:
-

For trunk, we already use {{IdleStateHandler}} [in the outbound 
pipeline|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/OutboundHandshakeHandler.java#L196]
 for the "not making progress sending data" case. Take a look at 
{{MessageOutHandler#userEventTriggered}}. iirc, {{WriteTimeoutHandler}} / 
{{ReadTimeoutHandler}} does not fit our use case for messaging; I'd have to 
look again to page it back in.

Further, [~jolynch], I thought your problem was in pre-4.0? There is a problem 
(or, at least, non-optimization) with {{MessagingService.SocketThread}} where 
the single-threaded accept thread handles receiving the first message of the C* 
internode messaging protocol, which may require executing the TLS handshake 
before that can be received (if using internode TLS). Thus a single new 
establishing connection may block the accept thread for 500-1000 ms for a 
cross-datacenter connection. Even worse, if you are using the 
{{Ec2MultiregionSnitch}}, all connections for the local region will be torn 
down and built anew as we switch from the public IP to the region-local private 
IP - more traffic on the accept thread. So one slow connection interaction 
screws the whole accept thread.

I have an internal patch for 3.0 which, when a new socket is accepted, moves 
all processing (internode protocol handshake, TLS handshake) to a separate 
thread (via an executor group). So, if one connection runs into a problem, the 
other incoming connections are not held up. I've been reticent about OSS'ing 
this late in the 3.0/3.11 branches, even though it's not that clever or 
invasive. Let me know if you'd be interested in giving it a shot and I can post 
it. If it's useful/helpful we can consider committing it, as well.
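
If it helps reviewers picture it, here is a minimal sketch of that hand-off 
pattern (accept on one thread, push per-connection handshake work to an 
executor). The {{handleHandshake}} method, pool size, port, and timeout are 
hypothetical illustrations, not taken from the actual patch:
{noformat}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HandshakeOffloadSketch
{
    private static final ExecutorService HANDSHAKE_POOL = Executors.newFixedThreadPool(4);

    public static void main(String[] args) throws IOException
    {
        try (ServerSocket server = new ServerSocket(7000))
        {
            while (true)
            {
                // The accept thread only accepts; a slow TLS/protocol handshake
                // on one connection no longer blocks the other incoming ones.
                Socket socket = server.accept();
                HANDSHAKE_POOL.submit(() -> handleHandshake(socket));
            }
        }
    }

    // Hypothetical stand-in for the internode protocol + TLS handshake work.
    private static void handleHandshake(Socket socket)
    {
        try
        {
            socket.setSoTimeout(10_000); // bound the handshake reads as well
            // ... read protocol header, run TLS handshake, register the connection ...
        }
        catch (IOException e)
        {
            try { socket.close(); } catch (IOException ignored) {}
        }
    }
}
{noformat}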


was (Author: jasobrown):
For trunk, we already use {{IdleStateHandler}} [in the outbound 
pipeline|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/OutboundHandshakeHandler.java#L196]
 for the "not making progress sending data" case. iirc, {{WriteTimeoutHandler}} 
/ {{ReadTimeoutHandler}} does not fit our use case for messaging; I'd have to 
look again to page it back in.

Further, [~jolynch], I thought your problem was in pre-4.0? There is a problem 
(or, at least, non-optimization) with {{MessagingService.SocketThread}} where 
the single-threaded accept thread handles receiving the first message pf the c* 
internode messaging protocol, which may require executing the TLS handshake 
before that can be received (if using internode TLS). Thus a single new 
establishing connection may block the accept thread for 500-1000 ms for a 
cross-datacenter connection. Even worse, if you are using the 
{{Ec2MultiregionSnitch}}, all connections for the local region will be torn 
down and built anew as we switch from the public IP to the region-local private 
IP - more traffic on the accept thread. So one slow connection interaction 
screws the whole accept thread.

I have an internal patch for 3.0 which, when a new socket is accepted, moves 
all processing (internode protocol handshake, TLS handshake) to a separate 
thread (via an executor group). So, if one connection runs into a problem, the 
other incoming connections are not held up. I've been reticent about OSS'ing 
this late in the 3.0/3.11 branches, even though it's not that clever or 
invasive. Let me know if you'd be interested in giving it shot and I can post 
it. If it's useful/helpful we can consider committing it, as well.

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. Smallest cluster I've seen the problem on is 
> 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently 
> reproduce on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
> 

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-02 Thread Joseph Lynch (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423205#comment-16423205
 ] 

Joseph Lynch edited comment on CASSANDRA-14358 at 4/3/18 12:42 AM:
---

[~aweisberg] if the issue is what I think it is (and I definitely need more 
supporting evidence), I think we can fix it in trunk with a Netty 
{{WriteTimeoutHandler}}/{{ReadTimeoutHandler}}, which will throw away the TCP 
connection if recv/write take too long. I think it's reasonable to say that in 
earlier versions we are not going to patch anything beyond connection and recv 
timeouts and just recommend {{/proc/sys/net/ipv4/tcp_retries2 = 5}}. I tested 
the sysctl and it leads to a much shorter partition with the iptables test, like 
20s instead of 10 minutes. I'm testing it out on a large cluster with a rolling 
restart soon.
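
For illustration, a rough sketch of that idea using the stock Netty timeout 
handlers (this is not the trunk pipeline; the bootstrap wiring and the 10 second 
values are placeholders only):
{noformat}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;

public class TimeoutPipelineSketch
{
    public static Bootstrap sketch(NioEventLoopGroup group)
    {
        return new Bootstrap()
            .group(group)
            .channel(NioSocketChannel.class)
            // Bound the connect phase as well (10s, illustrative).
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10_000)
            .handler(new ChannelInitializer<SocketChannel>()
            {
                @Override
                protected void initChannel(SocketChannel ch)
                {
                    // Close the channel if no bytes are read for 10s, or if a
                    // write does not complete within 10s, rather than letting a
                    // blackholed socket sit until the OS finally gives up.
                    ch.pipeline().addLast(new ReadTimeoutHandler(10));
                    ch.pipeline().addLast(new WriteTimeoutHandler(10));
                    // ... application handlers would follow here ...
                }
            });
    }
}
{noformat}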

[~pauloricardomg] I believe that CASSANDRA-9630 is a separate issue (although 
+1 to a 2.1.x backport there since it's so simple and 2.1.x is still used in 
places), because in this case I believe we have an established connection that 
is being blackholed. I'll double check this next time I run the test and ensure 
that the TCP connection state is {{ESTABLISHED}}.


was (Author: jolynch):
[~aweisberg] if the issue is what I think it is (and I definitely need more 
supporting evidence), I think we can fix it in trunk with a netty 
{{WriteTimeoutHandler}}/ {{ReadTimeoutHandler}} which will throw away the tcp 
connection if recv/write take too long. I think it's reasonable to say in 
earlier versions we are not going to patch it and just recommend 
{{/proc/sys/net/ipv4/tcp_retries2 = 5}}. I tested the sysctl and it leads to a 
much shorter partition with the iptables test, like 20s instead of 10 minutes. 
I'm testing it out on a large cluster with a rolling restart soon.

[~pauloricardomg] I believe that CASSANDRA-9630 is a separate issue (although 
+1 to a 2.1.x backport there since it's so simple and 2.1.x is actually still 
used places) because in this case I believe we have an established connection 
that is blackholing. I'll double check this next time I run the test and ensure 
that the tcp connection state is {{ESTABLISHED}}.

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running 
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 
> 4.13 within the same AWS region. The smallest cluster I've seen the problem on 
> is 12 nodes; it reproduces more reliably on 40+ node clusters, and 300 node 
> clusters consistently reproduce it on at least one node.
> So all the connections are SSL and we're connecting on the internal ip 
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>Reporter: Joseph Lynch
>Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during 
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can 
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior 
> to trunk (with the awesome Netty refactor) we do not set socket connect 
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set 
> {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe 
> that this means that we could potentially block forever on {{connect}} or 
> {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I 
> think that the OS will protect us somewhat (and that may be what's causing 
> the eventual timeout) but I think that given the right network conditions our 
> {{OutboundTCPConnection}} threads can just be stuck never making any progress 
> until the OS intervenes.
> I have attached some logs of such a network partition during a rolling 
> restart where an old node in the cluster has a completely foobarred 
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a 
> {{java.net.SocketException: Connection timed out (Write failed)}} and 
> immediately successfully reconnecting. I conclude that the old node is the 
> problem because the new node (the one that restarted) is sending ECHOs to the 
> old node

[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-02 Thread Joseph Lynch (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422969#comment-16422969
 ] 

Joseph Lynch edited comment on CASSANDRA-14358 at 4/2/18 7:56 PM:
--

[~djoshi3] I agree that the issue is probably a bad TCP level connection, but 
I also think it's a bug that Cassandra blocks forever, never making any 
progress and potentially causing an outage for users (until the OS kills the 
connection); we're supposed to be highly available, right? ;). In non-JVM 
languages this would be as simple (I think) as setting {{SO_SNDTIMEO}} to a 
reasonable number like 30s, but I don't see how to do it without JNI in 
Cassandra 2.1.x/3.0.x/3.11.x. If this is the issue (I think we still need more 
data), I also think recommending users tune {{tcp_retries2}} to 8, 5, or even 3 
would be a reasonable workaround if it's fixed in trunk.
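For reference, this is roughly what plain Java sockets do let you bound (the 
helper class and the 30s values below are just illustrative); there is no 
standard {{SO_SNDTIMEO}} equivalent, so a blocked write still depends on the 
kernel's {{tcp_retries2}} behaviour:
{noformat}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public final class BoundedConnectExample
{
    public static Socket openWithTimeouts(String host, int port) throws IOException
    {
        Socket socket = new Socket();
        // Bounds the connect() syscall instead of waiting on kernel SYN retries.
        socket.connect(new InetSocketAddress(host, port), 30_000);
        // SO_TIMEOUT only bounds blocking reads; writes have no such knob here.
        socket.setSoTimeout(30_000);
        return socket;
    }
}
{noformat}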

I haven't tried this in trunk (tbh we can't even build/deploy trunk right now), 
but unfortunately I think it would be vulnerable to the same kind of issue since, afaict, 
{{[NettyFactory|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/NettyFactory.java#L358-L377]}}
 doesn't add any kind of 
{{[WriteTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/WriteTimeoutHandler.html]}}
 or 
{{[ReadTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/ReadTimeoutHandler.html]}}
 which will throw an exception (I believe) if the write/read on the socket 
doesn't complete in a reasonable amount of time. Admittedly I'm very new to the 
netty refactor in 4.0 so I might be missing the part where we set a bunch of 
timeouts somewhere.

I don't want to invest too much time testing trunk until I confirm the type of 
partition that is actually happening, but if our theory is correct it should be 
easy to reproduce on trunk with the above iptables rule (if trunk handles it 
then the tcp connection should get thrown out after ~30s or something 
reasonable and we re-connect).


was (Author: jolynch):
[~djoshi3] I agree that the issue is probably a bad TCP level connection, but 
also I think it's a  bug that Cassandra blocks forever never making any 
progress and potentially causing an outage for users (until the OS kills the 
connection); we're supposed to be highly available right ;). In non JVM 
languages this would be as simple (I think) as setting {{SO_SNDTIMEO}} to a 
reasonable number like 30s, but I don't see how to do it without JNI in 
Cassandra. If this is the issue (I think we still need more data), I also think 
recommending users tune {{tcp_retries2}} to 8, 5 or even 3 would be a 
reasonable workaround if it's fixed in trunk.

I haven't tried in trunk (tbh we can't even build/deploy trunk right now), but 
unfortunately I think it would be vulnerable to same kind of issue since afaict 
{{[NettyFactory|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/NettyFactory.java#L358-L377]}}
 doesn't add any kind of 
{{[WriteTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/WriteTimeoutHandler.html]}}
 or 
{{[ReadTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/ReadTimeoutHandler.html]}}
 which will throw an exception (I believe) if the write/read on the socket 
doesn't complete in a reasonable amount of time. Admittedly I'm very new to the 
netty refactor in 4.0 so I might be missing the part where we set a bunch of 
timeouts somewhere.

I don't want to invest too much time testing trunk until I confirm the type of 
partition that is actually happening, but if our theory is correct it should be 
easy to reproduce on trunk with the above iptables rule (if trunk handles it 
then the tcp connection should get thrown out after ~30s or something 
reasonable and we re-connect).


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-02 Thread Joseph Lynch (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422783#comment-16422783
 ] 

Joseph Lynch edited comment on CASSANDRA-14358 at 4/2/18 5:08 PM:
--

[~djoshi3] yea I think that you're probably right, but I think that this might 
not be a slow consumer so much as a never consumer. Specifically, if there is a 
stateful firewall (e.g., security groups or a VPC) in the way, the network could 
absolutely blackhole packets on a connection that's been reset. My plan for 
this week is to try to catch this happening and run the following 
analysis:
 # Get a netstat view of established connections and their send buffers on both 
sides
 # Slap a tcpdump on both sides to see if {{RESETS}} from the restarted node 
are even getting to the old node (VPC might be swallowing them)
 # If #1 or #2 confirm the theory that we're just in a stuck tcp connection, I 
will try tuning {{tcp_retries2}} down to ~5 and setting {{SO_SNDBUF}} or 
{{wmem_default}} down to a more reasonable number (I think right now it's around 
16 megs) to see if that fixes the issue (see the sketch after this list)
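A minimal sketch of the per-socket version of the buffer tweak in #3 (the 128 KB 
value and the helper are purely illustrative; {{wmem_default}} is the kernel-wide 
lever):
{noformat}
import java.io.IOException;
import java.net.Socket;

public final class SendBufferExample
{
    // With a small SO_SNDBUF the writer blocks after tens of KB instead of
    // silently buffering ~16 MB toward a peer that may never read it.
    public static void shrinkSendBuffer(Socket socket) throws IOException
    {
        socket.setSendBufferSize(128 * 1024); // ideally set before connect()
    }
}
{noformat}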

For what it's worth, Cassandra is _really_ intolerant of these kinds of network 
partitions. For example, if I tell a stateful firewall to just drop all packets 
on one of Cassandra's internode connections with {{iptables -A OUTPUT -p tcp -d 
<dst ip> --sport <src port> --dport <dst port> -j DROP}}, Cassandra 
continues sending packets out on that {{OutboundTcpConnection}} until the 
operating system says "hey that's probably not going to work" 15 minutes later. 
Unfortunately we have two TCP connections so I can't just do a simple heartbeat 
mechanism.


was (Author: jolynch):
[~djoshi3] yea I think that you're probably right, but I think that this might 
not be a slow consumer so much as a never consumer. Specifically if there is a 
stateful firewall (e.g. security groups or vpc) in the way, the network could 
absolutely blackhole packets on a connection that's been reset. My plan for 
this week is I'm going to try to catch this happening and run the following 
analysis:
 # Get a netstat view of established connections and their send buffers on both 
sides
 # Slap a tcpdump on both sides to see if {{RESETS}} from the restarted node 
are even getting to the old node (VPC might be swallowing them)
 # If #1 or #2 confirm the theory that we're just in a stuck tcp connection, I 
will try tuning {{tcp_retries2}} down to ~5, setting {{SO_SNDBUF}} or 
{{wmem_default}} down to a more reasonable number (I think right now it's like 
16 megs) to see if that fixes the issue

For what it's worth, Cassandra is _really_ intolerant of these kinds of network 
partitions. For example if I tell a stateful firewall to just drop all packets 
on one of Cassandra's internode connections with {{iptables -A OUTPUT -p tcp -d 
<dst ip> --sport <src port> --dport <dst port> -j DROP}}, Cassandra 
continues sending packets out on that {{OutboundTcpConnection}} until the 
operating system says "hey that's probably not going to work". Unfortunately we 
have two TCP connections so I can't just do a simple heartbeat mechanism.


[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart

2018-04-02 Thread Joseph Lynch (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422783#comment-16422783
 ] 

Joseph Lynch edited comment on CASSANDRA-14358 at 4/2/18 5:07 PM:
--

[~djoshi3] yea I think that you're probably right, but I think that this might 
not be a slow consumer so much as a never consumer. Specifically if there is a 
stateful firewall (e.g. security groups or vpc) in the way, the network could 
absolutely blackhole packets on a connection that's been reset. My plan for 
this week is I'm going to try to catch this happening and run the following 
analysis:
 # Get a netstat view of established connections and their send buffers on both 
sides
 # Slap a tcpdump on both sides to see if {{RESETS}} from the restarted node 
are even getting to the old node (VPC might be swallowing them)
 # If #1 or #2 confirm the theory that we're just in a stuck tcp connection, I 
will try tuning {{tcp_retries2}} down to ~5, setting {{SO_SNDBUF}} or 
{{wmem_default}} down to a more reasonable number (I think right now it's like 
16 megs) to see if that fixes the issue

For what it's worth, Cassandra is _really_ intolerant of these kinds of network 
partitions. For example if I tell a stateful firewall to just drop all packets 
on one of Cassandra's internode connections with {{iptables -A OUTPUT -p tcp -d 
<dst ip> --sport <src port> --dport <dst port> -j DROP}}, Cassandra 
continues sending packets out on that {{OutboundTcpConnection}} until the 
operating system says "hey that's probably not going to work". Unfortunately we 
have two TCP connections so I can't just do a simple heartbeat mechanism.


was (Author: jolynch):
[~djoshi3] yea I think that you're probably right, but I think that this might 
not be a slow consumer so much as a never consumer. Specifically if there is a 
stateful firewall (e.g. security groups or vpc) in the way, the network could 
absolutely blackhole packets on a connection that's been reset. My plan for 
this week is I'm going to try to catch this happening and run the following 
analysis:
# Get a netstat view of established connections and their send buffers on both 
sides
# Slap a tcpdump on both sides to see if {{RESETS}} from the restarted node are 
even getting to the old node (VPC might be swallowing them)
# If #1 or #2 confirm the theory that we're just in a stuck tcp connection, I 
will try tuning {{tcp_retries2}} down to ~5, setting {{SO_SNDBUF}} or 
{{wmem_default}} down to a more reasonable number (I think right now it's like 
16 megs) to see if that fixes the issue  

For what it's worth, Cassandra is _really_ intolerant of these kinds of network 
partitions. For example if I tell a stateful firewall to just drop all packets 
on one of Cassandra's internode connections with {{iptables -A OUTPUT -p tcp -d 
<dst ip> --sport <src port> --dport <dst port> -j DROP}}, Cassandra 
continues sending packets out on that {{OutboundTcpConnection}} until the 
operating system says "hey that's probably not going to work". Unfortunately we 
have two TCP connections so I can't just do a simple heartbeat mechanism.
