[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589562#comment-16589562 ] Joseph Lynch edited comment on CASSANDRA-14358 at 8/23/18 4:49 PM: --- [~khuizhang] yeah, that looks similar to the half-open keepalive connections. Did you try the kernel workaround, and did it help? [~aweisberg] [~jasobrown] I've got a mitigation patch so that Cassandra trunk at least heals the half-open partitions faster. Please take a look if you have bandwidth for review. While testing the reconnection behavior I ran into CASSANDRA-14503 because the retry future was getting clobbered by another message, so I couldn't verify that we stop retrying after the connections are killed (right now they just retry on every message). ||trunk|| |[patch|https://github.com/apache/cassandra/compare/trunk...jolynch:CASSANDRA-14358]| |[unit tests|https://circleci.com/gh/jolynch/workflows/cassandra/tree/CASSANDRA-14358]| This patch makes the internode connection timeout (2s) and the internode TCP user timeout (30s) configurable. The timeouts are settable via JMX (and nodetool). I'm marking this as patch available since I think the operating-system workaround is probably OK for previous releases; if it's not, just let me know and I can try to figure out how to fix it for those as well. I didn't add any new tests because the only way I'm aware of to reproduce this behavior is by using iptables to blackhole traffic. 
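The patch itself lives in Cassandra's Java messaging layer, but the two knobs it adds map onto standard Linux socket options. A minimal Python sketch of the equivalent settings, assuming Linux (the 2 s connect timeout and 30 s user timeout mirror the defaults described above; the helper name is mine):

```python
import socket

def make_internode_socket(connect_timeout=2.0, user_timeout_ms=30_000):
    """Sketch of the socket options the patch configures on internode links."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Bound the TCP handshake instead of blocking until the OS gives up.
    s.settimeout(connect_timeout)
    # TCP_USER_TIMEOUT (Linux-only): drop the connection if written data stays
    # unacknowledged for this long, instead of waiting out ~tcp_retries2 probes.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, user_timeout_ms)
    # Keepalive so an idle half-open connection is eventually probed too.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return s

sock = make_internode_socket()
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT))
```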
I've been testing with a ccm cluster by trying to block just the small message channel (if gossip is blocked then the failure detector convicts it) and doing a read at ALL: {noformat} $ netstat -on | grep 7000 | grep 127.0.0.3:7000 | grep -v "0 127.0.0.3" tcp 0 0 127.0.0.1:55604 127.0.0.3:7000 ESTABLISHED keepalive (4093.37/0/0) tcp 0 0 127.0.0.1:55610 127.0.0.3:7000 ESTABLISHED keepalive (4093.53/0/0) tcp 0 0 127.0.0.1:58080 127.0.0.3:7000 ESTABLISHED keepalive (6601.96/0/0) # Try to drop just the small message channel $ sudo iptables -A OUTPUT -p tcp -d 127.0.0.3 --dport 7000 --sport 58080 -j DROP {noformat} Then I check that we properly reconnect faster than 15 minutes. On an unpatched trunk I can watch the small message channel just sit there probing for 15 retries (the default tcp_retries2 value; note that this netstat is from a different run than the previous one): {noformat} netstat -on | grep 7000 | grep 127.0.0.3:7000 | grep -v "0 127.0.0.3" tcp 0 0 127.0.0.1:34532 127.0.0.3:7000 ESTABLISHED keepalive (6823.99/0/0) tcp 0 0 127.0.0.1:34564 127.0.0.3:7000 ESTABLISHED keepalive (6840.36/0/0) tcp 0 808 127.0.0.1:34544 127.0.0.3:7000 ESTABLISHED probe (108.32/0/10) tcp 0 0 127.0.0.1:34554 127.0.0.3:7000 ESTABLISHED keepalive (6840.26/0/0) tcp 0 0 127.0.0.1:34540 127.0.0.3:7000 ESTABLISHED keepalive (6831.27/0/0) tcp 0 0 127.0.0.1:34534 127.0.0.3:7000 ESTABLISHED keepalive (6828.96/0/0) {noformat} And then with my patch it gets killed after about 5 retries instead of waiting the full 15. If you have ideas for how to unit test it I'm open to suggestions, of course. 
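For intuition on why a blackholed connection takes roughly 15 minutes to die with the default tcp_retries2=15: Linux doubles the retransmission timeout on each retry, starting near TCP_RTO_MIN and capping at TCP_RTO_MAX. A back-of-envelope sketch (the 200 ms floor and 120 s cap are the usual kernel constants and assume a near-zero measured RTT, so this is an estimate, not what the kernel literally computes):

```python
def retransmit_seconds(retries, rto_min=0.2, rto_max=120.0):
    """Rough total time Linux keeps retransmitting unacked data before
    declaring the connection dead: the RTO doubles per retry, capped at rto_max."""
    total, rto = 0.0, rto_min
    for _ in range(retries):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(round(retransmit_seconds(15) / 60, 1))  # ~13 minutes for tcp_retries2=15
print(retransmit_seconds(5))                  # only a few seconds for 5 retries
```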
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482919#comment-16482919 ] Joseph Lynch edited comment on CASSANDRA-14358 at 5/21/18 7:27 PM: --- [~alienth] that is interesting and thank you for digging so deeply! If I understand correctly during a {{drain}} the other servers are responsible for noticing the change and closing their connections within the {{[shutdown_announce_in_ms|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L1497]}} period in [response|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/GossipShutdownVerbHandler.java#L37] to the {{GOSSIP_SHUTDOWN}} gossip state, and then the {{[markAsShutdown|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L363-L373]}} method marks it down and forcibly convicts it. 
I believe that the TCP connections get closed via the {{StorageService}}'s {{[onDead|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L2514]}} method, which calls {{[MessagingService::reset|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/MessagingService.java#L505]}}, which calls {{[OutboundTcpConnection::closeSocket|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L80]}}, which [enqueues a sentinel|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L210] into the backlog, and then the {{[OutboundTcpConnection::run|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L253]}} method is actually supposed to close it. The {{drainedMessages}} queue is a local reference though, so the backlog could get something that was enqueued before the {{CLOSE_SENTINEL}} as well as after it. This seems very racy to me; in particular the reconnection logic might race with the closing logic from what I can tell, as we have a 2 second window between when the clients start closing and when the server actually stops accepting new connections (because it closes the listeners). Non-stateful networks would surface the RST in the {{writeConnected}} method, but AWS is like "yea that machine isn't allowed to talk to that one" and just blackholes the RSTs... I wonder if I can reproduce this by increasing that window significantly and just sending lots of traffic. 
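To make the sentinel race concrete, here is a toy model (deliberately not Cassandra's actual classes) of a backlog queue where a message enqueued after the {{CLOSE_SENTINEL}} survives the close and becomes pending work that can trigger a reconnect:

```python
from queue import Queue

CLOSE_SENTINEL = object()  # stand-in for OutboundTcpConnection's close sentinel

backlog = Queue()
backlog.put("msg-1")
backlog.put(CLOSE_SENTINEL)   # closeSocket() enqueues the sentinel...
backlog.put("msg-2")          # ...but another thread enqueues right behind it

# The run() loop drains into a local batch; it closes when it sees the
# sentinel, yet anything enqueued after the sentinel is still live work.
drained, closed = [], False
while not backlog.empty():
    item = backlog.get()
    if item is CLOSE_SENTINEL:
        closed = True
        continue
    drained.append(item)

print(closed, drained)  # True ['msg-1', 'msg-2'] -- msg-2 outlived the close
```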
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464622#comment-16464622 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 4:40 AM: -- Why on earth didn't the failure detector on {{.40}} see anything for 20 minutes? Possibly due to blocking on an {{OutboundTcpConnection}} forever? > OutboundTcpConnection can hang for many minutes when nodes restart > -- > > Key: CASSANDRA-14358 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14358 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running > with {{internode_encryption: all}} and the EC2 multi region snitch on Linux > 4.13 within the same AWS region. Smallest cluster I've seen the problem on is > 12 nodes; it reproduces more reliably on 40+ nodes, and 300 node clusters consistently > reproduce on at least one node in the cluster. > So all the connections are SSL and we're connecting on the internal ip > addresses (not the public endpoint ones). > Potentially relevant sysctls: > {noformat} > /proc/sys/net/ipv4/tcp_syn_retries = 2 > /proc/sys/net/ipv4/tcp_synack_retries = 5 > /proc/sys/net/ipv4/tcp_keepalive_time = 7200 > /proc/sys/net/ipv4/tcp_keepalive_probes = 9 > /proc/sys/net/ipv4/tcp_keepalive_intvl = 75 > /proc/sys/net/ipv4/tcp_retries2 = 15 > {noformat} > Reporter: Joseph Lynch > Assignee: Joseph Lynch > Priority: Major > Attachments: 10 Minute Partition.pdf > > > I've been trying to debug nodes not being able to see each other during > longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can > contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and > 2.1.x clusters for us. I think I finally have a lead. 
It appears that prior > to trunk (with the awesome Netty refactor) we do not set socket connect > timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set > {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe > that this means that we could potentially block forever on {{connect}} or > {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I > think that the OS will protect us somewhat (and that may be what's causing > the eventual timeout) but I think that given the right network conditions our > {{OutboundTCPConnection}} threads can just be stuck never making any progress > until the OS intervenes. > I have attached some logs of such a network partition during a rolling > restart where an old node in the cluster has a completely foobarred > {{OutboundTcpConnection}} for ~10 minutes before finally getting a > {{java.net.SocketException: Connection timed out (Write failed)}} and > immediately successfully reconnecting. I conclude that the old node is the > problem because the new node (the one that restarted) is sending ECHOs to the > old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new > node's ECHOs, but the new node is never getting the ECHOs. This appears, to > me, to indicate that the old node's {{OutboundTcpConnection}} thread is just > stuck and can't make any forward progress. By the time we could notice this > and slap TRACE logging on, the only thing we see is ~10 minutes later a > {{SocketException}} inside {{writeConnected}}'s flush and an immediate > recovery. It is interesting to me that the exception happens in > {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write > failure}} I believe that this can't be a connection reset), because my > understanding is that we should have a fully handshaked SSL connection at > that point in the code. 
> Current theory: > # "New" node restarts, "Old" node calls > [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433] > # Old node starts [creating a > new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141] > SSL socket > # SSLSocket calls > [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98], > which conveniently calls connect with a default timeout of "forever". We > could hang here forever until the OS kills us. > # If we continue, we get to > [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassa
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464596#comment-16464596 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 4:24 AM: -- [~jasobrown] {{.88}} was down for about 10 minutes, and was shut down via {{nodetool drain}}, followed by stopping the service gracefully. {{.40}} has no gossiper logs showing {{.88}} going down, until the socket to {{.88}} is killed about 20 minutes later. {{.88}} was shut down at 1602 and came back up at 1612: {{INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 StorageService.java:1449 - DRAINING: starting drain process}} {{INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 HintsService.java:220 - Paused hints dispatch}} {{INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 ThriftServer.java:139 - Stop listening to thrift clients}} {{INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 Server.java:176 - Stop listening for CQL clients}} {{INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 Gossiper.java:1540 - Announcing shutdown}} {{INFO [main] 2018-05-04 16:12:39,746 StorageService.java:1449 - JOINING: Finish joining ring}} And {{.40}} has no logs of this happening, and seemingly doesn't see it as down until ~20 minutes later: {{INFO [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - InetAddress /10.0.107.88 is now DOWN}} {{INFO [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}} {{INFO [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88}} {{INFO [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node /10.0.107.88 has restarted, now UP}} {{INFO [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node /10.0.107.88 state jump to NORMAL}} That is... bizarre. 
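For reference, with the keepalive sysctls quoted in this ticket's Environment section, keepalive alone would not notice a dead peer for over two hours, which suggests it cannot be what killed the socket after ~20 minutes. A quick back-of-envelope using those values:

```python
# Keepalive sysctls quoted in the ticket's Environment section.
tcp_keepalive_time = 7200    # seconds of idle before the first probe
tcp_keepalive_probes = 9     # unanswered probes before the connection is dropped
tcp_keepalive_intvl = 75     # seconds between probes

# Idle period plus all probe intervals before keepalive finally gives up.
detection_seconds = tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
print(detection_seconds, "seconds, i.e. about",
      round(detection_seconds / 3600, 2), "hours")  # 7875 s, ~2.19 hours
```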
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464604#comment-16464604 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:47 AM: -- Extra note: At the same second that {{.88}}'s gossiper announced shutdown, there was a newly established socket from {{.40}} to {{.88}}: {{Fri May 4 16:02:15 PDT 2018}} {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}} This is the same socket that would become CLOSE-WAIT only seconds later, and remain that way for the following 18 minutes: {{Fri May 4 16:02:20 PDT 2018}} {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}} 
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464604#comment-16464604 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:37 AM: -- Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a newly established socket from {{.40}} to {{.88}}: {{Fri May 4 16:02:15 PDT 2018}} {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}} This is the same socket that would become CLOSE-WAIT only seconds later, and remain that way for the following 18 minutes: {{Fri May 4 16:02:20 PDT 2018}} {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}} was (Author: alienth): Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a newly established socket from {{.40}} to {{.88}}: {{Fri May 4 16:02:15 PDT 201}} {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}} This is the same socket that would become CLOSE-WAIT only seconds later, and remain that way for the following 18 minutes: {{Fri May 4 16:02:20 PDT 2018}} {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->}} > OutboundTcpConnection can hang for many minutes when nodes restart > -- > > Key: CASSANDRA-14358 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14358 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running > with {{internode_encryption: all}} and the EC2 multi region snitch on Linux > 4.13 within the same AWS region. Smallest cluster I've seen the problem on is > 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently > reproduce on at least one node in the cluster. 
> So all the connections are SSL and we're connecting on the internal ip > addresses (not the public endpoint ones). > Potentially relevant sysctls: > {noformat} > /proc/sys/net/ipv4/tcp_syn_retries = 2 > /proc/sys/net/ipv4/tcp_synack_retries = 5 > /proc/sys/net/ipv4/tcp_keepalive_time = 7200 > /proc/sys/net/ipv4/tcp_keepalive_probes = 9 > /proc/sys/net/ipv4/tcp_keepalive_intvl = 75 > /proc/sys/net/ipv4/tcp_retries2 = 15 > {noformat} >Reporter: Joseph Lynch >Assignee: Joseph Lynch >Priority: Major > Attachments: 10 Minute Partition.pdf > > > I've been trying to debug nodes not being able to see each other during > longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can > contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and > 2.1.x clusters for us. I think I finally have a lead. It appears that prior > to trunk (with the awesome Netty refactor) we do not set socket connect > timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set > {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe > that this means that we could potentially block forever on {{connect}} or > {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I > think that the OS will protect us somewhat (and that may be what's causing > the eventual timeout) but I think that given the right network conditions our > {{OutboundTCPConnection}} threads can just be stuck never making any progress > until the OS intervenes. > I have attached some logs of such a network partition during a rolling > restart where an old node in the cluster has a completely foobarred > {{OutboundTcpConnection}} for ~10 minutes before finally getting a > {{java.net.SocketException: Connection timed out (Write failed)}} and > immediately successfully reconnecting. 
I conclude that the old node is the > problem because the new node (the one that restarted) is sending ECHOs to the > old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new > node's ECHOs, but the new node is never getting the ECHO's. This appears, to > me, to indicate that the old node's {{OutboundTcpConnection}} thread is just > stuck and can't make any forward progress. By the time we could notice this > and slap TRACE logging on, the only thing we see is ~10 minutes later a > {{SocketException}} inside {{writeConnected}}'s flush and an immediate > recovery. It is interesting to me that the exception happens in > {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write > failure}} I believe that this can't be a connection reset), because my > understanding is that we should have a fully handshaked SSL connection at > that point in the code. > Current theory:
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464604#comment-16464604 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:36 AM:
--

Extra note: One second after {{.88}}'s gossiper announced shutdown, there was a newly established socket from {{.40}} to {{.88}}:

{noformat}
Fri May  4 16:02:15 PDT 2018
tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->
{noformat}

This is the same socket that would become CLOSE-WAIT only seconds later, and remain that way for the following 18 minutes:

{noformat}
Fri May  4 16:02:20 PDT 2018
tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min54sec,0) uid:115 ino:5913893 sk:216e -->
{noformat}

> OutboundTcpConnection can hang for many minutes when nodes restart
> --
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 4.13 within the same AWS region. The smallest cluster I've seen the problem on is 12 nodes; it reproduces more reliably on 40+ nodes, and 300 node clusters consistently reproduce it on at least one node in the cluster.
> So all the connections are SSL and we're connecting on the internal IP addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Major
> Attachments: 10 Minute Partition.pdf
>
> I've been trying to debug nodes not being able to see each other during longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x, which can contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 2.1.x clusters for us. I think I finally have a lead. It appears that prior to trunk (with the awesome Netty refactor) we do not set socket connect timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x), nor do we set {{SO_TIMEOUT}}, as far as I can tell, on outbound connections either. I believe that this means that we could potentially block forever on {{connect}} or {{recv}} syscalls, and we could block forever on the SSL handshake as well. I think that the OS will protect us somewhat (and that may be what's causing the eventual timeout), but I think that given the right network conditions our {{OutboundTCPConnection}} threads can just be stuck, never making any progress, until the OS intervenes.
> I have attached some logs of such a network partition during a rolling restart where an old node in the cluster has a completely foobarred {{OutboundTcpConnection}} for ~10 minutes before finally getting a {{java.net.SocketException: Connection timed out (Write failed)}} and immediately successfully reconnecting. I conclude that the old node is the problem because the new node (the one that restarted) is sending ECHOs to the old node, and the old node is sending ECHOs and REQUEST_RESPONSEs to the new node's ECHOs, but the new node is never getting the ECHOs. This appears, to me, to indicate that the old node's {{OutboundTcpConnection}} thread is just stuck and can't make any forward progress. By the time we could notice this and slap TRACE logging on, the only thing we see is, ~10 minutes later, a {{SocketException}} inside {{writeConnected}}'s flush and an immediate recovery. It is interesting to me that the exception happens in {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write failed}} I believe that this can't be a connection reset), because my understanding is that we should have a fully handshaked SSL connection at that point in the code.
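The missing timeouts the description talks about can be illustrated with a short sketch. This is not Cassandra's actual code; the class and method names are illustrative, using only plain {{java.net.Socket}}:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class TimeoutSketch {
    // Connect with an explicit timeout instead of blocking indefinitely in the
    // connect() syscall, and set SO_TIMEOUT so subsequent blocking reads
    // (including the reads driving an SSL handshake) are bounded too.
    static Socket connectWithTimeouts(InetSocketAddress addr,
                                      int connectTimeoutMs,
                                      int readTimeoutMs) throws IOException {
        Socket s = new Socket();
        s.connect(addr, connectTimeoutMs); // SocketTimeoutException instead of an unbounded hang
        s.setSoTimeout(readTimeoutMs);     // read() now fails fast instead of blocking forever
        return s;
    }

    public static void main(String[] args) throws IOException {
        // Demo against a local listener: a 2 s connect budget and a 30 s read budget.
        try (ServerSocket server = new ServerSocket(0)) {
            InetSocketAddress addr =
                new InetSocketAddress("127.0.0.1", server.getLocalPort());
            try (Socket s = connectWithTimeouts(addr, 2000, 30000)) {
                System.out.println(s.isConnected()); // prints true
            }
        }
    }
}
```

Without the {{connect(addr, timeout)}} overload and {{setSoTimeout}}, both calls fall back to OS behavior, which is exactly the multi-minute kernel-driven timeout described above.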
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464596#comment-16464596 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/5/18 3:27 AM:
--

[~jasobrown] {{.88}} was down for about 10 minutes, and was shut down via {{nodetool drain}}, followed by stopping the service gracefully.

{{.40}} has no gossiper logs showing {{.88}} going down until the socket to {{.88}} is killed about 20 minutes later. {{.88}} was shut down at 16:02 and came back up at 16:12:

{noformat}
INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,055 StorageService.java:1449 - DRAINING: starting drain process
INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,057 HintsService.java:220 - Paused hints dispatch
INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,058 ThriftServer.java:139 - Stop listening to thrift clients
INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,077 Server.java:176 - Stop listening for CQL clients
INFO [RMI TCP Connection(22)-127.0.0.1] 2018-05-04 16:02:14,078 Gossiper.java:1540 - Announcing shutdown
INFO [main] 2018-05-04 16:12:39,746 StorageService.java:1449 - JOINING: Finish joining ring
{noformat}

And {{.40}} has no logs of this happening, and seemingly doesn't see it as down until ~20 minutes later:

{noformat}
INFO [GossipStage:1] 2018-05-01 16:20:36,680 Gossiper.java:1034 - InetAddress /10.0.107.88 is now DOWN
INFO [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,693 OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88
INFO [HANDSHAKE-/10.0.107.88] 2018-05-01 16:20:36,822 OutboundTcpConnection.java:560 - Handshaking version with /10.0.107.88
INFO [GossipStage:1] 2018-05-01 16:20:56,323 Gossiper.java:1053 - Node /10.0.107.88 has restarted, now UP
INFO [GossipStage:1] 2018-05-01 16:20:56,325 StorageService.java:2292 - Node /10.0.107.88 state jump to NORMAL
{noformat}

That is... bizarre.
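The ~15-20 minute detection delay reported above lines up with the {{tcp_retries2 = 15}} sysctl quoted in the issue description: Linux retransmits unacknowledged data with exponential backoff before declaring the connection dead. A back-of-the-envelope sketch, assuming the documented 200 ms minimum RTO and 120 s RTO cap (the class name is illustrative):

```java
public class Retries2Budget {
    // Linux retransmits with exponential backoff: the retransmission timeout
    // starts no lower than ~200 ms (TCP_RTO_MIN), doubles on each retry, and
    // is capped at 120 s (TCP_RTO_MAX). With tcp_retries2 = 15, the connection
    // is declared dead only after roughly the sum of 16 such intervals.
    static double timeoutSeconds(int retries2) {
        double rto = 0.2;    // assume the minimum RTO for a worst-case budget
        double total = 0.0;
        for (int i = 0; i <= retries2; i++) {
            total += rto;
            rto = Math.min(rto * 2, 120.0);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(timeoutSeconds(15)); // ~924.6 s, i.e. roughly 15 minutes
    }
}
```

That ~924.6 s budget matches the kernel documentation's figure for the default {{tcp_retries2}}, and is the same order as the hangs observed in this thread; actual wall-clock time varies with the measured RTT.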
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464566#comment-16464566 ] Jason Brown edited comment on CASSANDRA-14358 at 5/5/18 2:06 AM:
-

[~alienth] Nice detective work. It looks like {{10.0.161.40}} hasn't closed the previous sockets/connections. If you still have the logs of {{10.0.161.40}} handy, can you see if, during and after the bounce, there are log statements about {{10.0.107.88}} being down or alive (or restarted)? I'm not looking at the code to check the exact log messages, but typically they are emitted by {{Gossiper}}.

Also, how long did it take {{10.0.107.88}} to bounce? If it went down and came back up fast enough, it's possible the failure detector on {{10.0.161.40}} didn't mark it as down. How did you terminate {{10.0.107.88}}? A normal shutdown, or {{kill -9}}?
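The keepalive timers visible in the {{ss}} output above are counting down from the host-wide {{tcp_keepalive_time = 7200}} quoted in the description, so the first probe on an idle socket would go out only after two idle hours, far too late to rescue a half-open socket during a rolling restart. As a sketch of the per-socket alternative (JDK 11+ extended options; the helper name is illustrative and this is not Cassandra's code):

```java
import java.io.IOException;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

public class KeepaliveSketch {
    // Lower the keepalive schedule for one socket without touching the
    // host-wide /proc/sys/net/ipv4/tcp_keepalive_* sysctls.
    static void tuneKeepalive(Socket s, int idleSecs, int intervalSecs, int probes)
            throws IOException {
        s.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        // The extended options are platform-dependent (Linux/macOS); guard them.
        if (s.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            s.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSecs);         // idle time before first probe
            s.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSecs); // gap between probes
            s.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probes);          // unanswered probes before reset
        }
    }
}
```

Note that keepalive only fires on an *idle* connection; once unacknowledged data is queued (the growing send-q below), the retransmission timer and {{tcp_retries2}} govern instead.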
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464495#comment-16464495 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 11:44 PM: --- Captured the socket state for before, during, and after the restart, from the POV of a node which the restarted node sees as down: Before, when nothing has been done: {{Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port}} {{tcp ESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}} After 10.0.107.88 has been restarted. Note the {{1}} in in the recv-q: {{tcp CLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000 timer:(keepalive,4min14sec,0) uid:115 ino:5913893 sk:216e -->}} When 10.0.107.88 comes back up and 10.0.161.40 tries to respond to the EchoMessage, using the previous socket which has been in CLOSE-WAIT. You can see the outbounds piling up in the send-q: {{tcp CLOSE-WAIT 1 36527 10.0.161.40:59739 10.0.107.88:7000 timer:(on,1.932ms,4) uid:115 ino:5913893 sk:216e -->}} was (Author: alienth): Captured the socket state for before, during, and after the restart, from the POV of a node which the restarted node sees as down: Before, when nothing has been done: {{Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port}} {{tcpESTAB 0 0 10.0.161.40:59739 10.0.107.88:7000timer:(keepalive,4min59sec,0) uid:115 ino:5913893 sk:216e <->}} After 10.0.107.88 has been restarted. Note the {{1}} in in the recv-q: {{tcpCLOSE-WAIT 1 0 10.0.161.40:59739 10.0.107.88:7000timer:(keepalive,4min14sec,0) uid:115 ino:5913893 sk:216e -->}} When 10.0.107.88 comes back up and 10.0.161.40 tries to respond to the EchoMessage, using the previous socket which has been in CLOSE-WAIT. 
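The CLOSE-WAIT half of this picture is easy to reproduce in isolation. A minimal sketch (in Python rather than Cassandra's Java, purely for illustration): once the peer closes, the surviving socket sits in CLOSE_WAIT, reads return EOF, and writes still appear to succeed locally because the bytes just queue in the send-q, which is exactly the pattern in the captures above.

```python
import socket

def half_close_demo() -> tuple[bytes, int]:
    """Close one end of a local TCP connection and show the survivor
    (now in CLOSE_WAIT) still reads EOF and accepts writes."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    peer, _ = srv.accept()

    cli.close()                         # FIN sent; `peer` enters CLOSE_WAIT
    eof = peer.recv(16)                 # b'': EOF observed, socket still open
    sent = peer.send(b"echo-response")  # "succeeds": bytes queue in send-q
    peer.close()
    srv.close()
    return eof, sent

print(half_close_demo())  # -> (b'', 13)
```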
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464368#comment-16464368 ] Jason Harvey edited comment on CASSANDRA-14358 at 5/4/18 8:53 PM: -- Reproduced this behaviour several times now, and in all cases the socket that Cassandra is trying to send outbound on has been stuck in `CLOSE_WAIT`. I think that suggests the more common case might be AWS no longer tracking the flow for a FIN'd connection, with an ACK dropped somewhere resulting in a node never fully closing the connection? Note that I'm seeing this on Cassandra 3.11.2, which has the patches from CASSANDRA-9630.

{{tcp CLOSE-WAIT 1 322683 10.0.161.40:18254 10.0.109.39:7000 timer:(on,1min54sec,13) uid:115 ino:5837280 sk:1a1 -->}}
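In these captures the {{timer:(on,1min54sec,13)}} field carries the signal: the last element is the kernel's retransmission counter for the stuck send-q. A small helper for eyeballing {{ss -o}} lines (hypothetical tooling for this investigation, not part of Cassandra):

```python
import re
from typing import Optional

def retransmit_count(ss_line: str) -> Optional[int]:
    """Pull the retry counter out of an `ss -o` timer field,
    e.g. 'timer:(on,1min54sec,13)' -> 13."""
    m = re.search(r"timer:\((?:on|keepalive|persist),[^,]+,(\d+)\)", ss_line)
    return int(m.group(1)) if m else None

line = ("tcp CLOSE-WAIT 1 322683 10.0.161.40:18254 10.0.109.39:7000 "
        "timer:(on,1min54sec,13) uid:115 ino:5837280 sk:1a1")
print(retransmit_count(line))  # -> 13
```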
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433171#comment-16433171 ] Ariel Weisberg edited comment on CASSANDRA-14358 at 4/10/18 11:20 PM: -- 30 seconds (effectively 60) and a hot prop sounds excellent. was (Author: aweisberg): 30 seconds and a hot prop sounds excellent. > OutboundTcpConnection can hang for many minutes when nodes restart > -- > > Key: CASSANDRA-14358 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14358 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running > with {{internode_encryption: all}} and the EC2 multi region snitch on Linux > 4.13 within the same AWS region. Smallest cluster I've seen the problem on is > 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently > reproduce on at least one node in the cluster. > So all the connections are SSL and we're connecting on the internal ip > addresses (not the public endpoint ones). > Potentially relevant sysctls: > {noformat} > /proc/sys/net/ipv4/tcp_syn_retries = 2 > /proc/sys/net/ipv4/tcp_synack_retries = 5 > /proc/sys/net/ipv4/tcp_keepalive_time = 7200 > /proc/sys/net/ipv4/tcp_keepalive_probes = 9 > /proc/sys/net/ipv4/tcp_keepalive_intvl = 75 > /proc/sys/net/ipv4/tcp_retries2 = 15 > {noformat} >Reporter: Joseph Lynch >Assignee: Joseph Lynch >Priority: Major > Attachments: 10 Minute Partition.pdf > > > I've been trying to debug nodes not being able to see each other during > longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can > contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and > 2.1.x clusters for us. I think I finally have a lead. 
It appears that prior > to trunk (with the awesome Netty refactor) we do not set socket connect > timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set > {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe > that this means that we could potentially block forever on {{connect}} or > {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I > think that the OS will protect us somewhat (and that may be what's causing > the eventual timeout) but I think that given the right network conditions our > {{OutboundTCPConnection}} threads can just be stuck never making any progress > until the OS intervenes. > I have attached some logs of such a network partition during a rolling > restart where an old node in the cluster has a completely foobarred > {{OutboundTcpConnection}} for ~10 minutes before finally getting a > {{java.net.SocketException: Connection timed out (Write failed)}} and > immediately successfully reconnecting. I conclude that the old node is the > problem because the new node (the one that restarted) is sending ECHOs to the > old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new > node's ECHOs, but the new node is never getting the ECHO's. This appears, to > me, to indicate that the old node's {{OutboundTcpConnection}} thread is just > stuck and can't make any forward progress. By the time we could notice this > and slap TRACE logging on, the only thing we see is ~10 minutes later a > {{SocketException}} inside {{writeConnected}}'s flush and an immediate > recovery. It is interesting to me that the exception happens in > {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write > failure}} I believe that this can't be a connection reset), because my > understanding is that we should have a fully handshaked SSL connection at > that point in the code. 
> Current theory: > # "New" node restarts, "Old" node calls > [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433] > # Old node starts [creating a > new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141] > SSL socket > # SSLSocket calls > [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98], > which conveniently calls connect with a default timeout of "forever". We > could hang here forever until the OS kills us. > # If we continue, we get to > [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263] > which eventually calls > [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87
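Step 3 above is the crux: a connect with no deadline can park the {{OutboundTcpConnection}} thread indefinitely. A sketch of the two bounds discussed in this ticket (a connect timeout, plus a cap on how long unacked data may sit in the send-q), shown in Python for brevity rather than Cassandra's Java; the 2s/30s values mirror the numbers floated here, and {{TCP_USER_TIMEOUT}} is Linux-only:

```python
import socket

CONNECT_TIMEOUT_S = 2.0        # bound connect()/handshake instead of "forever"
TCP_USER_TIMEOUT_MS = 30_000   # bound time unacked data may sit in the send-q

def bounded_socket() -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(CONNECT_TIMEOUT_S)          # applies to connect() and recv()
    if hasattr(socket, "TCP_USER_TIMEOUT"):  # Linux 2.6.37+ only
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                     TCP_USER_TIMEOUT_MS)
    return s

s = bounded_socket()
print(s.gettimeout())  # -> 2.0
s.close()
```

With both bounds set, a blackholed peer fails the write path in tens of seconds instead of waiting for the OS to intervene.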
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431491#comment-16431491 ] Ariel Weisberg edited comment on CASSANDRA-14358 at 4/9/18 11:27 PM: - Nice find! The numbers you picked seem pretty aggressive. I was expecting in the 30-60 second range before giving up. If the network conditions are unreliable but working you might drop connections and then not be able to recreate them quickly due to packet loss. Granted things are probably mostly unavailable anyways in that scenario, but you don't want them looping too tightly trying to recreate connections. The last question is if this needs to be a hot property or not. was (Author: aweisberg): Nice find! Seems like something we should set this and make configurable in the YAML. The numbers you picked seem pretty aggressive. I was expecting in the 30-60 second range before giving up. If the network conditions are unreliable but working you might drop connections and then not be able to recreate them quickly due to packet loss. Granted things are probably mostly unavailable anyways in that scenario, but you don't want them looping too tightly trying to recreate connections. The last question is if this needs to be a hot property or not.
> So all the connections are SSL and we're connecting on the internal ip > addresses (not the public endpoint ones). > Potentially relevant sysctls: > {noformat} > /proc/sys/net/ipv4/tcp_syn_retries = 2 > /proc/sys/net/ipv4/tcp_synack_retries = 5 > /proc/sys/net/ipv4/tcp_keepalive_time = 7200 > /proc/sys/net/ipv4/tcp_keepalive_probes = 9 > /proc/sys/net/ipv4/tcp_keepalive_intvl = 75 > /proc/sys/net/ipv4/tcp_retries2 = 15 > {noformat} >Reporter: Joseph Lynch >Assignee: Joseph Lynch >Priority: Major > Attachments: 10 Minute Partition.pdf > > > I've been trying to debug nodes not being able to see each other during > longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can > contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and > 2.1.x clusters for us. I think I finally have a lead. It appears that prior > to trunk (with the awesome Netty refactor) we do not set socket connect > timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set > {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe > that this means that we could potentially block forever on {{connect}} or > {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I > think that the OS will protect us somewhat (and that may be what's causing > the eventual timeout) but I think that given the right network conditions our > {{OutboundTCPConnection}} threads can just be stuck never making any progress > until the OS intervenes. > I have attached some logs of such a network partition during a rolling > restart where an old node in the cluster has a completely foobarred > {{OutboundTcpConnection}} for ~10 minutes before finally getting a > {{java.net.SocketException: Connection timed out (Write failed)}} and > immediately successfully reconnecting. 
I conclude that the old node is the > problem because the new node (the one that restarted) is sending ECHOs to the > old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new > node's ECHOs, but the new node is never getting the ECHO's. This appears, to > me, to indicate that the old node's {{OutboundTcpConnection}} thread is just > stuck and can't make any forward progress. By the time we could notice this > and slap TRACE logging on, the only thing we see is ~10 minutes later a > {{SocketException}} inside {{writeConnected}}'s flush and an immediate > recovery. It is interesting to me that the exception happens in > {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write > failure}} I believe that this can't be a connection reset), because my > understanding is that we should have a fully handshaked SSL connection at > that point in the code. > Current theory: > # "New" node restarts, "Old" node calls > [newSocket|https://githu
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424802#comment-16424802 ] Joseph Lynch edited comment on CASSANDRA-14358 at 4/4/18 12:49 AM: --- For what it's worth, we've opened an AWS ticket to find out if this is expected behaviour in VPCs (that the network blackholes packets on unknown flows rather than resetting them). I have a feeling the answer will be "yea, that's how stateful firewalls work, we can't keep flows forever", in which case Cassandra should probably be resilient to it. [~djoshi3] I think you've got it now. I'm proposing that we set Cassandra's socket options (opt-in, of course, like most socket options we set) to tune TCP's parameters to Cassandra's requirement of high availability. For blocking sockets, the mechanisms I'm familiar with are connect timeouts ([unix|http://man7.org/linux/man-pages/man2/connect.2.html], [java|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#connect-java.net.SocketAddress-int-]), receive timeouts ([SO_RCVTIMEO|https://linux.die.net/man/7/socket], [java|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-]) and send timeouts, which Java doesn't expose afaict ([SO_SNDTIMEO|https://linux.die.net/man/7/socket]). For {{O_NONBLOCK}} sockets I think the timeout is typically handled at the event loop level (e.g. I think that's what {{WriteTimeoutHandler}} and {{ReadTimeoutHandler}} do in Netty). I still need to test this thoroughly, but I believe asking users to set {{/proc/sys/net/ipv4/tcp_retries2 = 5}} is a reasonable workaround. 
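The two timeouts Java does expose take only a couple of lines to set; a minimal self-contained sketch (the loopback peer and the timeout values here are illustrative, not Cassandra's actual configuration):

```java
import java.net.*;

public class SocketTimeouts {
    public static void main(String[] args) throws Exception {
        // Stand-in peer: the kernel completes the TCP handshake, but no data ever arrives,
        // mimicking a blackholed connection.
        try (ServerSocket peer = new ServerSocket(0)) {
            Socket s = new Socket();
            // Bounded connect: fail after 2s instead of blocking in the connect syscall.
            s.connect(new InetSocketAddress("127.0.0.1", peer.getLocalPort()), 2000);
            // Bounded recv (SO_TIMEOUT): read() throws instead of blocking forever.
            s.setSoTimeout(200);
            try {
                s.getInputStream().read();
            } catch (SocketTimeoutException expected) {
                System.out.println("read timed out");
            }
            s.close();
        }
    }
}
```

Note the gap the comment describes: there is no third call for a bounded send, which is why a blocked {{write}} can only be cut short by the OS.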
I was able to simulate the exact error in the production logs pretty easily by doing the following: {noformat} $ sudo lsof -p $(pgrep -f Cassandra) -n | grep OTHER_IP | grep STORAGE_PORT java 5277 cassandra 64u IPv4 113860 0t0 TCP LOCAL_IP:38665->OTHER_IP:STORAGE_PORT (ESTABLISHED) java 5277 cassandra 69u IPv4 114797 0t0 TCP LOCAL_IP:STORAGE_PORT->OTHER_IP:54875 (ESTABLISHED) $ sudo iptables -A OUTPUT -p tcp -d OTHER_IP --sport 38665 --dport STORAGE_PORT -j DROP $ cqlsh cqlsh> CONSISTENCY ALL Consistency level set to ALL. cqlsh> select * from test_ks.test_cf WHERE key = 'key_that_lives_on_OTHER_IP'; OperationTimedOut: errors={}, last_host= ... timeouts happen for next ~15 minutes $ tail -20 system.log TRACE [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:05,009 OutboundTcpConnection.java:365 - Socket to /OTHER_IP closed DEBUG [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:05,010 OutboundTcpConnection.java:303 - error writing to /OTHER_IP java.net.SocketException: Connection timed out (Write failed) at java.net.SocketOutputStream.socketWrite0(Native Method) ~[na:1.8.0_152] at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[na:1.8.0_152] at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[na:1.8.0_152] at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431) ~[na:1.8.0_152] at sun.security.ssl.OutputRecord.write(OutputRecord.java:417) ~[na:1.8.0_152] at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886) ~[na:1.8.0_152] at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857) ~[na:1.8.0_152] at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123) ~[na:1.8.0_152] at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) ~[lz4-1.2.0.jar:na] at net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:222) ~[lz4-1.2.0.jar:na] at org.apache.cassandra.io.util.DataOutputStreamPlus.flush(DataOutputStreamPlus.java:55) 
~[cassandra-2.1.19.jar:2.1.19] at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:294) [cassandra-2.1.19.jar:2.1.19] at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:222) [cassandra-2.1.19.jar:2.1.19] DEBUG [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:49,867 OutboundTcpConnection.java:380 - attempting to connect to /OTHER_IP INFO [HANDSHAKE-/OTHER_IP] 2018-04-04 00:13:49,916 OutboundTcpConnection.java:496 - Handshaking version with /OTHER_IP TRACE [MessagingService-Outgoing-/OTHER_IP] 2018-04-04 00:13:49,960 OutboundTcpConnection.java:453 - Upgrading OutputStream to be compressed $ cqlsh cqlsh> CONSISTENCY ALL Consistency level set to ALL. cqlsh> select * from test_ks.test_cf WHERE key = 'key_that_lives_on_OTHER_IP'; ... results come back and everything is happy $ sudo lsof -p $(pgrep -f Cassandra) -n | grep OTHER_IP | grep STORAGE_PORT java 5277 cassandra 134u IPv4 113860 0t0 TCP LOCAL_IP:33417->OTHER_IP:STORAGE_PORT (ESTABLI
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424033#comment-16424033 ] Jason Brown edited comment on CASSANDRA-14358 at 4/3/18 1:42 PM: - For trunk, we already use {{IdleStateHandler}} [in the outbound pipeline|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/OutboundHandshakeHandler.java#L196] for the "not making progress sending data" case. Take a look at {{MessageOutHandler#userEventTriggered}}. iirc, {{WriteTimeoutHandler}} / {{ReadTimeoutHandler}} does not fit our use case for messaging; I'd have to look again to page it back in. Further, [~jolynch], I thought your problem was in pre-4.0? There is a problem (or, at least, a non-optimization) with {{MessagingService.SocketThread}} where the single-threaded accept thread handles receiving the first message of the c* internode messaging protocol, which may require executing the TLS handshake before that can be received (if using internode TLS). Thus a single new establishing connection may block the accept thread for 500-1000 ms for a cross-datacenter connection. Even worse, if you are using the {{Ec2MultiregionSnitch}}, all connections for the local region will be torn down and built anew as we switch from the public IP to the region-local private IP - more traffic on the accept thread. So one slow connection interaction screws the whole accept thread. I have an internal patch for 3.0 which, when a new socket is accepted, moves all processing (internode protocol handshake, TLS handshake) to a separate thread (via an executor group). So, if one connection runs into a problem, the other incoming connections are not held up. I've been reticent about OSS'ing this late in the 3.0/3.11 branches, even though it's not that clever or invasive. Let me know if you'd be interested in giving it a shot and I can post it. If it's useful/helpful we can consider committing it, as well. 
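The shape of that internal patch, as described, is simply an accept loop that does nothing but accept and hands each new socket to an executor group. A runnable sketch of the idea (class, thread, and method names here are hypothetical, and the handshake body is a placeholder for the real TLS + internode protocol handshake):

```java
import java.net.*;
import java.util.concurrent.*;

public class AcceptOffload {
    static final ExecutorService handshakePool = Executors.newFixedThreadPool(4);
    static final CountDownLatch done = new CountDownLatch(1);
    static volatile String handshakeThread;

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);
        Thread acceptThread = new Thread(() -> {
            try {
                Socket incoming = server.accept();               // accept stays cheap
                handshakePool.submit(() -> handshake(incoming));  // slow work goes off-thread
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, "accept-thread");
        acceptThread.start();
        try (Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            done.await();
        }
        // The handshake ran on a pool thread, so a slow one can't stall the accept loop.
        System.out.println(!"accept-thread".equals(handshakeThread));
        handshakePool.shutdown();
        server.close();
    }

    static void handshake(Socket s) {
        // placeholder for TLS handshake + internode protocol handshake
        handshakeThread = Thread.currentThread().getName();
        done.countDown();
    }
}
```

With this structure, a connection stuck in its handshake occupies only one pool thread while the accept thread keeps draining the backlog.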
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423205#comment-16423205 ] Joseph Lynch edited comment on CASSANDRA-14358 at 4/3/18 12:42 AM: --- [~aweisberg] if the issue is what I think it is (and I definitely need more supporting evidence), I think we can fix it in trunk with a netty {{WriteTimeoutHandler}}/{{ReadTimeoutHandler}} which will throw away the tcp connection if recv/write take too long. I think it's reasonable to say in earlier versions we are not going to patch it past connection and recv timeouts, and just recommend {{/proc/sys/net/ipv4/tcp_retries2 = 5}}. I tested the sysctl and it leads to a much shorter partition with the iptables test, like 20s instead of 10 minutes. I'm testing it out on a large cluster with a rolling restart soon. [~pauloricardomg] I believe that CASSANDRA-9630 is a separate issue (although +1 to a 2.1.x backport there since it's so simple and 2.1.x is actually still used in places) because in this case I believe we have an established connection that is blackholing. I'll double-check this next time I run the test and ensure that the tcp connection state is {{ESTABLISHED}}. 
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422969#comment-16422969 ] Joseph Lynch edited comment on CASSANDRA-14358 at 4/2/18 7:56 PM: -- [~djoshi3] I agree that the issue is probably a bad TCP-level connection, but I also think it's a bug that Cassandra blocks forever, never making any progress and potentially causing an outage for users (until the OS kills the connection); we're supposed to be highly available, right ;). In non-JVM languages this would be as simple (I think) as setting {{SO_SNDTIMEO}} to a reasonable number like 30s, but I don't see how to do it without JNI in Cassandra 2.1.x/3.0.x/3.11.x. If this is the issue (I think we still need more data), I also think recommending users tune {{tcp_retries2}} to 8, 5, or even 3 would be a reasonable workaround if it's fixed in trunk. I haven't tried in trunk (tbh we can't even build/deploy trunk right now), but unfortunately I think it would be vulnerable to the same kind of issue since afaict {{[NettyFactory|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/NettyFactory.java#L358-L377]}} doesn't add any kind of {{[WriteTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/WriteTimeoutHandler.html]}} or {{[ReadTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/ReadTimeoutHandler.html]}} which will throw an exception (I believe) if the write/read on the socket doesn't complete in a reasonable amount of time. Admittedly I'm very new to the netty refactor in 4.0 so I might be missing the part where we set a bunch of timeouts somewhere. 
I don't want to invest too much time testing trunk until I confirm the type of partition that is actually happening, but if our theory is correct it should be easy to reproduce on trunk with the above iptables rule (if trunk handles it then the tcp connection should get thrown out after ~30s or something reasonable and we re-connect). 
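The semantics those Netty timeout handlers provide can be approximated in plain JDK terms: run the blocking I/O under a deadline and fail the connection when the deadline passes. A sketch of the semantics under that assumption, not of Netty's implementation (the blackholed "write" here is simulated with a sleep):

```java
import java.io.IOException;
import java.util.concurrent.*;

public class TimedWrite {
    static final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Run a blocking I/O task; if it hasn't finished within the deadline,
    // give up and surface an IOException (analogous to closing the channel).
    static void withDeadline(Runnable io, long millis) throws IOException {
        Future<?> f = worker.submit(io);
        try {
            f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);
            throw new IOException("I/O timed out after " + millis + "ms");
        } catch (InterruptedException | ExecutionException e) {
            throw new IOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulated write that never completes, like the stuck OutboundTcpConnection.
            withDeadline(() -> {
                try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
            }, 200);
        } catch (IOException e) {
            System.out.println("gave up: " + e.getMessage());
        }
        worker.shutdownNow();
    }
}
```

The important property is the same one the ticket asks for: the caller gets control back on a bounded schedule instead of waiting for the OS to conclude the peer is gone.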
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422783#comment-16422783 ] Joseph Lynch edited comment on CASSANDRA-14358 at 4/2/18 5:08 PM: -- [~djoshi3] yea I think you're probably right, but I think this might not be a slow consumer so much as a never consumer. Specifically, if there is a stateful firewall (e.g. security groups or a VPC) in the way, the network could absolutely blackhole packets on a connection that's been reset. My plan for this week is to try to catch this happening and run the following analysis: # Get a netstat view of established connections and their send buffers on both sides # Slap a tcpdump on both sides to see if {{RESETS}} from the restarted node are even getting to the old node (the VPC might be swallowing them) # If #1 or #2 confirm the theory that we're just in a stuck tcp connection, I will try tuning {{tcp_retries2}} down to ~5, and setting {{SO_SNDBUF}} or {{wmem_default}} down to a more reasonable number (I think right now it's like 16 megs) to see if that fixes the issue For what it's worth, Cassandra is _really_ intolerant of these kinds of network partitions. For example, if I tell a stateful firewall to just drop all packets on one of Cassandra's internode connections with {{iptables -A OUTPUT -p tcp -d --sport --dport -j DROP}}, Cassandra continues sending packets out on that {{OutboundTcpConnection}} until the operating system says "hey that's probably not going to work" 15 minutes later. Unfortunately we have two TCP connections so I can't just do a simple heartbeat mechanism. 
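The ~15-minute figure falls out of the kernel's retransmission backoff: each unacknowledged retransmit roughly doubles the retransmission timeout up to a 120s cap, and {{tcp_retries2}} bounds how many times that happens. A back-of-the-envelope calculation (the 200ms starting RTO and the 120s cap are approximations of the Linux defaults; the real total depends on the measured RTT):

```java
public class Retries2 {
    // Sum an exponentially backed-off retransmit schedule: rto, 2*rto, 4*rto, ...
    // capped at 120s per attempt, for the given number of retries.
    static double totalSeconds(int retries, double initialRtoSeconds) {
        double rto = initialRtoSeconds, total = 0;
        for (int i = 0; i < retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, 120);  // Linux caps the RTO at TCP_RTO_MAX (120s)
        }
        return total;
    }

    public static void main(String[] args) {
        // With the ~200ms RTO floor; a larger measured RTO pushes the
        // retries2=5 total up toward the ~20s observed in the iptables test.
        System.out.printf("retries2=15 -> ~%.0fs%n", totalSeconds(15, 0.2));
        System.out.printf("retries2=5  -> ~%.0fs%n", totalSeconds(5, 0.2));
    }
}
```

So with the default of 15 the sender keeps probing a dead connection for well over ten minutes, while 5 retries gives up within seconds to tens of seconds, which matches the behavior reported in this thread.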