[jira] [Comment Edited] (CASSANDRA-14362) Node unable to join cluster in trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429722#comment-16429722 ] Dinesh Joshi edited comment on CASSANDRA-14362 at 4/9/18 4:58 PM: -- [~jasobrown] / [~aweisberg] I think I know what the issue is. [~jasobrown] not only changed the {{OutboundConnectionIdentifier}} to prefer the local address of the node but also removed the binding to a specific local IP address during {{NettyFactory#createOutboundBootstrap}} setup. This had an unintended consequence. When you leave the {{Socket}}'s source IP (local address) unset, the kernel will choose one for you. This normally is not an issue. However, during the dtest run all nodes are local and rely on 127.0.0.x IPs. When the C* nodes try to setup the streaming connections, they get tripped up due to this change. Say node1 (127.0.0.1) & node2 (127.0.0.2) are started up by the dtest (I know in reality its 3 nodes but for the sake of simplicity I'm limiting it to two nodes). node1 attempts to create an outbound connection for streaming with node2, the outbound socket should actually be {{127.0.0.1:ANY_PORT -> 127.0.0.2:INTERNODE_PORT}}. However, per this patch, we leave the source IP unbound. When the connection is established and node1 sends the {{FirstHandshakeMessage}}, it has a source IP set to set to an arbitrarily chose IP by the kernel - in this case it was {{127.0.0.2}} (when it should actually be {{127.0.0.1}}). The receiving node (node2) gets this message {{InboundHandshakeHandler#handleStart}}, decodes it fine but when it goes to {{InboundHandshakeHandler#setupStreamingPipeline}}, it creates a {{StreamingInboundHandler}} with remote IP set to {{127.0.0.2}}. This messes up the streaming pipeline because now node2 is waiting on {{127.0.0.2}} (itself) for data. Hence we find it stuck in the bootstrap phase - {noformat} "Stream-Deserializer-127.0.0.2:7000-4004ef3b" #133 daemon prio=5 os_prio=35 tid=0x7ff58c601ca0 nid=0xdc03 waiting on condition [0x77613000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at java.lang.Thread.sleep(Thread.java:340) at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) at com.google.common.util.concurrent.Uninterruptibles.sleepUninterruptibly(Uninterruptibles.java:285) at org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:174) at java.lang.Thread.run(Thread.java:748) "main" #1 prio=5 os_prio=35 tid=0x7ff58c706000 nid=0x2803 waiting on condition [0x74e5e000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0007a11a3f98> (a org.apache.cassandra.streaming.StreamResultFuture) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:497) at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1519) at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:949) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:651) - locked <0x0007a40f80b8> (a org.apache.cassandra.service.StorageService) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:586) - locked <0x0007a40f80b8> (a org.apache.cassandra.service.StorageService) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:367) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:590) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:679) {noformat} I also noted this error in the logs which further points to the wrong IP for the peer being used - {noformat} ERROR [Stream-Deserializer-127.0.0.2:53830-7b781083] 2018-04-06 13:19:25,863 StreamingInboundHandler.java:210 - [Stream channel: 7b781083] stream operation from 127.0.0.2:53830 failed java.lang.IllegalArgumentException: Unknown peer requested: 127.0.0.2:7000 at org.apache.cassandra.streaming.StreamCoordinator.getHostData(StreamCoordinator.java:242) at org.apache.cassandra.streaming.StreamCoordinator.getSessionById(StreamCoordinator.java:170) at org.apache.cassandra.streaming.StreamResultFuture.getSession(StreamResultFuture.java:237) at org.apache.cassandra.streaming.StreamManager.findSession(StreamManager.java:194) at org.apache.cassandra.streaming.StreamManager.findSession(StreamManager.java:185) at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:41) at
[jira] [Comment Edited] (CASSANDRA-14362) Node unable to join cluster in trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426241#comment-16426241 ] Jason Brown edited comment on CASSANDRA-14362 at 4/5/18 1:58 AM: - So it looks like when working on CASSANDRA-8457/CASSANDRA-12229, I introduced a regression of CASSANDRA-12673. The short of it is we no longer bind the local side of a outbound connection as of CASSANDRA-12673, yet I mistakenly brought that back (see [{{NettyFactory#createOutboundBootstrap()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/NettyFactory.java#L337]). I believe the most correct fix is just not setting the local address to bind to for outbound connections, as that's bascially what CASSANDRA-12673 does. Removing the local bind should resolve [~Lerh Low]'s error as well as eliminate the regression against CASSANDRA-12673. However, we should also incorporate [~Lerh Low]'s fix as well, to clarify the intent of which address we want to 'reference' (but not actually use) on the local side. ||14362|| |[branch|https://github.com/jasobrown/cassandra/tree/14362]| |[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14362]| tests are running now. [~aweisberg] do you mind reviewing? UPDATE: dtests seem to be pretty unhappy. I'll investigate what's wrong and ping this ticket with updates. was (Author: jasobrown): So it looks like when working on CASSANDRA-8457/CASSANDRA-12229, I introduced a regression of CASSANDRA-12673. The short of it is we no longer bind the local side of a outbound connection as of CASSANDRA-12673, yet I mistakenly brought that back (see [{{NettyFactory#createOutboundBootstrap()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/NettyFactory.java#L337]). I believe the most correct fix is just not setting the local address to bind to for outbound connections, as that's bascially what CASSANDRA-12673 does. Removing the local bind should resolve [~Lerh Low]'s error as well as eliminate the regression against CASSANDRA-12673. However, we should also incorporate [~Lerh Low]'s fix as well, to clarify the intent of which address we want to 'reference' (but not actually use) on the local side. ||14362|| |[branch|https://github.com/jasobrown/cassandra/tree/14362]| |[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14362]| tests are running now. [~aweisberg] do you mind reviewing? > Node unable to join cluster in trunk > > > Key: CASSANDRA-14362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14362 > Project: Cassandra > Issue Type: Bug >Reporter: Lerh Chuan Low >Assignee: Jason Brown >Priority: Major > Fix For: 4.0 > > > I'm not sure if anybody has tried using {{trunk}} and starting an actual EC2 > cluster, but with a 3 node setup that works on 3.11, it fails to work on > {{trunk}}. I mistakenly stumbled into it while testing my own code changes. > My setup is as follows: > broadcast_rpc_address: publicIP > broadcast_address: publicIP > listen_address: omitted. Ends up as privateIP. > Works on 3.11 just fine. > On {{trunk}} though, it never works. My node is never able to join the > cluster: > {code} > Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO [main] 2018-04-03 > 05:57:05,895 RangeStreamer.java:195 - Bootstrap: range > (-128373781239966537,-122439194129870521] exists on 52.88.241.181:7000 for > keyspace system_traces > Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO [main] 2018-04-03 > 05:57:05,895 RangeStreamer.java:195 - Bootstrap: range > (6968670424536541270,6973888347502882935] exists on 52.88.241.181:7000 for > keyspace system_traces > Apr 03 05:57:42 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:57:42,298 FailureDetector.java:324 - Not marking nodes down due > to local pause of 26215173446ns > 50ns > Apr 03 05:57:53 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:57:53,035 FailureDetector.java:324 - Not marking nodes down due > to local pause of 10736485907ns > 50ns > Apr 03 05:58:30 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:58:30,790 Gossiper.java:814 - Gossip stage has 28 pending > tasks; skipping status check (no nodes will be marked down) > Apr 03 05:58:33 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:58:33,060 Gossiper.java:814 - Gossip stage has 20 pending > tasks; skipping status check (no nodes will be marked down) > Apr 03 06:04:33 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 06:04:33,826 FailureDetector.java:324 - Not marking nodes down due > to local pause of 400790432954ns > 50ns > Apr 03 06:04:49
[jira] [Comment Edited] (CASSANDRA-14362) Node unable to join cluster in trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424869#comment-16424869 ] Lerh Chuan Low edited comment on CASSANDRA-14362 at 4/4/18 2:02 AM: I'm using EC2Snitch (mostly was just using bare bones to get a cluster working). If the commit I made seems like the appropriate fix to you (I may be missing something :/ ) then I'm happy to create a proper patch for it. (And feel free to LMK if you need any more details) was (Author: lerh low): I'm using EC2Snitch (mostly was just using bare bones to get a cluster working). If the commit I made seems like the appropriate fix to you (I may be missing something :/ ) then I'm happy to create a proper patch for it. > Node unable to join cluster in trunk > > > Key: CASSANDRA-14362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14362 > Project: Cassandra > Issue Type: Bug >Reporter: Lerh Chuan Low >Priority: Major > Fix For: 4.0 > > > I'm not sure if anybody has tried using {{trunk}} and starting an actual EC2 > cluster, but with a 3 node setup that works on 3.11, it fails to work on > {{trunk}}. I mistakenly stumbled into it while testing my own code changes. > My setup is as follows: > broadcast_rpc_address: publicIP > broadcast_address: publicIP > listen_address: omitted. Ends up as privateIP. > Works on 3.11 just fine. > On {{trunk}} though, it never works. My node is never able to join the > cluster: > {code} > Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO [main] 2018-04-03 > 05:57:05,895 RangeStreamer.java:195 - Bootstrap: range > (-128373781239966537,-122439194129870521] exists on 52.88.241.181:7000 for > keyspace system_traces > Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO [main] 2018-04-03 > 05:57:05,895 RangeStreamer.java:195 - Bootstrap: range > (6968670424536541270,6973888347502882935] exists on 52.88.241.181:7000 for > keyspace system_traces > Apr 03 05:57:42 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:57:42,298 FailureDetector.java:324 - Not marking nodes down due > to local pause of 26215173446ns > 50ns > Apr 03 05:57:53 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:57:53,035 FailureDetector.java:324 - Not marking nodes down due > to local pause of 10736485907ns > 50ns > Apr 03 05:58:30 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:58:30,790 Gossiper.java:814 - Gossip stage has 28 pending > tasks; skipping status check (no nodes will be marked down) > Apr 03 05:58:33 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 05:58:33,060 Gossiper.java:814 - Gossip stage has 20 pending > tasks; skipping status check (no nodes will be marked down) > Apr 03 06:04:33 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 06:04:33,826 FailureDetector.java:324 - Not marking nodes down due > to local pause of 400790432954ns > 50ns > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: WARN [GossipTasks:1] > 2018-04-03 06:04:49,133 Gossiper.java:814 - Gossip stage has 2 pending tasks; > skipping status check (no nodes will be marked down) > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ERROR > [StreamConnectionEstablisher:1] 2018-04-03 06:04:49,138 > StreamSession.java:524 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6] > Streaming error occurred on session with peer 52.88.241.181:7000 > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: java.io.IOException: failed > to connect to 52.88.241.181:7000 (STREAM) for streaming data > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at > org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:98) > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at > org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:57) > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at > org.apache.cassandra.streaming.async.NettyStreamingMessageSender.createChannel(NettyStreamingMessageSender.java:183) > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at > org.apache.cassandra.streaming.async.NettyStreamingMessageSender.setupControlMessageChannel(NettyStreamingMessageSender.java:165) > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at > org.apache.cassandra.streaming.async.NettyStreamingMessageSender.sendMessage(NettyStreamingMessageSender.java:222) > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at > org.apache.cassandra.streaming.async.NettyStreamingMessageSender.initialize(NettyStreamingMessageSender.java:146) > Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: at >