Jiri Horky created CASSANDRA-7696: ------------------------------------- Summary: Fail to reconnect to other nodes after intermittent network failure (SSL + auth enabled) Key: CASSANDRA-7696 URL: https://issues.apache.org/jira/browse/CASSANDRA-7696 Project: Cassandra Issue Type: Bug Components: Core Environment: Cassandra 1.2.16, CentOS 6.5, Cluster spread to 3 datacenters with 1 node each. SSL enabled between Cassandra nodes and also to clients, authentization enabled. Reporter: Jiri Horky
We experienced a strange problem after intermittent network failure when the affected node did not reconnect to the rest of the cluster but did allow to autenticate users (which was not possible during the actual network outage, see below). The cluster consists of 1 node in each of 3 datacenters, it uses C* 1.2.16 with SSL enabled both to clients and between C* nodes. The authentication is enabled as well. The problem started around 2014-08-01 when Cassandra first noticed a network problem: INFO [GossipTasks:1] 2014-08-01 07:47:52,618 Gossiper.java (line 823) InetAddress /A.B.D.20 is now DOWN INFO [GossipTasks:1] 2014-08-01 07:47:55,619 Gossiper.java (line 823) InetAddress mia10/A.B.C.20 is now DOWN The network came up for a while: INFO [GossipStage:1] 2014-08-01 07:51:29,380 Gossiper.java (line 809) InetAddress /A.B.C.20 is now UP INFO [HintedHandoff:1] 2014-08-01 07:51:29,381 HintedHandOffManager.java (line 296) Started hinted handoff for host: 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20 INFO [GossipStage:1] 2014-08-01 07:51:29,381 Gossiper.java (line 809) InetAddress /A.B.D.20 is now UP INFO [HintedHandoff:2] 2014-08-01 07:51:29,385 HintedHandOffManager.java (line 296) Started hinted handoff for host: 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20 But it failed to send hints: INFO [HintedHandoff:1] 2014-08-01 07:51:39,389 HintedHandOffManager.java (line 427) Timed out replaying hints to /A.B.C.20; aborting (0 delivered) INFO [HintedHandoff:2] 2014-08-01 07:51:39,390 HintedHandOffManager.java (line 427) Timed out replaying hints to /A.B.D.20; aborting (0 delivered) Also, the log started to be "flooded" with failed autentication tries. My understanding is that authentication data are read with QUORUM which failed as the other two nodes were down: ERROR [Native-Transport-Requests:446116] 2014-08-01 07:51:39,985 QueryMessage.java (line 97) Unexpected error during query com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses. at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258) at com.google.common.cache.LocalCache.get(LocalCache.java:3990) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878) at org.apache.cassandra.service.ClientState.authorize(ClientState.java:292) at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:172) at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:165) at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:149) at org.apache.cassandra.cql3.statements.SelectStatement.checkAccess(SelectStatement.java:116) at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:102) at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:113) at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:87) at org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287) at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43) at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses. at org.apache.cassandra.auth.Auth.selectUser(Auth.java:256) at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84) at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50) at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68) at org.apache.cassandra.service.ClientState$1.load(ClientState.java:278) at org.apache.cassandra.service.ClientState$1.load(ClientState.java:275) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252) ... 20 more Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses. at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105) at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930) at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815) at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140) at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245) ... 29 more These exception stopped to appear at 2014-08-01 08:59:17,48 (and did not reapper after that), and cassandra seemed to be happy as it marked the other two nodes as up for a while. But after a few tens of seconds, it marked them as DOWN again and kept it in this state until restart three days later, even the network connectivity was stable: INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809) InetAddress /A.B.C.20 is now UP INFO [HintedHandoff:2] 2014-08-01 09:02:37,305 HintedHandOffManager.java (line 296) Started hinted handoff for host: 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809) InetAddress /A.B.D.20 is now UP INFO [HintedHandoff:1] 2014-08-01 09:02:37,308 HintedHandOffManager.java (line 296) Started hinted handoff for host: 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20 INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java (line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968 serialized/live bytes, 32 ops) INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line 398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live bytes, 32 ops) INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line 443) Completed flushing; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1403712545417, position=20482758) INFO [HintedHandoff:1] 2014-08-01 09:02:47,312 HintedHandOffManager.java (line 427) Timed out replaying hints to /A.B.D.20; aborting (0 delivered) INFO [HintedHandoff:2] 2014-08-01 09:02:47,312 HintedHandOffManager.java (line 427) Timed out replaying hints to /A.B.C.20; aborting (0 delivered) INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823) InetAddress /A.B.D.20 is now DOWN INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823) InetAddress mia10/A.B.C.20 is now DOWN After one day, the node started to discards hints it saved for the other nodes. WARN [OptionalTasks:1] 2014-08-02 09:09:10,277 HintedHandoffMetrics.java (line 78) /A.B.D.20 has 9039 dropped hints, because node is down past configured hint window. WARN [OptionalTasks:1] 2014-08-02 09:09:10,290 HintedHandoffMetrics.java (line 78) /A.B.C.20 has 8816 dropped hints, because node is down past configured hint window. The "nodetool status" marked the other two nodes as down. After restart, it reconnected to the cluster without any problem. What puzzles me is the fact that the authentization apparently started to work after the network recovered but the exchange of data did not. -- This message was sent by Atlassian JIRA (v6.2#6252)