Curator has a feature called "protected mode". It adds a UUID to the node name and when there is connection issue or other connection it tries to find the node it created (see here: https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L601 <https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L601>). I wonder why this mechanism is getting defeated. It would be nice to get a test simulation that reproduces this. It's possible that the retry policy is expiring and FindAndDeleteProtectedNodeInBackground is giving up and rethrowing the exception. What is your retry policy?
-Jordan > On Jul 30, 2021, at 12:29 AM, H S <hsbugrepo...@icloud.com> wrote: > > Hi, > > While using the LeaderSelector recipe I noticed what appears to be an issue > where under some circumstances during zookeeper failover or network issues, > orphaned ephemeral nodes are created resulting in no leader election for a > cluster. I have reproduced this issue in versions 5.1.0 and 5.2.0. > > What appears to happening is the following: > 1) Node A is attempting to acquire the interprocess lock. > 2) It attempts to create its ephemeral node by calling > StandardLockInternalsDriver.createsTheLock > 3) The zookeeper client issues the request to the zookeeper server > 4) The zookeeper server creates the ephemeral node > 5) While the response is being returned from the server to the zookeeper > client, the channel is broken, resulting in an EndOfStreamException. > 6) This results in an unhandled ConnectionLossException propagating all the > way up the LeaderSelector.internalRequeue call stack, killing the submitted > task without deleting the created ephemeral node > 7) The zookeeper session for the client is still valid, resulting in the > ephemeral node remaining orphaned indefinitely. > 8) During all subsequent requeue attempts the orphaned node is a > predecessor of all nodes and treated as if it is the leader, however, it's > not running because it errored out before calling the selector listener. > 9) Currently the only way to resolve the issue appears to be to check the > number of participants around any failover occurrences and if more > participants are listed than nodes, the framework session associated with the > extra participants must be closed to invalidate its session and delete the > orphaned node. > > I have recreated the issue by repeatedly restarting the nodes in a zookeeper > cluster to simulate failover until the orphaned nodes can be seen using 'echo > dump | nc zookeeperHost zookeeperPort' > > I turned on debugging when reproducing the issue and below is the sample log > from the IO error and the associated uncaught thread exception. > > 2021-07-29 22:23:50.367-0500 WARN APP= COMP= (localhost:2182) ClientCnxn > Session 0x200086ade230000 for sever localhost/0:0:0:0:0:0:0:1:2183, Closing > socket connection. Attempting reconnect except it is a > SessionExpiredException. > org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read > additional data from server sessionid 0x200086ade230000, likely server has > closed socket > at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) > at > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) > 2021-07-29 22:23:50.475-0500 DEBUG APP= COMP= LeaderSelector-1 RetryLoopImpl > Retry-able exception received > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for > /MyApp/MyLeaderKey/_c_15e65f7d-4f53-4213-a915-16d3aa318c90-lock- > at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1837) > at > org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216) > [110:curator-framework:5.2.0] > at > org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193) > [110:curator-framework:5.2.0] > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) > [109:curator-client:5.2.0] > at > org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190) > [110:curator-framework:5.2.0] > at > org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605) > [110:curator-framework:5.2.0] > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) > [110:curator-framework:5.2.0] > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) > [110:curator-framework:5.2.0] > at > org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) > [110:curator-framework:5.2.0] > at > org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:231) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:242) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:93) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:412) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:483) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:66) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:247) > [111:curator-recipes:5.2.0] > at > org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:241) > [111:curator-recipes:5.2.0] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?] > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [?:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [?:?] > at java.lang.Thread.run(Thread.java:748) [?:?] > > Thanks. > > -hs