Curator has a feature called "protected mode". It adds a UUID to the node name,
and when there is a connection issue or other failure during the create it tries
to find the node it created (see here:
https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L601).
 I wonder why this mechanism is getting defeated. It would be nice to get a 
test simulation that reproduces this. It's possible that the retry policy is 
expiring and FindAndDeleteProtectedNodeInBackground is giving up and rethrowing 
the exception. What is your retry policy?
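
To make that hypothesis concrete, here is a toy model of the idea behind protected mode, using only the JDK. It is a sketch under stated assumptions, not Curator's code: FakeServer, ConnectionLoss, createProtected and findProtectedNode are hypothetical names, the retry budget is a plain counter rather than a real RetryPolicy, and the background cleanup done by FindAndDeleteProtectedNodeInBackground is not modelled. The only detail taken from the report is the node name shape "_c_<uuid>-lock-" seen in the log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

// Toy model only: none of these types exist in Curator or ZooKeeper.
class ProtectedModeSketch {

    /** Simulated connection loss between server and client. */
    static class ConnectionLoss extends RuntimeException {
        ConnectionLoss() { super("ConnectionLoss (simulated)"); }
    }

    /** In-memory stand-in for the children of one parent znode. */
    static class FakeServer {
        final List<String> children = new ArrayList<>();
        private int sequence = 0;
        boolean dropNextResponse = false;

        /** Creates a sequential node, then optionally loses the response. */
        String createSequential(String prefix) {
            String name = prefix + String.format("%010d", sequence++);
            children.add(name); // the node exists on the server from here on
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new ConnectionLoss(); // the client never sees the name
            }
            return name;
        }
    }

    /** Looks for a node carrying this client's protection id. */
    static Optional<String> findProtectedNode(List<String> children, String protectedId) {
        String marker = "_c_" + protectedId + "-";
        return children.stream().filter(c -> c.startsWith(marker)).findFirst();
    }

    /**
     * Creates a node in "protected" style: on each retry, first check whether
     * an earlier attempt already created the node, and adopt it if so.
     */
    static String createProtected(FakeServer server, String baseName, int maxRetries) {
        String protectedId = UUID.randomUUID().toString();
        String prefix = "_c_" + protectedId + "-" + baseName;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (attempt > 0) {
                Optional<String> existing = findProtectedNode(server.children, protectedId);
                if (existing.isPresent()) {
                    return existing.get(); // recovered, no duplicate created
                }
            }
            try {
                return server.createSequential(prefix);
            } catch (ConnectionLoss e) {
                // Response lost: fall through and retry if budget remains.
            }
        }
        // Retry budget exhausted before the lookup could run: the node may
        // still exist on the server with nobody owning it.
        throw new IllegalStateException("retries exhausted; node may be orphaned");
    }

    public static void main(String[] args) {
        FakeServer server = new FakeServer();
        server.dropNextResponse = true;
        String name = createProtected(server, "lock-", 1);
        System.out.println("recovered node: " + name);
        System.out.println("children on server: " + server.children.size());
    }
}
```

With a retry budget of at least 1 the lost response is recovered and only one node exists. With a budget of 0 the call throws while the node stays on the fake server, which is the orphan scenario I am speculating about above.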

-Jordan

> On Jul 30, 2021, at 12:29 AM, H S <hsbugrepo...@icloud.com> wrote:
> 
> Hi,
> 
> While using the LeaderSelector recipe I noticed what appears to be an issue 
> where under some circumstances during zookeeper failover or network issues, 
> orphaned ephemeral nodes are created resulting in no leader election for a 
> cluster. I have reproduced this issue in versions 5.1.0 and 5.2.0.
> 
> What appears to be happening is the following:
>   1) Node A is attempting to acquire the interprocess lock.
>   2) It attempts to create its ephemeral node by calling 
> StandardLockInternalsDriver.createsTheLock
>   3) The zookeeper client issues the request to the zookeeper server
>   4) The zookeeper server creates the ephemeral node
>   5) While the response is being returned from the server to the zookeeper 
> client, the channel is broken, resulting in an EndOfStreamException.
>   6) This results in an unhandled ConnectionLossException propagating all the 
> way up the LeaderSelector.internalRequeue call stack, killing the submitted 
> task without deleting the created ephemeral node
>   7) The zookeeper session for the client is still valid, resulting in the 
> ephemeral node remaining orphaned indefinitely.
>   8) During all subsequent requeue attempts the orphaned node is a 
> predecessor of all nodes and treated as if it is the leader, however, it's 
> not running because it errored out before calling the selector listener.
>   9) Currently the only way to resolve the issue appears to be to check the 
> number of participants around any failover occurrences and if more 
> participants are listed than nodes, the framework session associated with the 
> extra participants must be closed to invalidate its session and delete the 
> orphaned node.
> 
> I have recreated the issue by repeatedly restarting the nodes in a zookeeper 
> cluster to simulate failover until the orphaned nodes can be seen using 'echo 
> dump | nc zookeeperHost zookeeperPort'
> 
> I turned on debugging when reproducing the issue and below is the sample log 
> from the IO error and the associated uncaught thread exception.
> 
> 2021-07-29 22:23:50.367-0500 WARN APP= COMP= (localhost:2182) ClientCnxn Session 0x200086ade230000 for sever localhost/0:0:0:0:0:0:0:1:2183, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x200086ade230000, likely server has closed socket
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
> 2021-07-29 22:23:50.475-0500 DEBUG APP= COMP= LeaderSelector-1 RetryLoopImpl Retry-able exception received
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /MyApp/MyLeaderKey/_c_15e65f7d-4f53-4213-a915-16d3aa318c90-lock-
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1837)
> at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216) [110:curator-framework:5.2.0]
> at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193) [110:curator-framework:5.2.0]
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) [109:curator-client:5.2.0]
> at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190) [110:curator-framework:5.2.0]
> at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605) [110:curator-framework:5.2.0]
> at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) [110:curator-framework:5.2.0]
> at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) [110:curator-framework:5.2.0]
> at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) [110:curator-framework:5.2.0]
> at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:231) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:242) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:93) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:412) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:483) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:66) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:247) [111:curator-recipes:5.2.0]
> at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:241) [111:curator-recipes:5.2.0]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
> at java.lang.Thread.run(Thread.java:748) [?:?]
> 
> Thanks.
> 
> -hs
