I didn't try adding the extra delete to see if it had an impact, but I created some docker containers that can be used to replicate the issue, if that helps. Sometimes it takes a fair number of restarts of the zookeeper containers to trigger it, but I have always been able to reproduce it by restarting the zookeeper instances enough times. You can find the source for building/running the docker containers here: https://github.com/hsbugreports/zookeeper_test
Thanks.
-hs

> On Jul 30, 2021, at 2:32 PM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>
> A quick thought... I wonder if we need to add a guaranteed delete right here for the node, just in case:
> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L625
>
>> On Jul 30, 2021, at 2:29 PM, H S <hsbugrepo...@icloud.com> wrote:
>>
>> Hi,
>>
>> Thanks for the quick response. On the call to CuratorFrameworkFactory.newClient I am specifying RetryForever as the retry policy, if that is the policy you are referring to. I did notice that the retry loop was also occasionally getting interrupted (example stack trace below), but that seemed to happen in a create call, not a delete; perhaps something similar is happening in the delete scenario and never getting logged. I have also noticed that sometimes when the issue occurs I do not see that original debug stack trace, so I think there may be a couple of possible races causing it. Maybe I'll try to put together a docker compose image or something that simulates what I was doing to create the issue, so that it can be independently verified.
>>
>> Thanks.
>> -hs
>>
>> java.lang.InterruptedException: sleep interrupted
>>     at java.lang.Thread.sleep(Native Method) ~[?:?]
>>     at java.lang.Thread.sleep(Thread.java:340) ~[?:?]
>>     at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) ~[?:?]
>>     at org.apache.curator.RetryLoopImpl.lambda$static$0(RetryLoopImpl.java:39) ~[109:curator-client:5.2.0]
>>     at org.apache.curator.retry.RetryForever.allowRetry(RetryForever.java:50) [109:curator-client:5.2.0]
>>     at org.apache.curator.RetryLoopImpl.takeException(RetryLoopImpl.java:76) [109:curator-client:5.2.0]
>>     at org.apache.curator.connection.ThreadLocalRetryLoop$WrappedRetryLoop.takeException(ThreadLocalRetryLoop.java:115) [109:curator-client:5.2.0]
>>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:99) [109:curator-client:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:231) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:242) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:93) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:412) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:483) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:66) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:247) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:241) [111:curator-recipes:5.2.0]
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
>>     at java.lang.Thread.run(Thread.java:748) [?:?]
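>>
>> For reference, the client construction looks roughly like this (a minimal sketch; the three-node connect string and the one-second retry interval are illustrative, not my exact values):
>>
>>     import org.apache.curator.framework.CuratorFramework;
>>     import org.apache.curator.framework.CuratorFrameworkFactory;
>>     import org.apache.curator.retry.RetryForever;
>>
>>     // RetryForever never gives up; it only sleeps between attempts, which is
>>     // why the sleep shows up in the interrupted stack trace above.
>>     CuratorFramework client = CuratorFrameworkFactory.newClient(
>>             "localhost:2181,localhost:2182,localhost:2183",
>>             new RetryForever(1000));   // retry interval in milliseconds
>>     client.start();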
>>
>>> On Jul 30, 2021, at 8:57 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>>>
>>> Curator has a feature called "protected mode". It adds a UUID to the node name and, when there is a connection issue or other failure, it tries to find the node it created (see here:
>>> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L601).
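>>>
>>> For illustration, the create that the lock recipe performs is roughly the following (a simplified sketch of StandardLockInternalsDriver.createsTheLock; the base path matches the log further down):
>>>
>>>     import org.apache.curator.framework.CuratorFramework;
>>>     import org.apache.zookeeper.CreateMode;
>>>
>>>     // withProtection() prepends "_c_<UUID>-" to the node name so that, after a
>>>     // connection loss, the client can search the parent for its UUID and find
>>>     // (or delete) a node whose create response never arrived.
>>>     String ourPath = client.create()
>>>             .creatingParentContainersIfNeeded()
>>>             .withProtection()
>>>             .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
>>>             .forPath("/MyApp/MyLeaderKey/lock-");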
>>> I wonder why this mechanism is getting defeated. It would be nice to get a test simulation that reproduces this. It's possible that the retry policy is expiring and FindAndDeleteProtectedNodeInBackground is giving up and rethrowing the exception. What is your retry policy?
>>>
>>> -Jordan
>>>
>>>> On Jul 30, 2021, at 12:29 AM, H S <hsbugrepo...@icloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> While using the LeaderSelector recipe I noticed what appears to be an issue where, under some circumstances during zookeeper failover or network issues, orphaned ephemeral nodes are created, resulting in no leader being elected for a cluster. I have reproduced this issue in versions 5.1.0 and 5.2.0.
>>>>
>>>> What appears to be happening is the following:
>>>> 1) Node A attempts to acquire the interprocess lock.
>>>> 2) It attempts to create its ephemeral node by calling StandardLockInternalsDriver.createsTheLock.
>>>> 3) The zookeeper client issues the request to the zookeeper server.
>>>> 4) The zookeeper server creates the ephemeral node.
>>>> 5) While the response is being returned from the server to the zookeeper client, the channel is broken, resulting in an EndOfStreamException.
>>>> 6) This results in an unhandled ConnectionLossException propagating all the way up the LeaderSelector.internalRequeue call stack, killing the submitted task without deleting the created ephemeral node.
>>>> 7) The zookeeper session for the client is still valid, so the ephemeral node remains orphaned indefinitely.
>>>> 8) During all subsequent requeue attempts the orphaned node is a predecessor of all other nodes and is treated as if it is the leader; however, it is not actually running, because it errored out before calling the selector listener.
>>>> 9) Currently the only way to resolve the issue appears to be to check the number of participants around any failover and, if more participants are listed than there are nodes, close the framework session associated with the extra participant to invalidate its session and delete the orphaned node (a rough sketch of that check follows).
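>>>>
>>>> A minimal sketch of the detection half of that workaround (expectedClusterSize and suspectClient are hypothetical names for state the deployment would have to track):
>>>>
>>>>     import java.util.Collection;
>>>>     import org.apache.curator.framework.recipes.leader.Participant;
>>>>
>>>>     // getParticipants() throws Exception; handle as appropriate.
>>>>     Collection<Participant> participants = leaderSelector.getParticipants();
>>>>     if (participants.size() > expectedClusterSize) {
>>>>         // One "participant" is really an orphaned ephemeral node. Closing the
>>>>         // CuratorFramework whose session owns the orphan invalidates that
>>>>         // session, and ZooKeeper then deletes the node.
>>>>         suspectClient.close();
>>>>     }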
>>>>
>>>> I have recreated the issue by repeatedly restarting the nodes in a zookeeper cluster to simulate failover until the orphaned nodes can be seen using 'echo dump | nc zookeeperHost zookeeperPort'.
>>>>
>>>> I turned on debugging when reproducing the issue, and below is the sample log from the IO error and the associated uncaught thread exception.
>>>>
>>>> 2021-07-29 22:23:50.367-0500 WARN APP= COMP= (localhost:2182) ClientCnxn Session 0x200086ade230000 for sever localhost/0:0:0:0:0:0:0:1:2183, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
>>>> org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x200086ade230000, likely server has closed socket
>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
>>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
>>>> 2021-07-29 22:23:50.475-0500 DEBUG APP= COMP= LeaderSelector-1 RetryLoopImpl Retry-able exception received
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /MyApp/MyLeaderKey/_c_15e65f7d-4f53-4213-a915-16d3aa318c90-lock-
>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>>>>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1837)
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) [109:curator-client:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:231) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:242) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:93) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:412) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:483) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:66) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:247) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:241) [111:curator-recipes:5.2.0]
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
>>>>     at java.lang.Thread.run(Thread.java:748) [?:?]
>>>>
>>>> Thanks.
>>>>
>>>> -hs