I didn't try adding the extra delete to see if it had an impact, but I created some docker containers that can be used to replicate the issue, if that helps. Sometimes it takes a fair number of restarts of the zookeeper containers to trigger it, but I have always been able to reproduce it by restarting the zookeeper instances enough times. You can find the source for building/running the docker containers here: https://github.com/hsbugreports/zookeeper_test
Thanks.
-hs

> On Jul 30, 2021, at 2:32 PM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>
> A quick thought... I wonder if we need to add a guaranteed delete right here for the node, just in case:
> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L625
>
>> On Jul 30, 2021, at 2:29 PM, H S <hsbugrepo...@icloud.com> wrote:
>>
>> Hi,
>>
>> Thanks for the quick response. On the call to CuratorFrameworkFactory.newClient I am specifying RetryForever as the retry policy, if that is the policy you are referring to. I did notice that the retry loop was also occasionally getting interrupted (example stack trace below), but that seemed to happen in a create call, not a delete; perhaps something similar is happening in the delete scenario and never getting logged. I have also noticed that sometimes when the issue occurs I do not see that original debug stack trace, so I think there may be a couple of possible races causing it. Maybe I'll try to put together a docker compose image or something that simulates what I was doing to create the issue, so that it can be independently verified.
>>
>> Thanks.
>> -hs
>>
>> java.lang.InterruptedException: sleep interrupted
>>     at java.lang.Thread.sleep(Native Method) ~[?:?]
>>     at java.lang.Thread.sleep(Thread.java:340) ~[?:?]
>>     at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) ~[?:?]
>>     at org.apache.curator.RetryLoopImpl.lambda$static$0(RetryLoopImpl.java:39) ~[109:curator-client:5.2.0]
>>     at org.apache.curator.retry.RetryForever.allowRetry(RetryForever.java:50) [109:curator-client:5.2.0]
>>     at org.apache.curator.RetryLoopImpl.takeException(RetryLoopImpl.java:76) [109:curator-client:5.2.0]
>>     at org.apache.curator.connection.ThreadLocalRetryLoop$WrappedRetryLoop.takeException(ThreadLocalRetryLoop.java:115) [109:curator-client:5.2.0]
>>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:99) [109:curator-client:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) [110:curator-framework:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:231) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:242) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:93) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:412) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:483) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:66) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:247) [111:curator-recipes:5.2.0]
>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:241) [111:curator-recipes:5.2.0]
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
>>     at java.lang.Thread.run(Thread.java:748) [?:?]
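>>
>> For reference, the client construction looks roughly like this (a minimal sketch; the three-node connect string and the one-second retry interval are illustrative, not my exact values):
>>
>>     import org.apache.curator.framework.CuratorFramework;
>>     import org.apache.curator.framework.CuratorFrameworkFactory;
>>     import org.apache.curator.retry.RetryForever;
>>
>>     // RetryForever never gives up; it only sleeps between attempts, which is
>>     // why the sleep shows up in the interrupted stack trace above.
>>     CuratorFramework client = CuratorFrameworkFactory.newClient(
>>             "localhost:2181,localhost:2182,localhost:2183",
>>             new RetryForever(1000));   // retry interval in milliseconds
>>     client.start();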
>>
>>> On Jul 30, 2021, at 8:57 AM, Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
>>>
>>> Curator has a feature called "protected mode". It adds a UUID to the node name and, when there is a connection issue or other failure, it tries to find the node it created (see here:
>>> https://github.com/apache/curator/blob/master/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L601).
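>>>
>>> For illustration, the create that the lock recipe performs is roughly the following (a simplified sketch of StandardLockInternalsDriver.createsTheLock; the base path matches the log further down):
>>>
>>>     import org.apache.curator.framework.CuratorFramework;
>>>     import org.apache.zookeeper.CreateMode;
>>>
>>>     // withProtection() prepends "_c_<UUID>-" to the node name so that, after a
>>>     // connection loss, the client can search the parent for its UUID and find
>>>     // (or delete) a node whose create response never arrived.
>>>     String ourPath = client.create()
>>>             .creatingParentContainersIfNeeded()
>>>             .withProtection()
>>>             .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
>>>             .forPath("/MyApp/MyLeaderKey/lock-");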
>>> I wonder why this mechanism is getting defeated. It would be nice to get a test simulation that reproduces this. It's possible that the retry policy is expiring and FindAndDeleteProtectedNodeInBackground is giving up and rethrowing the exception. What is your retry policy?
>>>
>>> -Jordan
>>>
>>>> On Jul 30, 2021, at 12:29 AM, H S <hsbugrepo...@icloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> While using the LeaderSelector recipe I noticed what appears to be an issue where, under some circumstances during zookeeper failover or network issues, orphaned ephemeral nodes are created, resulting in no leader being elected for a cluster. I have reproduced this issue in versions 5.1.0 and 5.2.0.
>>>>
>>>> What appears to be happening is the following:
>>>> 1) Node A attempts to acquire the interprocess lock.
>>>> 2) It attempts to create its ephemeral node by calling StandardLockInternalsDriver.createsTheLock.
>>>> 3) The zookeeper client issues the request to the zookeeper server.
>>>> 4) The zookeeper server creates the ephemeral node.
>>>> 5) While the response is being returned from the server to the zookeeper client, the channel is broken, resulting in an EndOfStreamException.
>>>> 6) This results in an unhandled ConnectionLossException propagating all the way up the LeaderSelector.internalRequeue call stack, killing the submitted task without deleting the created ephemeral node.
>>>> 7) The zookeeper session for the client is still valid, so the ephemeral node remains orphaned indefinitely.
>>>> 8) During all subsequent requeue attempts the orphaned node is a predecessor of all other nodes and is treated as if it is the leader; however, it is not actually running, because it errored out before calling the selector listener.
>>>> 9) Currently the only way to resolve the issue appears to be to check the number of participants around any failover and, if more participants are listed than there are nodes, close the framework session associated with the extra participant to invalidate its session and delete the orphaned node (a rough sketch of that check follows).
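>>>>
>>>> A minimal sketch of the detection half of that workaround (expectedClusterSize and suspectClient are hypothetical names for state the deployment would have to track):
>>>>
>>>>     import java.util.Collection;
>>>>     import org.apache.curator.framework.recipes.leader.Participant;
>>>>
>>>>     // getParticipants() throws Exception; handle as appropriate.
>>>>     Collection<Participant> participants = leaderSelector.getParticipants();
>>>>     if (participants.size() > expectedClusterSize) {
>>>>         // One "participant" is really an orphaned ephemeral node. Closing the
>>>>         // CuratorFramework whose session owns the orphan invalidates that
>>>>         // session, and ZooKeeper then deletes the node.
>>>>         suspectClient.close();
>>>>     }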
>>>>
>>>> I have recreated the issue by repeatedly restarting the nodes in a zookeeper cluster to simulate failover until the orphaned nodes can be seen using 'echo dump | nc zookeeperHost zookeeperPort'.
>>>>
>>>> I turned on debugging when reproducing the issue, and below is the sample log from the IO error and the associated uncaught thread exception.
>>>>
>>>> 2021-07-29 22:23:50.367-0500 WARN APP= COMP= (localhost:2182) ClientCnxn Session 0x200086ade230000 for sever localhost/0:0:0:0:0:0:0:1:2183, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
>>>> org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x200086ade230000, likely server has closed socket
>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
>>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
>>>> 2021-07-29 22:23:50.475-0500 DEBUG APP= COMP= LeaderSelector-1 RetryLoopImpl Retry-able exception received
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /MyApp/MyLeaderKey/_c_15e65f7d-4f53-4213-a915-16d3aa318c90-lock-
>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>>>>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1837)
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) [109:curator-client:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) [110:curator-framework:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:231) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:242) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:93) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:412) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:483) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:66) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:247) [111:curator-recipes:5.2.0]
>>>>     at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:241) [111:curator-recipes:5.2.0]
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
>>>>     at java.lang.Thread.run(Thread.java:748) [?:?]
>>>>
>>>> Thanks.
>>>>
>>>> -hs