Hi Curator team,


We have three retry-related questions.


1.       We're trying to decide which retry policy to set. Our desired 
behavior is to retry until the operation succeeds, with an exponential back-off 
capped at a maximum wait. However, the current ExponentialBackoffRetry 
implementation doesn't allow an unbounded number of retries. I've found the 
change[1] that added a maximum number of retries to ExponentialBackoffRetry, 
but it suggests the reason was integer overflow. I'm happy to write my own 
policy, but do you know of any reason not to allow an unbounded number of retries?


2.       Another issue we've faced is that our users might not always set the 
ACL entries correctly on the nodes, and because of this they receive NOAUTH 
errors. We're using the PersistentEphemeralNode and PathChildrenCache recipes, 
and the behavior we'd like is to retry (with an exponential back-off) until the 
ACLs are corrected. However, neither of these recipes retries on a NOAUTH 
error.



A possible solution would be to configure the CuratorFramework to retry on the 
NOAUTH code, but the retriable result codes are hard-coded in RetryLoop. As a 
feature request, could the retriable result codes be made configurable via the 
CuratorFramework?
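For illustration, this is roughly how we imagine configuring it. The 
retriableResultCodes(...) builder method below is hypothetical and does not 
exist today; it is the feature being requested (imports assume the 
org.apache.curator package names):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RetriableCodesExample
{
    public static void main(String[] args)
    {
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("zk-host:2181")                       // placeholder connect string
                .retryPolicy(new ExponentialBackoffRetry(1000, 29))
                // .retriableResultCodes(KeeperException.Code.NOAUTH) // proposed, does not exist
                .build();
        client.start();
    }
}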



The solution we've tried is to add a new field to CuratorFrameworkImpl, a Set 
of KeeperException.Code values, initialized through the builder. In the retry 
condition in CuratorFrameworkImpl#processBackgroundOperation we also test 
whether the result code is in this Set. This way we're able to retry with an 
exponential back-off on NOAUTH result codes.
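Roughly, the extra check amounts to the following, written here as a 
stand-alone helper for illustration; additionalRetriableCodes is the new 
Set<KeeperException.Code> field we initialize from the builder, not an existing 
Curator field:

import java.util.Set;
import org.apache.curator.RetryLoop;
import org.apache.zookeeper.KeeperException;

public final class RetryCheck
{
    // true if Curator would normally retry this result code, or if the user
    // explicitly listed it as retriable (e.g. NOAUTH in our case)
    public static boolean shouldRetry(int rc, Set<KeeperException.Code> additionalRetriableCodes)
    {
        return RetryLoop.shouldRetry(rc)
            || additionalRetriableCodes.contains(KeeperException.Code.get(rc));
    }
}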


3.       During my investigation of the retry policy I noticed that the 
SharedValue recipe reads the value of the node synchronously when a watch 
event is triggered. However, it doesn't check the keeper state and sends the 
request even when the state is "Disconnected". This blocks the ZooKeeper 
event thread until the request's retries are exhausted, which could take quite 
long depending on the retry policy in use, and it delays the delivery of the 
disconnect event to other listeners. I think in this case the request should 
not be sent while disconnected and should instead be sent when the reconnect 
event arrives, or the read should be done asynchronously.
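To illustrate the asynchronous variant, a watcher along these lines would avoid 
blocking the event thread. This is only a sketch of the idea, not the actual 
SharedValue code; the class name and callback body are ours:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.BackgroundCallback;
import org.apache.curator.framework.api.CuratorEvent;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class NonBlockingValueWatcher implements Watcher
{
    private final CuratorFramework client;
    private final String path;

    public NonBlockingValueWatcher(CuratorFramework client, String path)
    {
        this.client = client;
        this.path = path;
    }

    @Override
    public void process(WatchedEvent event)
    {
        if ( event.getState() != Watcher.Event.KeeperState.SyncConnected )
        {
            return;   // skip the read; re-read when the reconnect event arrives
        }
        try
        {
            // background (async) read: the ZooKeeper event thread is not held
            // up by the retry loop if the request fails
            client.getData().usingWatcher(this).inBackground(new BackgroundCallback()
            {
                @Override
                public void processResult(CuratorFramework client, CuratorEvent curatorEvent)
                {
                    // update the cached value from curatorEvent.getData() here
                }
            }).forPath(path);
        }
        catch ( Exception e )
        {
            // log and move on; the next watch event will trigger another read
        }
    }
}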



Any advice is appreciated.



Kind regards,

Zoltan Szekeres



[1] https://github.com/Netflix/curator/commit/3c1b1b4dbf256e318b803e7bbcc2a3dcd2b88619


