Hi,

With your suggestion, the following scenario seems possible: master A is about to write value X to an external system, so it logs the intent to ZK, then freezes for some time. ZK suspects it has failed and another master B is elected; B writes X (completing what A wanted to do), then starts doing something else and writes Y. Then leader A "wakes up" and re-executes the old operation, writing X, which is now stale.
Perhaps if your external system supports conditional updates this can be avoided - a write of X only works if the current state is as expected (a rough sketch of this follows the quoted thread below).

Alex

On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <[email protected]> wrote:
> Thanks for the replies everyone, most of it was very useful.
>
> @Alexander: The section of the Chubby paper you pointed me to addresses exactly what I was looking for. That clearly is one good way of doing it. I'm also thinking of an alternative way and could use a review or some feedback on that.
>
> @Powell: About x509 auth on intra-cluster communication, I don't have a blocking need for it, as it can be achieved by setting up firewall rules to accept only from desired hosts. It may be a good-to-have thing though, because in cloud-based scenarios where IP addresses are re-used, a recycled IP can still talk to a secure zk-cluster unless config is changed to remove the old peer IP and replace it with the new one. It's clearly a corner case though.
>
> Here is the approach I'm thinking of:
> - Implement all operations (at least master-triggered operations) on operand machines idempotently.
> - Have the master journal these operations to ZK before issuing the RPC.
> - In case the master fails with some of these operations in flight, the newly elected master will need to read all issued (but not yet retired) operations and issue them again.
> - The existing master (before or after failure) can retry and retire operations according to whatever the retry policy and success criterion is.
>
> Why am I thinking of this as opposed to going with Chubby-style sequencer passing:
> - I need to implement idempotency regardless, because the recovery path involving master death after successful execution of an operation but before writing the ack to the coordination service requires it. So the idempotent implementation complexity is here to stay.
> - Sequencer validation increases the surface area of the architecture that is exposed to the coordination service, which may bring a verification RPC into the data plane in some cases.
> - The sequencer may expire after verification but before the ack, in which case the new master may not recognize the operation as consistent with its decisions (or previous decision path).
>
> Thoughts? Suggestions?
>
> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <[email protected]> wrote:
> > regarding atomic multi-znode updates -- check out "multi" updates
> > <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>.
> >
> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <[email protected]> wrote:
> >
> >> for 1, see the chubby paper
> >> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
> >> section 2.4.
> >> for 2, I'm not sure I fully understand the question, but essentially, ZK guarantees that consistency of updates is preserved even during failures. The user doesn't need to do anything in particular to guarantee this, even during leader failures. In such a case, some suffix of operations executed by the leader may be lost if they weren't previously acked by a majority. However, none of these operations could have been visible to reads.
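As a rough illustration of the conditional-update idea (a sketch only, not something from the thread): ZooKeeper's own setData takes an expected version, so a master can refuse to overwrite a journal znode whose latest state it has not seen. The class and method names below are made up; a conditional write of X against the external system would follow the same shape, just through that system's API.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Sketch: overwrite a journal znode only if nobody else has written it since
    // we last read it. A stale master that lost leadership fails the version
    // check instead of silently clobbering the new master's state.
    public class ConditionalJournalWrite {
        static boolean writeIfUnchanged(ZooKeeper zk, String opPath, byte[] newOp)
                throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            zk.getData(opPath, false, stat);                  // remember the version we saw
            try {
                zk.setData(opPath, newOp, stat.getVersion()); // conditional update
                return true;
            } catch (KeeperException.BadVersionException e) {
                return false;                                 // someone else wrote in between
            }
        }
    }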
> >>
> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <[email protected]> wrote:
> >>
> >>> Hi Janmejay,
> >>>
> >>> Regarding question 1, if a node takes a lock and the lock has timed out from the system's perspective, that means other nodes are free to take the lock and work on the resource. Hence the history could be well into the future by the time the previous node discovers the timeout. The question of rollback in this specific context depends on the implementation details: is the lock holder updating some common area? Then there could be corruption, since other nodes are free to write in parallel with the first one. In the usual sense, a timeout of a held lock means the node which held the lock is dead. It is up to the implementation to ensure this, and, using this primitive, if there is a timeout the other nodes can be sure that no one else is working on the resource and hence can move forward.
> >>>
> >>> Question 2 seems to assume that the leader has significant work to do and that leader change is quite common, which seems contrary to the common implementation pattern. If the work can be broken down into smaller chunks which each need serialization separately, then each chunk/work type can have a different leader.
> >>>
> >>> For question 3, ZK does support auth and encryption for client connections but not for inter-ZK-node channels. Do you have a requirement to secure inter-ZK-node traffic? Can you let us know what your requirements are so we can implement a solution to fit all needs?
> >>>
> >>> For question 4, the official implementation is in C; people tend to wrap that with C++, and there should be projects that use ZK doing that. You can look them up and see if you can separate the wrapper out and use it.
> >>>
> >>> Hope this helps.
> >>> Powell.
> >>>
> >>>
> >>> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <[email protected]> wrote:
> >>>
> >>> Q: What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires.
> >>>
> >>> If you are using Java, a way I can see of doing this is by using the ExecutorCompletionService
> >>> <https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html>.
> >>> It allows you to keep your workers in a group; you can poll the group and provide cancel semantics as needed. An example of that service is here:
> >>> <https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java>
> >>> where I am issuing multiple reads and I want to abandon the process if they do not finish in a while. Many async/promise frameworks do this by launching two tasks, a ComputationTask and a TimeoutTask that returns in 10 seconds, and then asking the completion service to poll. If the service hands back the TimeoutTask first, that means the computation did not finish in time.
> >>>
> >>> Do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)?
> >>>
> >>> The base issue is that Java's synchronized / AtomicReference do not have a rollback.
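A minimal sketch of the two-task race described just above, assuming Java 8+; the task bodies, result strings, and the 10-second budget are illustrative stand-ins for the real computation guarded by the lock.

    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Race the real work against a sentinel timeout task and act on whichever
    // one the completion service hands back first.
    public class TimeoutRace {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            CompletionService<String> cs = new ExecutorCompletionService<>(pool);

            Future<String> work = cs.submit(() -> {
                Thread.sleep(2_000);           // stand-in for the guarded computation
                return "RESULT";
            });
            cs.submit(() -> {
                Thread.sleep(10_000);          // the timeout budget
                return "TIMEOUT";
            });

            String first = cs.take().get();    // whichever task finishes first
            if ("TIMEOUT".equals(first)) {
                work.cancel(true);             // abandon the computation
            }
            pool.shutdownNow();
        }
    }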
> >>>
> >>> There are a few ways I know of to work around the lack of rollback. Clojure has STM (Software Transactional Memory), such that if an exception is thrown inside a dosync, all of the steps inside the critical block never happened. This assumes you're using all Clojure structures, which you probably are not.
> >>>
> >>> A way co-workers have done this is as follows: move your entire transactional state into a SINGLE big object that you can copy/mutate/compare-and-swap. You never need to roll back each piece, because you're changing the clone up until the point you commit it. (A minimal sketch of this appears at the end of the thread below.)
> >>>
> >>> Writing reversal code is possible depending on the problem. There are questions to ask like "what if the reversal somehow fails?"
> >>>
> >>>
> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <[email protected]> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Was wondering if there are any reference designs or patterns for handling common operations involving distributed coordination.
> >>> >
> >>> > I have a few questions, and I guess they must have been asked before; I am unsure what to search for to surface the right answers. It'll be really valuable if someone can provide links to a relevant "best-practices guide" or "suggestions" per question, or share some wisdom or ideas on patterns to do this in the best way.
> >>> >
> >>> > 1. What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires. When it finishes that computation, it can tell that the lock expired, but do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)? Or is the right thing to do to write reversal code, such that operations can be cleanly undone in case the verification at the end of the computation shows that the lock expired? The latter is obviously a lot harder to achieve.
> >>> >
> >>> > 2. Same as above for leader-election scenarios. A leader generally administers operations on data systems that take significant time to complete and have significant resource overhead, and RPCs to administer such operations synchronously from leader to data node can't be atomic and can't be made latency-resilient to such a degree that issuing an operation across a large set of nodes in a cluster can be guaranteed to finish without a leader change. What do people generally do in such situations? How are timeouts handled for operations issued using a sequential znode as a per-datanode dedicated queue? How well does it scale, and what are some things to watch out for (operation size, encoding, clustering into one znode for atomicity, etc.)? Or how are atomic operations that need to be issued across multiple data nodes managed (do they have to be clobbered into one znode)?
> >>> >
> >>> > 3. How do people secure ZooKeeper-based services? Is client-certificate verification the recommended way? How well does this work with the C client? Is inter-zk-node communication done with X509 auth too?
> >>> >
> >>> > 4. What other projects, reference implementations, or libraries should I look at for working with the C client?
> >>> >
> >>> > Most of what I have asked revolves around the leader or lock-owner having a false failure (where it doesn't know that the coordinator thinks it has failed).
> >>> >
> >>> > --
> >>> > Regards,
> >>> > Janmejay
> >>> > http://codehunk.wordpress.com
> >>>
> >>
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com
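Coming back to the single-big-object compare-and-swap suggestion quoted above: a minimal Java sketch, assuming Java 16+ for the record syntax; the State fields, class names, and the version counter are made up for illustration.

    import java.util.concurrent.atomic.AtomicReference;

    // Keep the whole mutable state in one immutable snapshot and publish changes
    // with compareAndSet: all work happens on a copy, so a failed CAS leaves
    // nothing to roll back.
    public class SingleObjectState {
        record State(long version, String payload) {}    // illustrative fields only

        private final AtomicReference<State> current =
                new AtomicReference<>(new State(0, "initial"));

        boolean tryUpdate(String newPayload) {
            State old = current.get();
            State next = new State(old.version() + 1, newPayload); // mutate the copy
            return current.compareAndSet(old, next);               // commit, or do nothing
        }
    }

If the compareAndSet fails, the caller can re-read and retry, or, in the lock-expiry case, simply drop the update once it notices the lock is gone.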
