Hi,

I was wondering if there are any reference designs or patterns for handling common operations involving distributed coordination.
I have a few questions. I guess they must have been asked before, but I am unsure what to search for to surface the right answers. It would be really valuable if someone could provide links to a relevant "best-practices guide" or "suggestions" per question, or share some wisdom or ideas on patterns for doing this well.

1. What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session or the lock expires. When it finishes that computation, it can tell that the lock expired, but do people generally take action in the middle of the computation (abort it in a clever way such that the effect appears atomic and the abort is not really visible; if so, what are some of those clever ways)? Or is the right thing to write reversal code, so that operations can be cleanly undone if the verification at the end of the computation shows that the lock expired? The latter is obviously a lot harder to achieve.

2. Same as above for leader-election scenarios. The leader generally administers operations on data systems that take significant time to complete and have significant resource overhead. The RPCs that administer such operations synchronously from the leader to a data node can't be atomic, and can't be made latency-resilient to such a degree that issuing an operation across a large set of nodes in a cluster is guaranteed to finish without a leader change. What do people generally do in such situations? How are timeouts handled for operations that are issued using a sequential znode as a per-data-node dedicated queue? How well does it scale, and what are some things to watch out for (operation size, encoding, clustering into one znode for atomicity, etc.)? Or how are atomic operations that need to be issued across multiple data nodes managed (do they have to be clubbed together into one znode)?

3. How do people secure ZooKeeper-based services?
Is client-certificate verification the recommended way? How well does this work with the C client? Is inter-ZK-node communication done with X.509 auth too?

4. What other projects, reference implementations or libraries should I look at for working with the C client?

Most of what I have asked revolves around the leader or lock owner having a false failure (where it doesn't know that the coordinator thinks it has failed).

--
Regards,
Janmejay
http://codehunk.wordpress.com
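
P.S. To make the second option in question 1 concrete, here is a rough Python sketch of what I mean by reversal code: run the steps, verify the lock at the end, and undo everything in reverse order if it has expired. All names here (`Lease`, `run_with_undo`) are hypothetical and not part of any ZooKeeper API, and the lease is only a local stand-in with an assumed TTL. A real check would have to ask the coordinator, and the lock could still expire between the check and the return, which is exactly the gap I am asking about.

```python
import time


class Lease:
    """Hypothetical local stand-in for a lock or session with an expiry deadline."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + ttl_seconds

    def is_valid(self):
        # Assumption: the local clock approximates the coordinator's view.
        return self._clock() < self._deadline


def run_with_undo(lease, steps):
    """Apply each (do, undo) pair, then verify the lease once at the end.

    Returns True if the work is kept, False if it was rolled back.
    """
    undo_stack = []
    for do, undo in steps:
        do()
        undo_stack.append(undo)
    if lease.is_valid():
        return True
    # Lease expired during the computation: undo in reverse order.
    for undo in reversed(undo_stack):
        undo()
    return False
```

Each step has to come with its own undo action, which is why this option looks so much harder to me than aborting midway.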
