I may be wrong, but I don't think that being idempotent gives you what you said. Just because f(f(x)) = f(x), it doesn't follow that f(g(f(x))) = g(f(x)) -- which is exactly my scenario. But if your system can detect that X was already executed (or if the operations are conditional on state), my scenario indeed can't happen.
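To make the distinction concrete, here is a minimal Java sketch (the store, its version counter and the method names are all made up for illustration): a blind idempotent put lets the woken-up master A clobber B's later write Y, while a version-conditioned put rejects A's stale retry.

    // Hypothetical external system: a value plus a version counter (names are illustrative).
    class VersionedStore {
        private String value = "initial";
        private long version = 0;

        // Idempotent but unconditional: re-applying an old write still overwrites newer state.
        synchronized void put(String v) {
            if (!v.equals(value)) { value = v; version++; }
        }

        // Conditional update: applies only if the caller saw the current version.
        synchronized boolean putIfVersion(String v, long expectedVersion) {
            if (version != expectedVersion) return false; // stale writer is rejected
            value = v;
            version++;
            return true;
        }

        synchronized long currentVersion() { return version; }
        synchronized String get() { return value; }
    }

    public class StaleWriteDemo {
        public static void main(String[] args) {
            VersionedStore store = new VersionedStore();

            long versionSeenByA = store.currentVersion(); // master A logs intent to write X, then freezes

            store.put("X"); // master B takes over and completes X:  state = f(x)
            store.put("Y"); // B moves on and writes Y:              state = g(f(x))

            // A wakes up. A blind retry store.put("X") would clobber Y even though put is idempotent.
            // A version-conditioned retry is rejected instead:
            boolean applied = store.putIfVersion("X", versionSeenByA);
            System.out.println("stale retry applied? " + applied + ", value = " + store.get());
            // prints: stale retry applied? false, value = Y
        }
    }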
On Wed, Jan 13, 2016 at 2:08 AM, singh.janmejay <[email protected]> wrote:
> @Alexander: In that scenario, the write of X will be attempted by A, but the external system will not act upon write-X because that operation has already been acted upon in the past. This is guaranteed by the idempotent-operations invariant. But it does point out another problem, which I hadn't handled in my original algorithm. Problem: if X and Y have both not been issued yet, and Y is issued before X towards the external system, then because neither operation has executed yet, X will overwrite Y. I need another constraint: the master should only issue 1 operation on a certain external-system at a time and must issue operations in the order of operation-id (sequential-znode sequence number). So we need the following invariants:
> - order of issuing operations is fixed (matching the order of creation of operations)
> - concurrency of operations fixed to 1
> - idempotent execution on the external-system side
>
> @Powell: I'm kind of doing the same thing, except the loop doesn't run on the consumer; it runs on the master, which is assigning work to consumers. So the triggerWork function basically becomes issueWork, which is RPC + triggerWork. The replay of history is basically just a replay of 1 operation per operand-node (in this thread we are calling it external-system), so it's as if triggerWork failed, in which case we need to re-execute triggerWork. Idempotency also follows from that requirement: if triggerWork fails in the last step, and all the desired effect has already happened, we will still need to run triggerWork again, but we need awareness that the actual work has been done, which is why idempotency is necessary.
>
> Btw, thanks for continuing to spare time for this, I really appreciate this feedback/validation.
>
> Thoughts?
>
> On Wed, Jan 13, 2016 at 3:47 AM, powell molleti <[email protected]> wrote:
> > Wouldn't a distributed queue recipe work for the consumer? One needs to add extra logic, something like this:
> >
> > with lock() {
> >   p = queue.peek()
> >   if triggerWork(p) is Done:
> >     queue.pop()
> > }
> >
> > With this, a consumer that worked on an item but crashed before popping the queue would result in another consumer processing the same work.
> >
> > I am not sure of the details of where you are getting the work from and what its scale is, but producers (the leader) could write to this queue. Since there is a guarantee of read-after-write, the producer could delete from its local queue the work that was successfully queued. A new producer could then re-send the last entry of work, so one has to handle that. Without more details on the origin of the work etc. it's hard to design end to end.
> >
> > I do not see a need to do a total replay of past history when using a ZK-like system, because ZK is built on the idea of a serialized and replicated log; hence if you are using ZK then your design should be much simpler, i.e. fail and re-start from the last known transaction.
> >
> > Powell.
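For what it's worth, a minimal sketch of that master-side issue loop against ZooKeeper might look like the code below. It assumes operations are journalled as PERSISTENT_SEQUENTIAL children of a per-external-system znode (the /ops/... layout, the OpLog wrapper and the issueWork RPC are hypothetical); issuing strictly in sequence order, one operation at a time, and retiring only after the ack gives the three invariants above, provided the external system really does execute idempotently.

    import org.apache.zookeeper.*;
    import java.util.Collections;
    import java.util.List;

    // Sketch only: journal-then-issue loop run by the current master.
    public class OpLog {
        private final ZooKeeper zk;
        private final String opsPath; // e.g. "/ops/external-system-1" (hypothetical layout)

        OpLog(ZooKeeper zk, String opsPath) { this.zk = zk; this.opsPath = opsPath; }

        // Journal the operation in ZK *before* any RPC is sent.
        public String journal(byte[] op) throws KeeperException, InterruptedException {
            return zk.create(opsPath + "/op-", op,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        }

        // Issue all journalled-but-not-retired operations, oldest first, one at a time.
        // A newly elected master runs the same loop, which re-issues anything in flight;
        // that is safe only because the external system executes operations idempotently.
        public void drain() throws KeeperException, InterruptedException {
            List<String> pending = zk.getChildren(opsPath, false);
            Collections.sort(pending); // sequence numbers give creation order
            for (String node : pending) {
                String path = opsPath + "/" + node;
                byte[] op = zk.getData(path, false, null);
                issueWork(op);           // RPC + triggerWork on the external system
                zk.delete(path, -1);     // retire only after the RPC is acked
            }
        }

        // Hypothetical RPC to the external system; must be idempotent on that side.
        private void issueWork(byte[] op) { /* ... */ }
    }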
> > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <[email protected]> wrote:
> > Hi,
> >
> > With your suggestion, the following scenario seems possible: master A is about to write value X to an external system, so it logs it to ZK, then freezes for some time; ZK suspects it as failed, another master B is elected, writes X (completing what A wanted to do), then starts doing something else and writes Y. Then leader A "wakes up" and re-executes the old operation, writing X, which is now stale.
> >
> > Perhaps if your external system supports conditional updates this can be avoided - a write of X only works if the current state is as expected.
> >
> > Alex
> >
> > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <[email protected]> wrote:
> >> Thanks for the replies everyone, most of it was very useful.
> >>
> >> @Alexander: The section of the chubby paper you pointed me to tries to address exactly what I was looking for. That is clearly one good way of doing it. I'm also thinking of an alternative way and could use a review or some feedback on that.
> >>
> >> @Powell: About x509 auth on intra-cluster communication, I don't have a blocking need for it, as it can be achieved by setting up firewall rules to accept only from desired hosts. It may be a good-to-have thing though, because in cloud-based scenarios where IP addresses are re-used, a recycled IP can still talk to a secure zk-cluster unless the config is changed to remove the old peer IP and replace it with the new one. It's clearly a corner-case though.
> >>
> >> Here is the approach I'm thinking of:
> >> - Implement all operations (at least master-triggered operations) on operand machines idempotently.
> >> - Have the master journal these operations to ZK before issuing the RPC.
> >> - In case the master fails with some of these operations in flight, the newly elected master will need to read all issued (but not yet retired) operations and issue them again.
> >> - The existing master (before failure or after failure) can retry and retire operations according to whatever the retry policy and success-criterion is.
> >>
> >> Why am I thinking of this as opposed to going with chubby sequencer passing:
> >> - I need to implement idempotency regardless, because the recovery path involving master-death after successful execution of an operation but before writing the ack to the coordination service requires it. So the idempotent-implementation complexity is here to stay.
> >> - I would need to increase the surface-area of the architecture that is exposed to the coordination-service for sequencer validation, which may bring a verification RPC into the data-plane in some cases.
> >> - The sequencer may expire after verification but before the ack, in which case the new master may not recognize the operation as consistent with its decisions (or previous decision path).
> >>
> >> Thoughts? Suggestions?
> >>
> >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <[email protected]> wrote:
> >> > regarding atomic multi-znode updates -- check out "multi" updates
> >> > <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>.
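As a side note, the multi API referenced above bundles several znode operations into one atomic transaction; a small hedged example (paths, versions and payloads are all made up) could look like this:

    import org.apache.zookeeper.*;
    import java.util.Arrays;

    public class MultiExample {
        // Atomically check leadership, retire an issued operation and record its result:
        // either every Op in the list is applied, or none are. Paths and versions are made up.
        static void retireAtomically(ZooKeeper zk) throws KeeperException, InterruptedException {
            zk.multi(Arrays.asList(
                    Op.check("/masters/current", 3),                        // fail if the master znode changed
                    Op.delete("/ops/external-system-1/op-0000000042", -1),  // retire the issued operation
                    Op.create("/results/op-0000000042", "ok".getBytes(),
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
            // If any check or operation fails, multi() throws KeeperException and nothing is applied.
        }
    }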
> >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <[email protected]> wrote:
> >> >> for 1, see the chubby paper
> >> >> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
> >> >> section 2.4.
> >> >> for 2, I'm not sure I fully understand the question, but essentially, ZK guarantees that consistency of updates is preserved even during failures. The user doesn't need to do anything in particular to guarantee this, even during leader failures. In such a case, some suffix of the operations executed by the leader may be lost if they weren't previously acked by a majority. However, none of these operations could have been visible to reads.
> >> >>
> >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <[email protected]> wrote:
> >> >>> Hi Janmejay,
> >> >>>
> >> >>> Regarding question 1, if a node takes a lock and the lock has timed out from the system's perspective, then other nodes are free to take the lock and work on the resource. Hence the history could be well into the future by the time the previous node discovers the time-out. The question of rollback in this specific context depends on the implementation details: is the lock holder updating some common area? Then there could be corruption, since other nodes are free to write in parallel with the first one. In the usual sense, a time-out of a held lock means the node which held the lock is dead. It is up to the implementation to ensure this, and, using this primitive, a timeout means other nodes can be sure that no one else is working on the resource and hence can move forward.
> >> >>>
> >> >>> Question 2 seems to imply the assumption that the leader has significant work to do and that leader change is quite common, which seems contrary to the common implementation pattern. If the work can be broken down into smaller chunks which need serialization separately, then each chunk/work type can have a different leader.
> >> >>>
> >> >>> For question 3, ZK does support auth and encryption for client connections but not for inter-ZK-node channels. Do you have a requirement to secure inter-ZK-node channels? Can you let us know what your requirements are so we can implement a solution to fit all needs?
> >> >>>
> >> >>> For question 4, the official implementation is C; people tend to wrap that with C++, and there should be projects that use ZK doing that. You can look them up and see if you can separate it out and use them.
> >> >>>
> >> >>> Hope this helps.
> >> >>> Powell.
> >> >>>
> >> >>> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <[email protected]> wrote:
> >> >>> Q: What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires.
> >> >>>
> >> >>> If you are using Java, a way I can see doing this is by using the ExecutorCompletionService
> >> >>> <https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html>.
> >> >>> It allows you to keep your workers in a group; you can poll the group and provide cancel semantics as needed. An example of that service is here:
> >> >>> <https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java>
> >> >>> where I am issuing multiple reads and I want to abandon the process if they do not finish in a while. Many async/promise frameworks do this by launching two tasks, a ComputationTask and a TimeoutTask that returns in 10 seconds. Then they ask the completion service to poll. If the service is given the TimeoutTask after the timeout, that means the Computation did not finish in time.
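A small sketch of that poll-with-deadline pattern (the task and timeout values are illustrative, not a prescription): submit the real work to an ExecutorCompletionService and poll it with a deadline; if nothing completes in time, treat the lock/session as possibly expired and cancel the in-flight work.

    import java.util.concurrent.*;

    public class TimedComputation {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            ExecutorCompletionService<String> ecs = new ExecutorCompletionService<>(pool);

            // The real work; in the lock scenario this runs while the ZK session is held.
            Future<String> computation = ecs.submit(() -> {
                Thread.sleep(15_000); // pretend this is the long computation
                return "computation-result";
            });

            // Poll with a deadline instead of blocking forever.
            Future<String> done = ecs.poll(10, TimeUnit.SECONDS);
            if (done == null) {
                // Nothing finished within the timeout: assume the lock/session may have expired,
                // cancel the in-flight work and let the recovery path (re-issue, idempotent ops) run.
                computation.cancel(true);
                System.out.println("timed out; aborting computation");
            } else {
                try {
                    System.out.println("finished: " + done.get());
                } catch (ExecutionException e) {
                    System.out.println("computation failed: " + e.getCause());
                }
            }
            pool.shutdownNow();
        }
    }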
> >> >>>
> >> >>> Do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible, and if so what are some of those clever ways)?
> >> >>>
> >> >>> The base issue is that Java's synchronized / AtomicReference do not have a rollback.
> >> >>>
> >> >>> There are a few ways I know to work around this. Clojure has STM (Software Transactional Memory), such that if an exception is thrown inside a dosync, none of the changes inside the critical block ever happened. This assumes you're using all Clojure structures, which you are probably not.
> >> >>> A way co-workers have done this is as follows: move your entire transactional state into a SINGLE big object that you can copy/mutate/compare-and-swap. You never need to roll back each piece because you're changing the clone up until the point you commit it.
> >> >>> Writing reversal code is possible depending on the problem. There are questions to ask like "what if the reversal somehow fails?"
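A minimal sketch of that "single big object" approach (the State class and its fields are made up): keep all mutable state behind one AtomicReference, mutate a private copy, and publish it with compareAndSet, so an abort simply means never publishing.

    import java.util.concurrent.atomic.AtomicReference;

    public class BigObjectCas {
        // All transactional state lives in one immutable snapshot (fields are illustrative).
        static final class State {
            final long appliedOpId;
            final String lastResult;
            State(long appliedOpId, String lastResult) {
                this.appliedOpId = appliedOpId;
                this.lastResult = lastResult;
            }
        }

        private final AtomicReference<State> current = new AtomicReference<>(new State(0, ""));

        // Work on a private copy; nothing is visible until the final compareAndSet.
        boolean apply(long opId, String result) {
            State before = current.get();
            State after = new State(opId, result); // "mutate the clone"
            // If leadership or the lock turned out to be lost, simply return without publishing:
            // no rollback is needed because 'before' was never modified.
            return current.compareAndSet(before, after);
        }
    }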
> >> >>>
> >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <[email protected]> wrote:
> >> >>> > Hi,
> >> >>> >
> >> >>> > Was wondering if there are any reference designs or patterns on handling common operations involving distributed coordination.
> >> >>> >
> >> >>> > I have a few questions, and I guess they must have been asked before; I am unsure what to search for to surface the right answers. It'll be really valuable if someone can provide links to a relevant "best-practices guide" or "suggestions" per question, or share some wisdom or ideas on patterns to do this in the best way.
> >> >>> >
> >> >>> > 1. What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session or lock expires. When it finishes that computation, it can tell that the lock expired, but do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)? Or is the right thing to do to write reversal-code, such that operations can be cleanly undone in case the verification at the end of the computation shows that the lock expired? The latter is obviously a lot harder to achieve.
> >> >>> >
> >> >>> > 2. Same as above for leader-election scenarios. A leader generally administers operations on data-systems that take significant time to complete and have significant resource overhead, and RPC to administer such operations synchronously from leader to data-node can't be atomic and can't be made latency-resilient to such a degree that issuing an operation across a large set of nodes in a cluster can be guaranteed to finish without a leader-change. What do people generally do in such situations? How are timeouts handled for operations issued using a sequential-znode as a per-datanode dedicated queue? How well does it scale, and what are some things to watch out for (operation-size, encoding, clustering into one znode for atomicity, etc.)? Or how are atomic operations that need to be issued across multiple data-nodes managed (do they have to be clobbered into one znode)?
> >> >>> >
> >> >>> > 3. How do people secure zookeeper-based services? Is client-certificate verification the recommended way? How well does this work with the C client? Is inter-zk-node communication done with X509-auth too?
> >> >>> >
> >> >>> > 4. What other projects, reference-implementations or libraries should I look at for working with the C client?
> >> >>> >
> >> >>> > Most of what I have asked revolves around the leader or lock-owner having a false-failure (where it doesn't know that the coordinator thinks it has failed).
> >> >>> >
> >> >>> > --
> >> >>> > Regards,
> >> >>> > Janmejay
> >> >>> > http://codehunk.wordpress.com
> >>
> >> --
> >> Regards,
> >> Janmejay
> >> http://codehunk.wordpress.com
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com
