@Martin: Having the master react to achieve operation-timeout makes sense; I never thought of it that way, and it seems really effective. I can't see it failing in any scenario. The only case where it has a (solvable) problem is when the lock-owner dies and never comes back, in which case it will need manual assistance (someone confirming that the lock-owner is dead for good) or a very high-level control, like an API to kill the machine which is supposed to be running the lock-owner.

About work distribution: I have a stateful set of nodes, so the traditional queue model, which assumes all workers are equal, doesn't work. A dedicated queue per worker would, but that still exposes a larger surface area of the architecture to the coordination service and increases the scaling burden on ZooKeeper. It doesn't really simplify the idempotency requirement anyway, so it's equal to RPC-triggered work in implementation complexity.
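Just to make the idempotency requirement concrete (since it keeps coming up), this is roughly what I have in mind on the operand/worker side. It is a minimal sketch only: Operation, applyToExternalSystem and lastAppliedId are illustrative names, not from any real library. The worker remembers the highest operation-id it has applied, so a replayed triggerWork for an already-applied operation becomes a no-op. The master-side counterpart (journaling and replay) is sketched further down in the thread.

    // Sketch: a worker that makes triggerWork idempotent by tracking the last
    // operation-id it has applied. In a real node lastAppliedId would be durable.
    import java.util.concurrent.atomic.AtomicLong;

    public class IdempotentWorker {

        /** Operation-id = sequence number of the sequential znode the master journaled. */
        public static final class Operation {
            final long id;
            final byte[] payload;
            public Operation(long id, byte[] payload) { this.id = id; this.payload = payload; }
        }

        private final AtomicLong lastAppliedId = new AtomicLong(-1);

        /** Called by the master over RPC; safe to call any number of times per operation. */
        public boolean triggerWork(Operation op) {
            if (op.id <= lastAppliedId.get()) {
                return true;                    // already applied: replay after failover is a no-op
            }
            applyToExternalSystem(op);          // the actual side effect
            lastAppliedId.set(op.id);           // persist before acking in a real implementation
            return true;
        }

        private void applyToExternalSystem(Operation op) {
            // ... whatever the real work against the external system is ...
        }
    }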
On Wed, Jan 13, 2016 at 4:51 PM, Martin Kersten <[email protected]> wrote:
> Hello,
>
> I am not quite aware of the zookeeper specialities, but from my point of view you should think about distributing the task as well, like having multiple nodes being responsible for a task to be done; if one node fails, the other nodes take over and perform/complete the task instead. This would involve having beacon messages and a place you can put your code in.
>
> Locks that time out should actually never happen, since that makes everything go boom and become overly complex. Just bind a lock to the liveness of the node. So if the node is considered dead, free the lock. If the node is a kind of zombie (not reacting (stuck) but responsive in terms of sending i-am-alive beacons (heartbeat)), it is the task of the leader to kill the node remotely or remove the node from the list of members and inform everyone else about it. Once this happens, this would also revoke the lock.
>
> The goal is to simply let the leader kill any node that seems to be malfunctioning in any possible way (like missing a deadline). A node that wants to complete its operation needs to interact with the leader, and at this point in time the node should realize if it was considered dead and should restart by crashing and rebooting instantly.
>
> Another thing you might consider is to renew the lock in certain periods. If you have a workflow and your lock times out in 10 minutes, then every time you make real progress in your workflow renew the lock, giving you another 10 minutes to do the next steps.
>
> This way (as long as you do not have a loop in the workflow) you are safe in assuming that a workflow is being completed in the future. If you need a hard deadline, the node processing the operation might as well check the estimate of the workflow, drop the lock and abort the operation if it estimates the operation is likely to time out, and might even perform a compensation operation.
>
> Cheers,
>
> Martin (Kersten)
>
> 2016-01-13 11:08 GMT+01:00 singh.janmejay <[email protected]>:
>
>> @Alexander: In that scenario, the write of X will be attempted by A, but the external system will not act upon write-X because that operation has already been acted upon in the past. This is guaranteed by the idempotent-operations invariant. But it does point out another problem, which I hadn't handled in my original algorithm. Problem: if X and Y have both not been issued yet, and Y is issued before X towards the external system (because neither operation has executed yet), it'll overwrite Y with X. I need another constraint: the master should only issue 1 operation on a certain external-system at a time and must issue operations in the order of operation-id (sequential-znode sequence number). So we need the following invariants:
>> - order of issuing operations being fixed (matching the order of creation of operations)
>> - concurrency of operations fixed to 1
>> - idempotent execution on the external-system side
>>
>> @Powell: I'm kind of doing the same thing, except the loop doesn't run on the consumer; instead it runs on the master, which is assigning work to consumers. So the triggerWork function is basically changed to issueWork, which is RPC + triggerWork. The replay of history is basically just a replay of 1 operation per operand-node (in this thread we are calling it external-system), so it's as if triggerWork failed, in which case we need to re-execute triggerWork. Idempotency also follows from that requirement. If triggerWork fails in the last step, and all the desired effect that was necessary has happened, we will still need to run triggerWork again, but we need awareness that the actual work has been done, which is why idempotency is necessary.
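To make those three invariants concrete, this is roughly the shape of the master-side issue loop I have in mind, one instance per external system. It is a sketch only: OrderedSingleFlightIssuer, Rpc and the in-memory map are illustrative, and in reality the pending set lives in ZK as sequential znodes. Operations are issued strictly in operation-id order, at most one at a time, and retired only after the idempotent RPC succeeds.

    // Sketch: per-external-system issue loop that preserves operation-id order,
    // keeps concurrency at 1 and retires an operation only after a successful RPC.
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class OrderedSingleFlightIssuer {

        /** The RPC wraps triggerWork on the operand node; it must be idempotent. */
        public interface Rpc {
            boolean issue(long operationId, byte[] payload);
        }

        // operation-id (znode sequence number) -> payload; TreeMap keeps id order
        private final SortedMap<Long, byte[]> pending = new TreeMap<>();
        private final Rpc rpc;

        public OrderedSingleFlightIssuer(Rpc rpc) { this.rpc = rpc; }

        public synchronized void enqueue(long operationId, byte[] payload) {
            pending.put(operationId, payload);
        }

        /** Issue the oldest pending operation; a failed RPC leaves it pending for retry. */
        public synchronized void issueNext() {
            if (pending.isEmpty()) return;
            long oldest = pending.firstKey();                     // invariant: fixed issue order
            boolean ok = rpc.issue(oldest, pending.get(oldest));  // invariant: concurrency of 1
            if (ok) {
                pending.remove(oldest);                           // retire only after success
            }
        }
    }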
>> Btw, thanks for continuing to spare time for this, I really appreciate this feedback/validation.
>>
>> Thoughts?
>>
>> On Wed, Jan 13, 2016 at 3:47 AM, powell molleti <[email protected]> wrote:
>> > Wouldn't a distributed queue recipe work for the consumers? One needs to add extra logic, something like this:
>> >
>> > with lock() {
>> >   p = queue.peek()
>> >   if triggerWork(p) is Done:
>> >     queue.pop()
>> > }
>> >
>> > With this, a consumer that worked on an item but crashed before popping the queue would result in another consumer processing the same work.
>> >
>> > I am not sure of the details of where you are getting the work from and what its scale is, but producers (the leader) could write to this queue. Since there is a guarantee of read-after-write, the producer could delete from its local queue the work that was successfully queued. Then again, a new producer could re-send the last entry of work, so one has to handle that. Without more details on the origin of the work etc. it's hard to design end to end.
>> >
>> > I do not see a need to do a total replay of past history when using a ZK-like system, because ZK is built on the idea of a serialized and replicated log; hence if you are using ZK then your design should be much simpler, i.e. fail and restart from the last known transaction.
>> >
>> > Powell.
>> >
>> > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <[email protected]> wrote:
>> > Hi,
>> >
>> > With your suggestion, the following scenario seems possible: master A is about to write value X to an external system so it logs it to ZK, then freezes for some time, ZK suspects it as failed, another master B is elected, writes X (completing what A wanted to do), then starts doing something else and writes Y. Then leader A "wakes up" and re-executes the old operation, writing X, which is now stale.
>> >
>> > Perhaps if your external system supports conditional updates this can be avoided - a write of X only works if the current state is as expected.
>> >
>> > Alex
>> >
>> > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <[email protected]> wrote:
>> >
>> >> Thanks for the replies everyone, most of it was very useful.
>> >>
>> >> @Alexander: The section of the Chubby paper you pointed me to tries to address exactly what I was looking for. That clearly is one good way of doing it. I'm also thinking of an alternative way and could use a review or some feedback on that.
>> >>
>> >> @Powell: About x509 auth on intra-cluster communication, I don't have a blocking need for it, as it can be achieved by setting up firewall rules to accept only from desired hosts. It may be a good-to-have thing though, because in cloud-based scenarios where IP addresses are re-used, a recycled IP can still talk to a secure zk-cluster unless the config is changed to remove the old peer IP and replace it with the new one. It's clearly a corner case though.
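To put Powell's peek/process/pop loop from a few messages up into more concrete terms, here is a sketch only: DistributedLock, WorkQueue and Worker are made-up stand-ins for whatever lock and queue recipes are actually used. The item is popped only after triggerWork reports done, so a consumer crash in between means another consumer re-processes the same item, which is exactly why triggerWork has to be idempotent.

    // Sketch of the peek/process/pop consumer loop under a distributed lock.
    public class QueueConsumer {

        public interface DistributedLock extends AutoCloseable {
            @Override void close();                 // releases the lock
        }
        public interface WorkQueue {
            byte[] peek();                          // null when empty
            void pop();                             // removes the head
        }
        public interface Worker {
            boolean triggerWork(byte[] item);       // must be idempotent
        }

        private final WorkQueue queue;
        private final Worker worker;
        private final java.util.function.Supplier<DistributedLock> lockFactory;

        public QueueConsumer(WorkQueue queue, Worker worker,
                             java.util.function.Supplier<DistributedLock> lockFactory) {
            this.queue = queue;
            this.worker = worker;
            this.lockFactory = lockFactory;
        }

        public void drainOnce() {
            try (DistributedLock lock = lockFactory.get()) {   // acquire the distributed lock
                byte[] head = queue.peek();
                if (head != null && worker.triggerWork(head)) {
                    queue.pop();                    // a crash before this line causes a re-run,
                }                                   // hence triggerWork must be idempotent
            }
        }
    }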
>> >> Here is the approach I'm thinking of:
>> >> - Implement all operations (at least master-triggered operations) on operand machines idempotently.
>> >> - Have the master journal these operations to ZK before issuing the RPC.
>> >> - In case the master fails with some of these operations in flight, the newly elected master will need to read all issued (but not yet retired) operations and issue them again.
>> >> - The existing master (before or after failure) can retry and retire operations according to whatever the retry policy and success criterion is.
>> >>
>> >> Why am I thinking of this as opposed to going with Chubby sequencer passing:
>> >> - I need to implement idempotency regardless, because the recovery path involving master death after successful execution of an operation but before writing the ack to the coordination service requires it. So the idempotent-implementation complexity is here to stay.
>> >> - I would need to increase the surface area of the architecture which is exposed to the coordination service for sequencer validation, which may bring a verification RPC into the data plane in some cases.
>> >> - The sequencer may expire after verification but before the ack, in which case the new master may not recognize the operation as consistent with its decisions (or previous decision path).
>> >>
>> >> Thoughts? Suggestions?
>> >>
>> >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <[email protected]> wrote:
>> >> > regarding atomic multi-znode updates -- check out "multi" updates <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>.
>> >> >
>> >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <[email protected]> wrote:
>> >> >
>> >> >> for 1, see the chubby paper <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>, section 2.4.
>> >> >> for 2, I'm not sure I fully understand the question, but essentially, ZK guarantees that consistency of updates is preserved even during failures. The user doesn't need to do anything in particular to guarantee this, even during leader failures. In such a case, some suffix of the operations executed by the leader may be lost if they weren't previously acked by a majority. However, none of these operations could have been visible to reads.
>> >> >>
>> >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <[email protected]> wrote:
>> >> >>
>> >> >>> Hi Janmejay,
>> >> >>>
>> >> >>> Regarding question 1, if a node takes a lock and the lock has timed out from the system's perspective, then it can mean that other nodes are free to take the lock and work on the resource. Hence the history could be well into the future by the time the previous node discovers the time-out. The question of rollback in this specific context depends on the implementation details: is the lock holder updating some common area? Then there could be corruption, since other nodes are free to write in parallel to the first one. In the usual sense, a time-out of a held lock means the node which held the lock is dead. It is up to the implementation to ensure this; with that primitive, if there is a timeout, other nodes are sure that no one else is working on the resource and hence can move forward.
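Coming back to the journal-before-RPC approach a few messages up: with the plain ZooKeeper Java client it could look roughly like this. A sketch under assumptions only -- the /operations/<node> path layout, the op- prefix and the Rpc interface are made up for illustration. The master journals the operation as a persistent sequential znode before issuing the RPC, deletes it once it is retired, and a newly elected master simply re-issues whatever children are still present, in sequence order.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: journal an operation under <root>/op-<seq> before the RPC, retire
    // (delete) it after the RPC succeeds, and replay unretired ones on failover.
    public class OperationJournal {

        public interface Rpc { boolean issue(byte[] payload); }

        private final ZooKeeper zk;
        private final String root;   // e.g. "/operations/datanode-7" -- illustrative layout

        public OperationJournal(ZooKeeper zk, String root) {
            this.zk = zk;
            this.root = root;
        }

        /** Journal first, then issue; returns the journal path so it can be retired later. */
        public String journal(byte[] payload) throws KeeperException, InterruptedException {
            return zk.create(root + "/op-", payload,
                             ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        }

        /** Retire after the (idempotent) RPC has succeeded. */
        public void retire(String path) throws KeeperException, InterruptedException {
            zk.delete(path, -1);
        }

        /** A newly elected master re-issues everything that was journaled but never retired. */
        public void replay(Rpc rpc) throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren(root, false);
            Collections.sort(children);                    // op-0000000001, op-0000000002, ...
            for (String child : children) {                // strict sequence order, one at a time
                String path = root + "/" + child;
                byte[] payload = zk.getData(path, false, null);
                if (rpc.issue(payload)) {                  // safe because the operation is idempotent
                    retire(path);
                }
            }
        }
    }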
>> >> >>> Question 2 seems to imply the assumption that the leader has significant work to do and that leader change is quite common, which seems contrary to the common implementation pattern. If the work can be broken down into smaller chunks which need serialization separately, then each chunk/work type can have a different leader.
>> >> >>>
>> >> >>> For question 3, ZK does support auth and encryption for client connections but not for inter-ZK-node channels. Do you have a requirement to secure inter-ZK-node communication? Can you let us know what your requirements are, so we can implement a solution to fit all needs?
>> >> >>>
>> >> >>> For question 4, the official implementation is C; people tend to wrap that with C++, and there should be projects that use ZK doing that. You can look them up and see if you can separate it out and use them.
>> >> >>>
>> >> >>> Hope this helps. Powell.
>> >> >>>
>> >> >>> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <[email protected]> wrote:
>> >> >>>
>> >> >>> Q: What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires.
>> >> >>>
>> >> >>> If you are using Java, a way I can see doing this is by using the ExecutorCompletionService <https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html>. It allows you to keep your workers in a group. You can poll the group and provide cancel semantics as needed.
>> >> >>>
>> >> >>> An example of that service is here: <https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java>, where I am issuing multiple reads and I want to abandon the process if they do not finish in a while. Many async/promise frameworks do this by launching two tasks, a ComputationTask and a TimeoutTask that returns in 10 seconds. Then they ask the completion service to poll. If the service is given the TimeoutTask after the timeout, that means the computation did not finish in time.
>> >> >>>
>> >> >>> Q: Do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)?
>> >> >>>
>> >> >>> The base issue is that Java's synchronized/AtomicReference do not have a rollback.
>> >> >>>
>> >> >>> There are a few ways I know to work around this. Clojure has STM (software transactional memory), such that if an exception is thrown inside a dosync, all of the steps inside the critical block never happened. This assumes you're using all Clojure structures, which you are probably not.
>> >> >>>
>> >> >>> A way co-workers have done this is as follows. Move your entire transactional state into a SINGLE big object that you can copy/mutate/compare-and-swap. You never need to roll back each piece, because you're changing the clone up until the point you commit it.
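A minimal sketch of that copy/mutate/compare-and-swap pattern (AppState and its fields are made up for illustration): all the state that the critical section may touch lives behind a single AtomicReference, the mutation happens on a copy, and compareAndSet either publishes the whole thing or nothing, so there is nothing to roll back if the computation is abandoned mid-way.

    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of the copy/mutate/compare-and-swap pattern: all state lives in one
    // immutable object, so there is nothing to undo if we bail out half-way through.
    public class BigObjectState {

        /** Immutable snapshot of everything the critical section may touch. */
        public static final class AppState {
            final long version;
            final String owner;
            AppState(long version, String owner) { this.version = version; this.owner = owner; }
        }

        private final AtomicReference<AppState> state =
                new AtomicReference<>(new AppState(0, "nobody"));

        /** Returns true if the whole update was committed, false if someone else won the race. */
        public boolean tryTakeOwnership(String newOwner) {
            AppState current = state.get();                       // read the snapshot
            AppState updated = new AppState(current.version + 1,  // mutate a copy
                                            newOwner);
            return state.compareAndSet(current, updated);         // commit atomically, or not at all
        }
    }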
>> >> >>> Writing reversal code is possible depending on the problem. There are questions to ask, like "what if the reversal somehow fails?"
>> >> >>>
>> >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <[email protected]> wrote:
>> >> >>>
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > Was wondering if there are any reference designs or patterns on handling common operations involving distributed coordination.
>> >> >>> >
>> >> >>> > I have a few questions, and I guess they must have been asked before; I am unsure what to search for to surface the right answers. It'll be really valuable if someone can provide links to a relevant "best-practices guide" or "suggestions" per question, or share some wisdom or ideas on patterns to do this in the best way.
>> >> >>> >
>> >> >>> > 1. What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires. When it finishes that computation, it can tell that the lock expired, but do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)? Or is the right thing to do to write reversal code, such that operations can be cleanly undone in case the verification at the end of the computation shows that the lock expired? The latter is obviously a lot harder to achieve.
>> >> >>> >
>> >> >>> > 2. Same as above for leader-election scenarios. The leader generally administers operations on data-systems that take significant time to complete and have significant resource overhead, and RPC to administer such operations synchronously from leader to data-node can't be atomic and can't be made latency-resilient to such a degree that issuing an operation across a large set of nodes in a cluster can be guaranteed to finish without a leader change. What do people generally do in such situations? How are timeouts handled for operations issued using sequential znodes as a per-datanode dedicated queue? How well does it scale, and what are some things to watch out for (operation size, encoding, clustering into one znode for atomicity etc.)? Or how are atomic operations that need to be issued across multiple data-nodes managed (do they have to be clobbered into one znode)?
>> >> >>> >
>> >> >>> > 3. How do people secure ZooKeeper-based services? Is client-certificate verification the recommended way? How well does this work with the C client? Is inter-zk-node communication done with X509 auth too?
>> >> >>> >
>> >> >>> > 4. What other projects, reference implementations or libraries should I look at for working with the C client?
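For question 1, Edward's ExecutorCompletionService suggestion further up translates roughly into the following sketch. The deadline here stands in for whatever remaining lock/session validity the lock recipe exposes, and cancellation is simplified to shutting down the pool; none of this is a prescribed recipe, just an illustration of the poll-with-deadline idea.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch: run the lock-guarded computation on an executor and abandon it if it
    // does not complete before the deadline (e.g. the remaining lock validity).
    public class DeadlineGuard {

        public static <T> T runWithDeadline(Callable<T> computation, long deadlineMillis)
                throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            try {
                ExecutorCompletionService<T> ecs = new ExecutorCompletionService<>(pool);
                ecs.submit(computation);
                Future<T> done = ecs.poll(deadlineMillis, TimeUnit.MILLISECONDS);
                if (done == null) {
                    // Deadline passed before the computation finished: treat as lock expiry.
                    throw new TimeoutException("lock validity expired mid-computation");
                }
                return done.get();
            } finally {
                pool.shutdownNow();   // best-effort cancellation of a still-running computation
            }
        }
    }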
>> >> >>> > Most of what I have asked revolves around the leader or lock-owner having a false-failure (where it doesn't know that the coordinator thinks it has failed).
>> >> >>> >
>> >> >>> > --
>> >> >>> > Regards,
>> >> >>> > Janmejay
>> >> >>> > http://codehunk.wordpress.com
>> >>
>> >> --
>> >> Regards,
>> >> Janmejay
>> >> http://codehunk.wordpress.com
>>
>> --
>> Regards,
>> Janmejay
>> http://codehunk.wordpress.com

--
Regards,
Janmejay
http://codehunk.wordpress.com
