I may be wrong, but I don't think that being idempotent gives you what you said. Just because f(f(x)) = f(x), it doesn't follow that f(g(f(x))) = g(f(x)) -- which is exactly my scenario. But if your system can detect that X was already executed (or if the operations are conditional on state), my scenario indeed can't happen.
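To make the distinction concrete, here is a minimal Java sketch (the store, its version counter and the method names are all made up for illustration): a blind idempotent put lets the woken-up master A clobber B's later write Y, while a version-conditioned put rejects A's stale retry.

    // Hypothetical external system: a value plus a version counter (names are illustrative).
    class VersionedStore {
        private String value = "initial";
        private long version = 0;

        // Idempotent but unconditional: re-applying an old write still overwrites newer state.
        synchronized void put(String v) {
            if (!v.equals(value)) { value = v; version++; }
        }

        // Conditional update: applies only if the caller saw the current version.
        synchronized boolean putIfVersion(String v, long expectedVersion) {
            if (version != expectedVersion) return false; // stale writer is rejected
            value = v;
            version++;
            return true;
        }

        synchronized long currentVersion() { return version; }
        synchronized String get() { return value; }
    }

    public class StaleWriteDemo {
        public static void main(String[] args) {
            VersionedStore store = new VersionedStore();

            long versionSeenByA = store.currentVersion(); // master A logs intent to write X, then freezes

            store.put("X"); // master B takes over and completes X:  state = f(x)
            store.put("Y"); // B moves on and writes Y:              state = g(f(x))

            // A wakes up. A blind retry store.put("X") would clobber Y even though put is idempotent.
            // A version-conditioned retry is rejected instead:
            boolean applied = store.putIfVersion("X", versionSeenByA);
            System.out.println("stale retry applied? " + applied + ", value = " + store.get());
            // prints: stale retry applied? false, value = Y
        }
    }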
On Wed, Jan 13, 2016 at 2:08 AM, singh.janmejay <[email protected]> wrote:
> @Alexander: In that scenario, the write of X will be attempted by A, but the external system will not act upon write-X because that operation has already been acted upon in the past. This is guaranteed by the idempotent-operations invariant. But it does point out another problem, which I hadn't handled in my original algorithm. Problem: if X and Y have both not been issued yet, and Y is issued before X towards the external system, then because neither operation has executed yet, X will overwrite Y. I need another constraint: the master should only issue 1 operation on a certain external-system at a time and must issue operations in the order of operation-id (sequential-znode sequence number). So we need the following invariants:
> - order of issuing operations is fixed (matching the order of creation of operations)
> - concurrency of operations fixed to 1
> - idempotent execution on the external-system side
>
> @Powell: I'm kind of doing the same thing, except the loop doesn't run on the consumer; it runs on the master, which is assigning work to consumers. So the triggerWork function basically becomes issueWork, which is RPC + triggerWork. The replay of history is basically just a replay of 1 operation per operand-node (in this thread we are calling it external-system), so it's as if triggerWork failed, in which case we need to re-execute triggerWork. Idempotency also follows from that requirement: if triggerWork fails in the last step, and all the desired effect has already happened, we will still need to run triggerWork again, but we need awareness that the actual work has been done, which is why idempotency is necessary.
>
> Btw, thanks for continuing to spare time for this, I really appreciate this feedback/validation.
>
> Thoughts?
>
> On Wed, Jan 13, 2016 at 3:47 AM, powell molleti <[email protected]> wrote:
> > Wouldn't a distributed queue recipe work for the consumer? One needs to add extra logic, something like this:
> >
> > with lock() {
> >   p = queue.peek()
> >   if triggerWork(p) is Done:
> >     queue.pop()
> > }
> >
> > With this, a consumer that worked on an item but crashed before popping the queue would result in another consumer processing the same work.
> >
> > I am not sure of the details of where you are getting the work from and what its scale is, but producers (the leader) could write to this queue. Since there is a guarantee of read-after-write, the producer could delete from its local queue the work that was successfully queued. A new producer could then re-send the last entry of work, so one has to handle that. Without more details on the origin of the work etc. it's hard to design end to end.
> >
> > I do not see a need to do a total replay of past history when using a ZK-like system, because ZK is built on the idea of a serialized and replicated log; hence if you are using ZK then your design should be much simpler, i.e. fail and re-start from the last known transaction.
> >
> > Powell.
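For what it's worth, a minimal sketch of that master-side issue loop against ZooKeeper might look like the code below. It assumes operations are journalled as PERSISTENT_SEQUENTIAL children of a per-external-system znode (the /ops/... layout, the OpLog wrapper and the issueWork RPC are hypothetical); issuing strictly in sequence order, one operation at a time, and retiring only after the ack gives the three invariants above, provided the external system really does execute idempotently.

    import org.apache.zookeeper.*;
    import java.util.Collections;
    import java.util.List;

    // Sketch only: journal-then-issue loop run by the current master.
    public class OpLog {
        private final ZooKeeper zk;
        private final String opsPath; // e.g. "/ops/external-system-1" (hypothetical layout)

        OpLog(ZooKeeper zk, String opsPath) { this.zk = zk; this.opsPath = opsPath; }

        // Journal the operation in ZK *before* any RPC is sent.
        public String journal(byte[] op) throws KeeperException, InterruptedException {
            return zk.create(opsPath + "/op-", op,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        }

        // Issue all journalled-but-not-retired operations, oldest first, one at a time.
        // A newly elected master runs the same loop, which re-issues anything in flight;
        // that is safe only because the external system executes operations idempotently.
        public void drain() throws KeeperException, InterruptedException {
            List<String> pending = zk.getChildren(opsPath, false);
            Collections.sort(pending); // sequence numbers give creation order
            for (String node : pending) {
                String path = opsPath + "/" + node;
                byte[] op = zk.getData(path, false, null);
                issueWork(op);           // RPC + triggerWork on the external system
                zk.delete(path, -1);     // retire only after the RPC is acked
            }
        }

        // Hypothetical RPC to the external system; must be idempotent on that side.
        private void issueWork(byte[] op) { /* ... */ }
    }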
> > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <[email protected]> wrote:
> > Hi,
> >
> > With your suggestion, the following scenario seems possible: master A is about to write value X to an external system, so it logs it to ZK, then freezes for some time; ZK suspects it as failed, another master B is elected, writes X (completing what A wanted to do), then starts doing something else and writes Y. Then leader A "wakes up" and re-executes the old operation, writing X, which is now stale.
> >
> > Perhaps if your external system supports conditional updates this can be avoided - a write of X only works if the current state is as expected.
> >
> > Alex
> >
> > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <[email protected]> wrote:
> >> Thanks for the replies everyone, most of it was very useful.
> >>
> >> @Alexander: The section of the chubby paper you pointed me to tries to address exactly what I was looking for. That is clearly one good way of doing it. I'm also thinking of an alternative way and could use a review or some feedback on that.
> >>
> >> @Powell: About x509 auth on intra-cluster communication, I don't have a blocking need for it, as it can be achieved by setting up firewall rules to accept only from desired hosts. It may be a good-to-have thing though, because in cloud-based scenarios where IP addresses are re-used, a recycled IP can still talk to a secure zk-cluster unless the config is changed to remove the old peer IP and replace it with the new one. It's clearly a corner-case though.
> >>
> >> Here is the approach I'm thinking of:
> >> - Implement all operations (at least master-triggered operations) on operand machines idempotently.
> >> - Have the master journal these operations to ZK before issuing the RPC.
> >> - In case the master fails with some of these operations in flight, the newly elected master will need to read all issued (but not yet retired) operations and issue them again.
> >> - The existing master (before failure or after failure) can retry and retire operations according to whatever the retry policy and success-criterion is.
> >>
> >> Why am I thinking of this as opposed to going with chubby sequencer passing:
> >> - I need to implement idempotency regardless, because the recovery path involving master-death after successful execution of an operation but before writing the ack to the coordination service requires it. So the idempotent-implementation complexity is here to stay.
> >> - I would need to increase the surface-area of the architecture that is exposed to the coordination-service for sequencer validation, which may bring a verification RPC into the data-plane in some cases.
> >> - The sequencer may expire after verification but before the ack, in which case the new master may not recognize the operation as consistent with its decisions (or previous decision path).
> >>
> >> Thoughts? Suggestions?
> >>
> >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <[email protected]> wrote:
> >> > regarding atomic multi-znode updates -- check out "multi" updates
> >> > <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>.
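As a side note, the multi API referenced above bundles several znode operations into one atomic transaction; a small hedged example (paths, versions and payloads are all made up) could look like this:

    import org.apache.zookeeper.*;
    import java.util.Arrays;

    public class MultiExample {
        // Atomically check leadership, retire an issued operation and record its result:
        // either every Op in the list is applied, or none are. Paths and versions are made up.
        static void retireAtomically(ZooKeeper zk) throws KeeperException, InterruptedException {
            zk.multi(Arrays.asList(
                    Op.check("/masters/current", 3),                        // fail if the master znode changed
                    Op.delete("/ops/external-system-1/op-0000000042", -1),  // retire the issued operation
                    Op.create("/results/op-0000000042", "ok".getBytes(),
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
            // If any check or operation fails, multi() throws KeeperException and nothing is applied.
        }
    }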
> >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <[email protected]> wrote:
> >> >> for 1, see the chubby paper
> >> >> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
> >> >> section 2.4.
> >> >> for 2, I'm not sure I fully understand the question, but essentially, ZK guarantees that consistency of updates is preserved even during failures. The user doesn't need to do anything in particular to guarantee this, even during leader failures. In such a case, some suffix of the operations executed by the leader may be lost if they weren't previously acked by a majority. However, none of these operations could have been visible to reads.
> >> >>
> >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <[email protected]> wrote:
> >> >>> Hi Janmejay,
> >> >>>
> >> >>> Regarding question 1, if a node takes a lock and the lock has timed out from the system's perspective, then other nodes are free to take the lock and work on the resource. Hence the history could be well into the future by the time the previous node discovers the time-out. The question of rollback in this specific context depends on the implementation details: is the lock holder updating some common area? Then there could be corruption, since other nodes are free to write in parallel with the first one. In the usual sense, a time-out of a held lock means the node which held the lock is dead. It is up to the implementation to ensure this, and, using this primitive, a timeout means other nodes can be sure that no one else is working on the resource and hence can move forward.
> >> >>>
> >> >>> Question 2 seems to imply the assumption that the leader has significant work to do and that leader change is quite common, which seems contrary to the common implementation pattern. If the work can be broken down into smaller chunks which need serialization separately, then each chunk/work type can have a different leader.
> >> >>>
> >> >>> For question 3, ZK does support auth and encryption for client connections but not for inter-ZK-node channels. Do you have a requirement to secure inter-ZK-node channels? Can you let us know what your requirements are so we can implement a solution to fit all needs?
> >> >>>
> >> >>> For question 4, the official implementation is C; people tend to wrap that with C++, and there should be projects that use ZK doing that. You can look them up and see if you can separate it out and use them.
> >> >>>
> >> >>> Hope this helps.
> >> >>> Powell.
> >> >>>
> >> >>> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <[email protected]> wrote:
> >> >>> Q: What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires.
> >> >>>
> >> >>> If you are using Java, a way I can see doing this is by using the ExecutorCompletionService
> >> >>> <https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html>.
> >> >>> It allows you to keep your workers in a group; you can poll the group and provide cancel semantics as needed. An example of that service is here:
> >> >>> <https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java>
> >> >>> where I am issuing multiple reads and I want to abandon the process if they do not finish in a while. Many async/promise frameworks do this by launching two tasks, a ComputationTask and a TimeoutTask that returns in 10 seconds. Then they ask the completion service to poll. If the service is given the TimeoutTask after the timeout, that means the Computation did not finish in time.
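A small sketch of that poll-with-deadline pattern (the task and timeout values are illustrative, not a prescription): submit the real work to an ExecutorCompletionService and poll it with a deadline; if nothing completes in time, treat the lock/session as possibly expired and cancel the in-flight work.

    import java.util.concurrent.*;

    public class TimedComputation {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            ExecutorCompletionService<String> ecs = new ExecutorCompletionService<>(pool);

            // The real work; in the lock scenario this runs while the ZK session is held.
            Future<String> computation = ecs.submit(() -> {
                Thread.sleep(15_000); // pretend this is the long computation
                return "computation-result";
            });

            // Poll with a deadline instead of blocking forever.
            Future<String> done = ecs.poll(10, TimeUnit.SECONDS);
            if (done == null) {
                // Nothing finished within the timeout: assume the lock/session may have expired,
                // cancel the in-flight work and let the recovery path (re-issue, idempotent ops) run.
                computation.cancel(true);
                System.out.println("timed out; aborting computation");
            } else {
                try {
                    System.out.println("finished: " + done.get());
                } catch (ExecutionException e) {
                    System.out.println("computation failed: " + e.getCause());
                }
            }
            pool.shutdownNow();
        }
    }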
> >> >>>
> >> >>> Do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible, and if so what are some of those clever ways)?
> >> >>>
> >> >>> The base issue is that Java's synchronized / AtomicReference do not have a rollback.
> >> >>>
> >> >>> There are a few ways I know to work around this. Clojure has STM (Software Transactional Memory), such that if an exception is thrown inside a dosync, none of the changes inside the critical block ever happened. This assumes you're using all Clojure structures, which you are probably not.
> >> >>> A way co-workers have done this is as follows: move your entire transactional state into a SINGLE big object that you can copy/mutate/compare-and-swap. You never need to roll back each piece because you're changing the clone up until the point you commit it.
> >> >>> Writing reversal code is possible depending on the problem. There are questions to ask like "what if the reversal somehow fails?"
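A minimal sketch of that "single big object" approach (the State class and its fields are made up): keep all mutable state behind one AtomicReference, mutate a private copy, and publish it with compareAndSet, so an abort simply means never publishing.

    import java.util.concurrent.atomic.AtomicReference;

    public class BigObjectCas {
        // All transactional state lives in one immutable snapshot (fields are illustrative).
        static final class State {
            final long appliedOpId;
            final String lastResult;
            State(long appliedOpId, String lastResult) {
                this.appliedOpId = appliedOpId;
                this.lastResult = lastResult;
            }
        }

        private final AtomicReference<State> current = new AtomicReference<>(new State(0, ""));

        // Work on a private copy; nothing is visible until the final compareAndSet.
        boolean apply(long opId, String result) {
            State before = current.get();
            State after = new State(opId, result); // "mutate the clone"
            // If leadership or the lock turned out to be lost, simply return without publishing:
            // no rollback is needed because 'before' was never modified.
            return current.compareAndSet(before, after);
        }
    }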
> >> >>>
> >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <[email protected]> wrote:
> >> >>> > Hi,
> >> >>> >
> >> >>> > Was wondering if there are any reference designs or patterns on handling common operations involving distributed coordination.
> >> >>> >
> >> >>> > I have a few questions, and I guess they must have been asked before; I am unsure what to search for to surface the right answers. It'll be really valuable if someone can provide links to a relevant "best-practices guide" or "suggestions" per question, or share some wisdom or ideas on patterns to do this in the best way.
> >> >>> >
> >> >>> > 1. What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session or lock expires. When it finishes that computation, it can tell that the lock expired, but do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)? Or is the right thing to do to write reversal-code, such that operations can be cleanly undone in case the verification at the end of the computation shows that the lock expired? The latter is obviously a lot harder to achieve.
> >> >>> >
> >> >>> > 2. Same as above for leader-election scenarios. A leader generally administers operations on data-systems that take significant time to complete and have significant resource overhead, and RPC to administer such operations synchronously from leader to data-node can't be atomic and can't be made latency-resilient to such a degree that issuing an operation across a large set of nodes in a cluster can be guaranteed to finish without a leader-change. What do people generally do in such situations? How are timeouts handled for operations issued using a sequential-znode as a per-datanode dedicated queue? How well does it scale, and what are some things to watch out for (operation-size, encoding, clustering into one znode for atomicity, etc.)? Or how are atomic operations that need to be issued across multiple data-nodes managed (do they have to be clobbered into one znode)?
> >> >>> >
> >> >>> > 3. How do people secure zookeeper-based services? Is client-certificate verification the recommended way? How well does this work with the C client? Is inter-zk-node communication done with X509-auth too?
> >> >>> >
> >> >>> > 4. What other projects, reference-implementations or libraries should I look at for working with the C client?
> >> >>> >
> >> >>> > Most of what I have asked revolves around the leader or lock-owner having a false-failure (where it doesn't know that the coordinator thinks it has failed).
> >> >>> >
> >> >>> > --
> >> >>> > Regards,
> >> >>> > Janmejay
> >> >>> > http://codehunk.wordpress.com
> >>
> >> --
> >> Regards,
> >> Janmejay
> >> http://codehunk.wordpress.com
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com
