@Martin: Having the master react to achieve operation-timeout makes sense; I never thought of it that way, and it seems really effective. I can't see it failing in any scenario. The only case where it has a (solvable) problem is when the lock-owner dies and never comes back, in which case it will need manual assistance (someone confirming that the lock-owner is dead for good) or a very high-level control, like an API to kill the machine which is supposed to be running the lock-owner.

About work distribution: I have a stateful set of nodes, so the traditional queue model, which assumes all workers are equal, doesn't work. A dedicated queue per worker would, but that still exposes a larger surface area of the architecture to the coordination service and increases the scaling burden on ZooKeeper. It doesn't really simplify the idempotency requirement anyway, so it's equal to RPC-triggered work in implementation complexity.
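Just to make the idempotency requirement concrete (since it keeps coming up), this is roughly what I have in mind on the operand/worker side. It is a minimal sketch only: Operation, applyToExternalSystem and lastAppliedId are illustrative names, not from any real library. The worker remembers the highest operation-id it has applied, so a replayed triggerWork for an already-applied operation becomes a no-op. The master-side counterpart (journaling and replay) is sketched further down in the thread.

    // Sketch: a worker that makes triggerWork idempotent by tracking the last
    // operation-id it has applied. In a real node lastAppliedId would be durable.
    import java.util.concurrent.atomic.AtomicLong;

    public class IdempotentWorker {

        /** Operation-id = sequence number of the sequential znode the master journaled. */
        public static final class Operation {
            final long id;
            final byte[] payload;
            public Operation(long id, byte[] payload) { this.id = id; this.payload = payload; }
        }

        private final AtomicLong lastAppliedId = new AtomicLong(-1);

        /** Called by the master over RPC; safe to call any number of times per operation. */
        public boolean triggerWork(Operation op) {
            if (op.id <= lastAppliedId.get()) {
                return true;                    // already applied: replay after failover is a no-op
            }
            applyToExternalSystem(op);          // the actual side effect
            lastAppliedId.set(op.id);           // persist before acking in a real implementation
            return true;
        }

        private void applyToExternalSystem(Operation op) {
            // ... whatever the real work against the external system is ...
        }
    }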
On Wed, Jan 13, 2016 at 4:51 PM, Martin Kersten <[email protected]> wrote:
> Hello,
>
> I am not quite aware of the zookeeper specialities, but from my point of view you should think about distributing the task as well, like having multiple nodes being responsible for a task to be done; if one node fails, the other nodes take over and perform/complete the task instead. This would involve having beacon messages and a place you can put your code in.
>
> Locks that time out should actually never happen, since that makes everything go boom and become overly complex. Just bind a lock to the liveness of the node. So if the node is considered dead, free the lock. If the node is a kind of zombie (not reacting (stuck) but responsive in terms of sending i-am-alive beacons (heartbeat)), it is the task of the leader to kill the node remotely or remove the node from the list of members and inform everyone else about it. Once this happens, this would also revoke the lock.
>
> The goal is to simply let the leader kill any node that seems to be malfunctioning in any possible way (like missing a deadline). A node that wants to complete its operation needs to interact with the leader, and at this point in time the node should realize if it was considered dead and should restart by crashing and rebooting instantly.
>
> Another thing you might consider is to renew the lock in certain periods. If you have a workflow and your lock times out in 10 minutes, then every time you make real progress in your workflow renew the lock, giving you another 10 minutes to do the next steps.
>
> This way (as long as you do not have a loop in the workflow) you are safe in assuming that a workflow is being completed in the future. If you need a hard deadline, the node processing the operation might as well check the estimate of the workflow, drop the lock and abort the operation if it estimates the operation is likely to time out, and might even perform a compensation operation.
>
> Cheers,
>
> Martin (Kersten)
>
> 2016-01-13 11:08 GMT+01:00 singh.janmejay <[email protected]>:
>
>> @Alexander: In that scenario, the write of X will be attempted by A, but the external system will not act upon write-X because that operation has already been acted upon in the past. This is guaranteed by the idempotent-operations invariant. But it does point out another problem, which I hadn't handled in my original algorithm. Problem: if X and Y have both not been issued yet, and Y is issued before X towards the external system (because neither operation has executed yet), it'll overwrite Y with X. I need another constraint: the master should only issue 1 operation on a certain external-system at a time and must issue operations in the order of operation-id (sequential-znode sequence number). So we need the following invariants:
>> - order of issuing operations being fixed (matching the order of creation of operations)
>> - concurrency of operations fixed to 1
>> - idempotent execution on the external-system side
>>
>> @Powell: I'm kind of doing the same thing, except the loop doesn't run on the consumer; instead it runs on the master, which is assigning work to consumers. So the triggerWork function is basically changed to issueWork, which is RPC + triggerWork. The replay of history is basically just a replay of 1 operation per operand-node (in this thread we are calling it external-system), so it's as if triggerWork failed, in which case we need to re-execute triggerWork. Idempotency also follows from that requirement. If triggerWork fails in the last step, and all the desired effect that was necessary has happened, we will still need to run triggerWork again, but we need awareness that the actual work has been done, which is why idempotency is necessary.
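To make those three invariants concrete, this is roughly the shape of the master-side issue loop I have in mind, one instance per external system. It is a sketch only: OrderedSingleFlightIssuer, Rpc and the in-memory map are illustrative, and in reality the pending set lives in ZK as sequential znodes. Operations are issued strictly in operation-id order, at most one at a time, and retired only after the idempotent RPC succeeds.

    // Sketch: per-external-system issue loop that preserves operation-id order,
    // keeps concurrency at 1 and retires an operation only after a successful RPC.
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class OrderedSingleFlightIssuer {

        /** The RPC wraps triggerWork on the operand node; it must be idempotent. */
        public interface Rpc {
            boolean issue(long operationId, byte[] payload);
        }

        // operation-id (znode sequence number) -> payload; TreeMap keeps id order
        private final SortedMap<Long, byte[]> pending = new TreeMap<>();
        private final Rpc rpc;

        public OrderedSingleFlightIssuer(Rpc rpc) { this.rpc = rpc; }

        public synchronized void enqueue(long operationId, byte[] payload) {
            pending.put(operationId, payload);
        }

        /** Issue the oldest pending operation; a failed RPC leaves it pending for retry. */
        public synchronized void issueNext() {
            if (pending.isEmpty()) return;
            long oldest = pending.firstKey();                     // invariant: fixed issue order
            boolean ok = rpc.issue(oldest, pending.get(oldest));  // invariant: concurrency of 1
            if (ok) {
                pending.remove(oldest);                           // retire only after success
            }
        }
    }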
>> Btw, thanks for continuing to spare time for this, I really appreciate this feedback/validation.
>>
>> Thoughts?
>>
>> On Wed, Jan 13, 2016 at 3:47 AM, powell molleti <[email protected]> wrote:
>> > Wouldn't a distributed queue recipe work for the consumers? One needs to add extra logic, something like this:
>> >
>> > with lock() {
>> >   p = queue.peek()
>> >   if triggerWork(p) is Done:
>> >     queue.pop()
>> > }
>> >
>> > With this, a consumer that worked on an item but crashed before popping the queue would result in another consumer processing the same work.
>> >
>> > I am not sure of the details of where you are getting the work from and what its scale is, but producers (the leader) could write to this queue. Since there is a guarantee of read-after-write, the producer could delete from its local queue the work that was successfully queued. Then again, a new producer could re-send the last entry of work, so one has to handle that. Without more details on the origin of the work etc. it's hard to design end to end.
>> >
>> > I do not see a need to do a total replay of past history when using a ZK-like system, because ZK is built on the idea of a serialized and replicated log; hence if you are using ZK then your design should be much simpler, i.e. fail and restart from the last known transaction.
>> >
>> > Powell.
>> >
>> > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <[email protected]> wrote:
>> > Hi,
>> >
>> > With your suggestion, the following scenario seems possible: master A is about to write value X to an external system so it logs it to ZK, then freezes for some time, ZK suspects it as failed, another master B is elected, writes X (completing what A wanted to do), then starts doing something else and writes Y. Then leader A "wakes up" and re-executes the old operation, writing X, which is now stale.
>> >
>> > Perhaps if your external system supports conditional updates this can be avoided - a write of X only works if the current state is as expected.
>> >
>> > Alex
>> >
>> > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <[email protected]> wrote:
>> >
>> >> Thanks for the replies everyone, most of it was very useful.
>> >>
>> >> @Alexander: The section of the Chubby paper you pointed me to tries to address exactly what I was looking for. That clearly is one good way of doing it. I'm also thinking of an alternative way and could use a review or some feedback on that.
>> >>
>> >> @Powell: About x509 auth on intra-cluster communication, I don't have a blocking need for it, as it can be achieved by setting up firewall rules to accept only from desired hosts. It may be a good-to-have thing though, because in cloud-based scenarios where IP addresses are re-used, a recycled IP can still talk to a secure zk-cluster unless the config is changed to remove the old peer IP and replace it with the new one. It's clearly a corner case though.
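To put Powell's peek/process/pop loop from a few messages up into more concrete terms, here is a sketch only: DistributedLock, WorkQueue and Worker are made-up stand-ins for whatever lock and queue recipes are actually used. The item is popped only after triggerWork reports done, so a consumer crash in between means another consumer re-processes the same item, which is exactly why triggerWork has to be idempotent.

    // Sketch of the peek/process/pop consumer loop under a distributed lock.
    public class QueueConsumer {

        public interface DistributedLock extends AutoCloseable {
            @Override void close();                 // releases the lock
        }
        public interface WorkQueue {
            byte[] peek();                          // null when empty
            void pop();                             // removes the head
        }
        public interface Worker {
            boolean triggerWork(byte[] item);       // must be idempotent
        }

        private final WorkQueue queue;
        private final Worker worker;
        private final java.util.function.Supplier<DistributedLock> lockFactory;

        public QueueConsumer(WorkQueue queue, Worker worker,
                             java.util.function.Supplier<DistributedLock> lockFactory) {
            this.queue = queue;
            this.worker = worker;
            this.lockFactory = lockFactory;
        }

        public void drainOnce() {
            try (DistributedLock lock = lockFactory.get()) {   // acquire the distributed lock
                byte[] head = queue.peek();
                if (head != null && worker.triggerWork(head)) {
                    queue.pop();                    // a crash before this line causes a re-run,
                }                                   // hence triggerWork must be idempotent
            }
        }
    }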
>> >> Here is the approach I'm thinking of:
>> >> - Implement all operations (at least master-triggered operations) on operand machines idempotently.
>> >> - Have the master journal these operations to ZK before issuing the RPC.
>> >> - In case the master fails with some of these operations in flight, the newly elected master will need to read all issued (but not yet retired) operations and issue them again.
>> >> - The existing master (before or after failure) can retry and retire operations according to whatever the retry policy and success criterion is.
>> >>
>> >> Why am I thinking of this as opposed to going with Chubby sequencer passing:
>> >> - I need to implement idempotency regardless, because the recovery path involving master death after successful execution of an operation but before writing the ack to the coordination service requires it. So the idempotent-implementation complexity is here to stay.
>> >> - I would need to increase the surface area of the architecture which is exposed to the coordination service for sequencer validation, which may bring a verification RPC into the data plane in some cases.
>> >> - The sequencer may expire after verification but before the ack, in which case the new master may not recognize the operation as consistent with its decisions (or previous decision path).
>> >>
>> >> Thoughts? Suggestions?
>> >>
>> >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <[email protected]> wrote:
>> >> > regarding atomic multi-znode updates -- check out "multi" updates <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>.
>> >> >
>> >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <[email protected]> wrote:
>> >> >
>> >> >> for 1, see the chubby paper <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>, section 2.4.
>> >> >> for 2, I'm not sure I fully understand the question, but essentially, ZK guarantees that consistency of updates is preserved even during failures. The user doesn't need to do anything in particular to guarantee this, even during leader failures. In such a case, some suffix of the operations executed by the leader may be lost if they weren't previously acked by a majority. However, none of these operations could have been visible to reads.
>> >> >>
>> >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <[email protected]> wrote:
>> >> >>
>> >> >>> Hi Janmejay,
>> >> >>>
>> >> >>> Regarding question 1, if a node takes a lock and the lock has timed out from the system's perspective, then it can mean that other nodes are free to take the lock and work on the resource. Hence the history could be well into the future by the time the previous node discovers the time-out. The question of rollback in this specific context depends on the implementation details: is the lock holder updating some common area? Then there could be corruption, since other nodes are free to write in parallel to the first one. In the usual sense, a time-out of a held lock means the node which held the lock is dead. It is up to the implementation to ensure this; with that primitive, if there is a timeout, other nodes are sure that no one else is working on the resource and hence can move forward.
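Coming back to the journal-before-RPC approach a few messages up: with the plain ZooKeeper Java client it could look roughly like this. A sketch under assumptions only -- the /operations/<node> path layout, the op- prefix and the Rpc interface are made up for illustration. The master journals the operation as a persistent sequential znode before issuing the RPC, deletes it once it is retired, and a newly elected master simply re-issues whatever children are still present, in sequence order.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: journal an operation under <root>/op-<seq> before the RPC, retire
    // (delete) it after the RPC succeeds, and replay unretired ones on failover.
    public class OperationJournal {

        public interface Rpc { boolean issue(byte[] payload); }

        private final ZooKeeper zk;
        private final String root;   // e.g. "/operations/datanode-7" -- illustrative layout

        public OperationJournal(ZooKeeper zk, String root) {
            this.zk = zk;
            this.root = root;
        }

        /** Journal first, then issue; returns the journal path so it can be retired later. */
        public String journal(byte[] payload) throws KeeperException, InterruptedException {
            return zk.create(root + "/op-", payload,
                             ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        }

        /** Retire after the (idempotent) RPC has succeeded. */
        public void retire(String path) throws KeeperException, InterruptedException {
            zk.delete(path, -1);
        }

        /** A newly elected master re-issues everything that was journaled but never retired. */
        public void replay(Rpc rpc) throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren(root, false);
            Collections.sort(children);                    // op-0000000001, op-0000000002, ...
            for (String child : children) {                // strict sequence order, one at a time
                String path = root + "/" + child;
                byte[] payload = zk.getData(path, false, null);
                if (rpc.issue(payload)) {                  // safe because the operation is idempotent
                    retire(path);
                }
            }
        }
    }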
>> >> >>> Question 2 seems to imply the assumption that the leader has significant work to do and that leader change is quite common, which seems contrary to the common implementation pattern. If the work can be broken down into smaller chunks which need serialization separately, then each chunk/work type can have a different leader.
>> >> >>>
>> >> >>> For question 3, ZK does support auth and encryption for client connections but not for inter-ZK-node channels. Do you have a requirement to secure inter-ZK-node communication? Can you let us know what your requirements are, so we can implement a solution to fit all needs?
>> >> >>>
>> >> >>> For question 4, the official implementation is C; people tend to wrap that with C++, and there should be projects that use ZK doing that. You can look them up and see if you can separate it out and use them.
>> >> >>>
>> >> >>> Hope this helps. Powell.
>> >> >>>
>> >> >>> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <[email protected]> wrote:
>> >> >>>
>> >> >>> Q: What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires.
>> >> >>>
>> >> >>> If you are using Java, a way I can see doing this is by using the ExecutorCompletionService <https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html>. It allows you to keep your workers in a group. You can poll the group and provide cancel semantics as needed.
>> >> >>>
>> >> >>> An example of that service is here: <https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java>, where I am issuing multiple reads and I want to abandon the process if they do not finish in a while. Many async/promise frameworks do this by launching two tasks, a ComputationTask and a TimeoutTask that returns in 10 seconds. Then they ask the completion service to poll. If the service is given the TimeoutTask after the timeout, that means the computation did not finish in time.
>> >> >>>
>> >> >>> Q: Do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)?
>> >> >>>
>> >> >>> The base issue is that Java's synchronized/AtomicReference do not have a rollback.
>> >> >>>
>> >> >>> There are a few ways I know to work around this. Clojure has STM (software transactional memory), such that if an exception is thrown inside a dosync, all of the steps inside the critical block never happened. This assumes you're using all Clojure structures, which you are probably not.
>> >> >>>
>> >> >>> A way co-workers have done this is as follows. Move your entire transactional state into a SINGLE big object that you can copy/mutate/compare-and-swap. You never need to roll back each piece, because you're changing the clone up until the point you commit it.
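A minimal sketch of that copy/mutate/compare-and-swap pattern (AppState and its fields are made up for illustration): all the state that the critical section may touch lives behind a single AtomicReference, the mutation happens on a copy, and compareAndSet either publishes the whole thing or nothing, so there is nothing to roll back if the computation is abandoned mid-way.

    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of the copy/mutate/compare-and-swap pattern: all state lives in one
    // immutable object, so there is nothing to undo if we bail out half-way through.
    public class BigObjectState {

        /** Immutable snapshot of everything the critical section may touch. */
        public static final class AppState {
            final long version;
            final String owner;
            AppState(long version, String owner) { this.version = version; this.owner = owner; }
        }

        private final AtomicReference<AppState> state =
                new AtomicReference<>(new AppState(0, "nobody"));

        /** Returns true if the whole update was committed, false if someone else won the race. */
        public boolean tryTakeOwnership(String newOwner) {
            AppState current = state.get();                       // read the snapshot
            AppState updated = new AppState(current.version + 1,  // mutate a copy
                                            newOwner);
            return state.compareAndSet(current, updated);         // commit atomically, or not at all
        }
    }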
>> >> >>> Writing reversal code is possible depending on the problem. There are questions to ask, like "what if the reversal somehow fails?"
>> >> >>>
>> >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <[email protected]> wrote:
>> >> >>>
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > Was wondering if there are any reference designs or patterns on handling common operations involving distributed coordination.
>> >> >>> >
>> >> >>> > I have a few questions, and I guess they must have been asked before; I am unsure what to search for to surface the right answers. It'll be really valuable if someone can provide links to a relevant "best-practices guide" or "suggestions" per question, or share some wisdom or ideas on patterns to do this in the best way.
>> >> >>> >
>> >> >>> > 1. What is the best way of handling distributed-lock expiry? The owner of the lock managed to acquire it and may be in the middle of some computation when the session expires or the lock expires. When it finishes that computation, it can tell that the lock expired, but do people generally take action in the middle of the computation (abort it and do it in a clever way such that the effect appears atomic, so the abort is not really visible; if so, what are some of those clever ways)? Or is the right thing to do to write reversal code, such that operations can be cleanly undone in case the verification at the end of the computation shows that the lock expired? The latter is obviously a lot harder to achieve.
>> >> >>> >
>> >> >>> > 2. Same as above for leader-election scenarios. The leader generally administers operations on data-systems that take significant time to complete and have significant resource overhead, and RPC to administer such operations synchronously from leader to data-node can't be atomic and can't be made latency-resilient to such a degree that issuing an operation across a large set of nodes in a cluster can be guaranteed to finish without a leader change. What do people generally do in such situations? How are timeouts handled for operations issued using sequential znodes as a per-datanode dedicated queue? How well does it scale, and what are some things to watch out for (operation size, encoding, clustering into one znode for atomicity etc.)? Or how are atomic operations that need to be issued across multiple data-nodes managed (do they have to be clobbered into one znode)?
>> >> >>> >
>> >> >>> > 3. How do people secure ZooKeeper-based services? Is client-certificate verification the recommended way? How well does this work with the C client? Is inter-zk-node communication done with X509 auth too?
>> >> >>> >
>> >> >>> > 4. What other projects, reference implementations or libraries should I look at for working with the C client?
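For question 1, Edward's ExecutorCompletionService suggestion further up translates roughly into the following sketch. The deadline here stands in for whatever remaining lock/session validity the lock recipe exposes, and cancellation is simplified to shutting down the pool; none of this is a prescribed recipe, just an illustration of the poll-with-deadline idea.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch: run the lock-guarded computation on an executor and abandon it if it
    // does not complete before the deadline (e.g. the remaining lock validity).
    public class DeadlineGuard {

        public static <T> T runWithDeadline(Callable<T> computation, long deadlineMillis)
                throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            try {
                ExecutorCompletionService<T> ecs = new ExecutorCompletionService<>(pool);
                ecs.submit(computation);
                Future<T> done = ecs.poll(deadlineMillis, TimeUnit.MILLISECONDS);
                if (done == null) {
                    // Deadline passed before the computation finished: treat as lock expiry.
                    throw new TimeoutException("lock validity expired mid-computation");
                }
                return done.get();
            } finally {
                pool.shutdownNow();   // best-effort cancellation of a still-running computation
            }
        }
    }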
>> >> >>> > Most of what I have asked revolves around the leader or lock-owner having a false-failure (where it doesn't know that the coordinator thinks it has failed).
>> >> >>> >
>> >> >>> > --
>> >> >>> > Regards,
>> >> >>> > Janmejay
>> >> >>> > http://codehunk.wordpress.com
>> >>
>> >> --
>> >> Regards,
>> >> Janmejay
>> >> http://codehunk.wordpress.com
>>
>> --
>> Regards,
>> Janmejay
>> http://codehunk.wordpress.com

--
Regards,
Janmejay
http://codehunk.wordpress.com
