Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

singh.janmejay Thu, 31 Dec 2015 00:11:08 -0800

Hi,

I was wondering whether there are any reference designs or patterns for
handling common operations involving distributed coordination.


I have a few questions that I guess must have been asked before, but I
am unsure what to search for to surface the right answers. It would be
really valuable if someone could provide links to a relevant
"best-practices guide" or "suggestions" for each question, or share
some wisdom or ideas on patterns for doing this well.

1. What is the best way of handling distributed-lock expiry? The owner
of the lock managed to acquire it and may be in the middle of some
computation when the session or the lock expires. When it finishes
that computation, it can tell that the lock expired. But do people
generally take action in the middle of the computation (abort it, and
do so in a clever way such that the effect appears atomic and the
abort is not really visible; if so, what are some of those clever
ways)? Or is the right thing to do to write reversal code, such that
operations can be cleanly undone if the verification at the end of
the computation shows that the lock expired? The latter is obviously a
lot harder to achieve.

2. Same as above, for leader-election scenarios. A leader generally
administers operations on data systems that take significant time to
complete and have significant resource overhead. RPCs that administer
such operations synchronously from the leader to a data-node can't be
atomic, and can't be made latency-resilient to such a degree that
issuing an operation across a large set of nodes in a cluster is
guaranteed to finish without a leader change. What do people generally
do in such situations? How are timeouts for operations handled when
operations are issued using sequential znodes as a dedicated
per-data-node queue? How well does this scale, and what are some
things to watch out for (operation size, encoding, clustering into one
znode for atomicity, etc.)? And how are atomic operations that need to
be issued across multiple data-nodes managed (do they have to be
combined into one znode)?

3. How do people secure ZooKeeper-based services? Is client-certificate
verification the recommended way? How well does this work with the C
client? Is communication between ZooKeeper nodes done with X509 auth
too?

4. What other projects, reference implementations or libraries should
I look at for working with the C client?

Most of what I have asked revolves around the leader or lock owner
having a false failure (where it doesn't know that the coordinator
thinks it has failed).

-- 
Regards,
Janmejay
http://codehunk.wordpress.com
