Sinfonia is pretty cool, but the commit mechanism is not simple and the
ordering guarantees are different. i think we can do it more simply in
zookeeper. basically, we would just need to be able to list a set of
operations in a single zxid rather than just one operation. in some
sense we do do this a little bit: close session is an atomic transaction
that deletes a bunch of ephemeral nodes.
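to make that concrete, here is a toy sketch of committing a batch of sub-operations under a single zxid, all-or-nothing (names and structure are illustrative only, not the real server code):

```python
# Hypothetical sketch: a batch of sub-operations applied atomically to a toy
# in-memory tree, with the whole batch becoming visible at one new zxid.
# This is illustrative pseudocode, not ZooKeeper's actual data tree.

def apply_batch(tree, zxid, ops):
    """Apply a list of (kind, path, value) ops atomically.

    Returns the new zxid on success; raises on any failure, leaving
    `tree` untouched (the batch is staged on a copy first).
    """
    staged = dict(tree)
    for kind, path, value in ops:
        if kind == "create":
            if path in staged:
                raise KeyError("node exists: " + path)
            staged[path] = value
        elif kind == "delete":
            if path not in staged:
                raise KeyError("no such node: " + path)
            del staged[path]
        else:
            raise ValueError("unknown op: " + kind)
    tree.clear()
    tree.update(staged)  # whole batch becomes visible at a single zxid
    return zxid + 1
```

close session would then just be one such batch: a delete for every ephemeral node the session owned.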
to be honest, here are the reservations i have:
1) ted's "non-blocking" observation is very good. right now we do
throttling and balancing to give consistent response time for users, and
it works pretty well because all operations are more or less equivalent.
if you can make a compound operation out of multiple sub-operations
(especially ones that are far from equivalent in cost), this may no longer
be the case, and for practical purposes you get blocking.
2) returning errors starts getting a bit funky. what happens if some of
the operations fail? like a create or a conditional set. not a big problem
to decide and implement, but i think it makes it harder to use. (there
will be use cases to motivate all sorts of different choices: abort the
whole thing on any failure, execute everything and return results, fail
everything after the first failure, etc.)
3) if we ever get to partitioned namespace, it will be very hard to do
transactions across partitions. we will probably relax ordering across
partitions, so you could argue that we wouldn't support transactions
either, but then the question comes back to how do you reflect this back
to the user?
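to illustrate point 2: here is a toy sketch (hypothetical names, not a real API) of how the three error-handling choices give a client different answers for the same compound operation:

```python
# Illustrative sketch of the three error-handling policies for a compound
# operation. Each sub-op is a callable that returns a result or raises.

def run_compound(ops, policy):
    """policy is one of: "abort-all", "execute-all", "fail-rest"."""
    results = []
    failed = False
    for op in ops:
        if failed and policy == "fail-rest":
            results.append("SKIPPED")  # everything after first failure fails
            continue
        try:
            results.append(op())
        except Exception as e:
            if policy == "abort-all":
                return None  # whole compound aborted, no partial results
            results.append("ERR:" + str(e))
            failed = True
    return results
```

the same input gives three different shapes of answer, which is exactly what makes the API harder to use than a single operation.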
we may want to broach this in the future, but i would rather
get things like ZOOKEEPER-22 in before we complicate things.
On 03/30/2010 01:29 PM, Henry Robinson wrote:
[Moving to dev]
Although I'm in total agreement with the idea of "no complexity until it's
necessary" I don't see that there's a really strong technical reason not to
include this primitive. It's very similar to the multi-get style API that,
say, memcache gives you.
zoo_multi_test_and_set(List<int> versions, List<string> znodes, List<byte[]> values)
would be an example API, and seems to me like it could be implemented in the
same way as a single set_data call. I definitely don't support any kind of
multiple-call api (like transactions) because it doesn't fit with the
ZooKeeper one method call = 1 linearization point model. I really do
recommend the Sinfonia paper from SOSP '07 for those that haven't read it
(a nice implementation of these kinds of ideas).
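For illustration, a sketch of the test-then-set semantics such a call might have (a toy in-memory model with an invented helper name, not the real client or server code):

```python
# Hypothetical sketch of multi-test-and-set semantics: verify every expected
# version first, then apply every write, so the whole call acts as a single
# linearization point. `store` maps path -> (version, data).

def multi_test_and_set(store, versions, znodes, values):
    """Return True iff every znode matched its expected version and
    all writes were applied; on any mismatch nothing is changed."""
    # phase 1: test every expected version before touching anything
    for path, expected in zip(znodes, versions):
        if path not in store or store[path][0] != expected:
            return False
    # phase 2: apply all writes, bumping each version
    for path, value in zip(znodes, values):
        ver = store[path][0]
        store[path] = (ver + 1, value)
    return True
```

because operations are applied serially, the check-then-apply pair needs no extra locking; it is no harder to serialize than a single set_data.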
A supporting argument is this: if this *is* very hard to implement
currently, I think we could expend some effort to make it easier. Decoupling
operations on the data tree and voting for them further (and also decoupling
session management and data tree updates) would be a worthwhile cleanup for
3.4.0. It would be really cool to be able to put a different storage engine
behind ZK (I can think of many examples!) with a minimum of effort. At the
same time, there are some API calls that I might find useful (get minimum
sequential node, for example) whose prototyping and implementation would be
made easier by such a cleanup.
On 30 March 2010 13:00, Benjamin Reed<br...@yahoo-inc.com> wrote:
i agree with ted. i think he points out some disadvantages of trying to
do more. there is a slippery slope with these kinds of things. the
implementation is complicated enough even with the simple model that we use.
On 03/29/2010 08:34 PM, Ted Dunning wrote:
I perhaps should not have said power, except insofar as ZK's strengths are
in reliability which derives from simplicity.
There are essentially two common ways to implement multi-node update. The
first is the traditional db style, with begin-transaction paired with either
commit or a rollback after some number of updates. This is clearly
unacceptable in the ZK world if the updates are sent to the server because
there can be an indefinite delay between the begin and commit.
A second approach is to buffer all of the updates on the client side and
transmit them in a batch to the server to succeed or fail as a group. This
allows updates to be arbitrarily complex, which begins to eat away at the
"non-blocking" guarantee a bit.
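As a sketch of that second approach (names are illustrative only, not a proposed API): the client accumulates updates locally and ships them in one submit, so the server never holds an open transaction:

```python
# Illustrative sketch of client-side buffering: updates are queued locally
# and sent to the server in a single submit that succeeds or fails as a
# group. There is no begin/commit window open on the server.

class BatchClient:
    def __init__(self, server_apply):
        self._apply = server_apply  # server-side all-or-nothing entry point
        self._buf = []

    def create(self, path, data):
        self._buf.append(("create", path, data))

    def delete(self, path):
        self._buf.append(("delete", path, None))

    def submit(self):
        ops, self._buf = self._buf, []
        return self._apply(ops)  # one round trip for the whole group
```

the cost is exactly the one noted above: the buffered group can be arbitrarily large, so applying it can hold up other clients' operations.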
On Mon, Mar 29, 2010 at 8:08 PM, Henry Robinson<he...@cloudera.com> wrote:
Could you say a bit about how you feel ZK would sacrifice power and
reliability through multi-node updates? My view is that it wouldn't:
all operations are executed serially, so there's no concurrency to be lost
by allowing multi-updates, and there doesn't need to be a 'start / end'
transactional-style interface (which I do believe would be very bad).
I could see ZK implementing a Sinfonia-style batch operation API which makes
all-or-none updates. The reason I can see that it doesn't already allow this
is the avowed intent of the original ZK team to keep the API as simple as it
can reasonably be, and to not introduce complexity without need.