Ben and I were discussing this thread, so we ended up writing a reply together.
This is an interesting discussion indeed, and we appreciate your interest in
learning more about the implementation of zookeeper. Here is another attempt to
explain you the differences between our algorithm and Paxos.
The Paxos Multi-Decree protocol basically consists of running parallel
instances of the Synod protocol (you probably know this one, but just for
completeness the paper is called "The Part-time parliament", and can be found
on Lamport's web page). The original Synod protocol, which is what we are
calling Paxos, proceeds in three phases, just like a three-phase commit
protocol. Our protocol instead, has only two phases, just like a two-phase
commit protocol. Of course, for Paxos, we can ignore the first phase in runs in
which we have a single proposer as we can run phase 1 for multiple instances at
a time, which is what Ben called previously Multi-Paxos, I believe. The trick
with skipping phase 1 is to deal with leader switching.
However, if we have a run with multiple proposers, operating simultaneously or
not, then we have to run phase 1 at least for the instances that haven't been
committed. The ZooKeeper protocol does not. The reason why we don't is
twofold. First, we assume FIFO channels. (FIFO meaning if a packet is received
from the channel all previously sent messages will have been delivered. If a
packet is lost in the channel, all subsequent packets will be lost. TCP is a
FIFO channel.) Paxos doesn't assume such a channel, and it is a rather
practical assumption that simplifies the protocol a lot. Second, there can be
at most one leader (proposer) at any time, and we guarantee this by making sure
that a quorum of replicas recognize the leader as a leader by committing to an
epoch change. This change in epoch also allows us to get unique zxids since the
epoch forms part of the zxid. Followers (they both acceptors and learners in
the Paxos terminology) have a FIFO
channel to a single leader, so that can only be a single active leader.
As a result, we can skip the phase 1 of Paxos completely, and also during
recovery we can skip all the uncommitted zxids of the epoch of the previous
leader. Since messages can be received out of order and even lost with Paxos,
it is possible to have gaps in the sequence of instances, and these instances
have to be decided when a new proposer arises. The conclusion is that by making
stronger assumptions for the system, we are able to use a simpler algorithm
that works truly in two phases.
One difference we find interesting is that Paxos embeds recovery into the
protocol. According to the algorithm, a new proposer just has to start one
phase 1 for each instance that it believes hasn't been committed yet. If such
an instance has been committed, then there is no problem as the value can't
change once it is committed. With the ZooKeeper protocol, we have to run an
auxiliary protocol to make sure that new leaders are up to date with respect to
operations that have been committed, but because of the FIFO assumption, we
know that the replica with the latest transaction id has the latest committed
state. Again, strengthening the assumptions for the system enable a simpler
solution.
Oh and don't get distracted by the leader election algorithm. Our protocol
assumes there is one, but it's not part of the broadcast protocol. The leader
election algorithm can easily change. There are actually two different ones in
the sources, and one of them doesn't even use notifications.
-Flavio and Ben
----- Original Message ----
From: Evan Jones <[EMAIL PROTECTED]>
To: Benjamin Reed <[EMAIL PROTECTED]>
Cc: zookeeper-user@lists.sourceforge.net
Sent: Tuesday, June 10, 2008 2:50:26 AM
Subject: Re: [Zookeeper-user] Does ZooKeeper includes Paxos ?
This is a pretty academic debate, but I'm sorry: I'm a graduate
student, I can't resist. Also: In case it isn't clear, I don't intend
this as a criticism of Zookeeper: I think it is a great, and am likely
to use it for a research project I am working on at the moment. I just
want to make sure I understand its implementation.
Summary: I am still unconvinced. I still think the Zookeeper protocol
is basically Paxos.
On Jun 7, 2008, at 2:33 , Benjamin Reed wrote:
> There are some basic elements that all atomic broadcast protocols
> have, but I don't think that makes them all the same.
Fair enough: I have had this same debate with others, so perhaps I
hold a minority opinion: All the consensus protocols tend to have the
same 2 rounds in order to reach a decision, in the failure recovery
case, and 1 round in the failure free case. They also tend to have the
same need for sequence numbers in the messages, and voting, etc. While
some of the details differ, the gist of all them seems pretty similar
to me. Hence, I consider all of them roughly equivalent.
Part of the confusion may be that "Paxos" as defined by Lamport isn't
really useful by itself: it is a protocol for deciding a single value.
You need to add a bunch of tweaks to the base algorithm to produce a
useful state machine replication system. These tweaks can all be done
in slightly different ways.
> A big difference between us and Paxos is that we never do phase1.
By Phase 1 you mean the prepare/ack round in Paxos? Except the
Zookeeper protocol *does* do this: They are equivalent to the
"notification" messages sent by the Leader election algorithm. The
notification messages and the acknowledgements contain the same
information as would be exchanged by Paxos phase 1 messages.
> Now you could say that the Propose/ACK phase is straight from Paxos,
> but is is also part of classic 2 phase commit or pretty much any
> other such protocol.
Right: Any distributed consensus protocol will use voting like this.
As you say, a Multi-Paxos implementation will have the exact same
number of messages as Zookeeper does, with almost exactly the same
message contents.
> Looking at ZooKeeper messaging will not give you insight into Paxos
> unfortunately. We take advantage of FIFO channels (TCP), which
> allows us to assume ordering;
How do you handle the cases where a TCP connection breaks and is then
re-established? This can violate the FIFO assumption. Is there code to
handle this case that I can look at?
> we also make sure that there will never be a two messages with the
> same zxid, which is why we do not need something like phase1 of Paxos;
Paxos makes the same assumption: there cannot be two messages with the
same proposal number.
> We think it is a cool protocol, and it's something that is really
> quite intuitive and easy to explain. (We have been pulling a paper
> together on it. I'll put up a presentation we give on the ZooKeeper
> site.) A cursory glance may make you think Paxos, but it really isn't.
I look forward to reading a description of the protocol. Reverse
engineering the protocol from the source code may not be the most
reliable way to figure it out. I *think* I know how Zookeeper's
algorithm works, but I could easily be mistaken in some edge cases.
Evan
--
Evan Jones
http://evanjones.ca/
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Zookeeper-user mailing list
Zookeeper-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/zookeeper-user
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Zookeeper-user mailing list
Zookeeper-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/zookeeper-user