Dynamic server addition/deletion

2009-05-01 Thread rag...@yahoo.com

Our product would require support for dynamically adding ZK servers to and removing them from the cluster. We would like to come up with a design, propose the design to the ZK developers, and then implement the feature once the design is signed off. Before we go down that path, I would like to know if people already have any ideas on this that they could share with us. Also, we don't want to duplicate the effort, so we would appreciate it if you let us know whether anyone is already working on a design proposal for this feature.

Thanks
Raghu





Server-client connection timeout

2009-04-21 Thread rag...@yahoo.com

I have a question related to how the ZK server deals with client timeouts. If the client loses connectivity with the ZK server (which is still alive), then the ZK server will close the client session by issuing a closeSession transaction, correct? So even if the client has re-established the session by connecting to another server by now, the closeSession transaction will force the session to be deleted on all servers. The client will have to reconnect to one of the servers again and create a brand new session, right?

Could you please clarify if the above is correct?
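
If that is indeed what happens, then the application has to treat the old handle as dead once the session has expired and open a completely new session. Here is a minimal sketch of what I mean (illustrative only; the connect string and timeout are made-up examples):

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ExpiryHandlingSketch implements Watcher {
    // made-up connect string and session timeout, purely for illustration
    private static final String CONNECT = "zk1:2181,zk2:2181,zk3:2181";
    private volatile ZooKeeper zk;

    public ExpiryHandlingSketch() throws IOException {
        zk = new ZooKeeper(CONNECT, 5000, this);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            try {
                // the expired handle cannot be revived; start over with a brand new session
                zk = new ZooKeeper(CONNECT, 5000, this);
            } catch (IOException e) {
                // back off and retry according to application policy
            }
        }
    }
}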

Thanks
Raghu



   


FastLeaderElection

2009-04-10 Thread rag...@yahoo.com

Hi,

Could someone please explain quickly why a logical clock is used in 
FastLeaderElection? It looks to me like the peers can converge on a leader (the 
one with the highest zxid, or the highest server id if the zxids are the same) 
even without the logical clock. Maybe I am missing something here; I could not 
figure out why the logical clock is needed.
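
To make the question concrete, here is how I currently read the logical clock: it looks like an election-round counter, so that votes left over from an earlier, abandoned round are never counted toward a quorum in the current one. A rough sketch of that idea (field and method names are made up; this is not the actual FastLeaderElection code):

import java.util.HashMap;
import java.util.Map;

public class ElectionRoundSketch {
    private long logicalClock = 0;                                  // this peer's current election round
    private final Map<Long, Long> votesByServer = new HashMap<>();  // server id -> proposed leader

    void onNotification(long senderSid, long senderRound, long proposedLeader) {
        if (senderRound < logicalClock) {
            return;                                // stale vote from an old round: ignore it
        }
        if (senderRound > logicalClock) {
            logicalClock = senderRound;            // someone has started a newer round: join it
            votesByServer.clear();                 // and forget the votes collected in the old round
        }
        votesByServer.put(senderSid, proposedLeader);  // count the vote toward the current round
    }
}

Even with this reading I could not figure out why convergence would fail without it, hence the question.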

Thanks
Raghu






Re: Divergence in ZK transaction logs in some corner cases?

2009-03-30 Thread rag...@yahoo.com

Ben,

Thanks a lot for explaining this.

I have one more corner case in mind where the transaction logs could diverge. I 
might be wrong this time as well, but would like to understand how it works. 
Reading the Leader.lead() code, it seems like the new leader reads the last 
logged zxid and bumps up the higher 32 bits while resetting the lower 32 bits. 
So this means that cascading leader crashes without a PROPOSAL in between would 
make the new leader choose the same zxid as the previous one. This could lead to 
a corner case like the one below:

In an ensemble of 5 servers (A, B, C, D and E), say the zxid is 1,10 (higher 32 
bits, lower 32 bits) with A as the leader. Now the following events happen:

1. A crashes.
2. B is elected the leader. So the zxid of the ensemble moves to 2,0. If I read 
the code correctly, no one logs the new zxid until a new PROPOSAL is made. Now 
B starts a new PROPOSAL (2,1), B logs the PROPOSAL and moves to zxid (2,1).
3. B crashes before anyone else receives the PROPOSAL.
4. C is elected as the leader. Since the new zxid depends on the last logged 
zxid (which is still 1,10 according to C's log), the new zxid chosen by C is 
2,0 as well.
5. Now C starts a new PROPOSAL (2,1), C logs the PROPOSAL and crashes before 
anyone else has received the PROPOSAL. We have diverged logs in B and C with 
the same zxid (2,1).
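
To make the arithmetic concrete, here is a small sketch of the zxid bump as I have described it (illustrative only, not the actual Leader.lead() code), showing how B and C would both end up logging a proposal with zxid (2,1):

public class ZxidBumpSketch {
    // (epoch, counter) packing as above: epoch in the high 32 bits, counter in the low 32 bits
    static long makeZxid(long epoch, long counter) { return (epoch << 32) | counter; }
    static long epochOf(long zxid) { return zxid >>> 32; }

    // new leader: read the last *logged* zxid, bump the epoch, reset the counter
    static long newLeaderStartZxid(long lastLoggedZxid) {
        return makeZxid(epochOf(lastLoggedZxid) + 1, 0);
    }

    public static void main(String[] args) {
        long lastLogged = makeZxid(1, 10);                     // every log still ends at (1,10)
        long bProposal = newLeaderStartZxid(lastLogged) + 1;   // B starts at (2,0), logs proposal (2,1), then crashes
        long cProposal = newLeaderStartZxid(lastLogged) + 1;   // C also sees (1,10) as last logged, so it too logs (2,1)
        System.out.println(bProposal == cProposal);            // true: the same zxid in two divergent logs
    }
}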

Could you tell me if this is correct?

Thanks
Raghu





----- Original Message -----
From: Benjamin Reed <br...@yahoo-inc.com>
To: zookeeper-user@hadoop.apache.org
Sent: Saturday, 28 March, 2009 10:49:32
Subject: Re: Divergence in ZK transaction logs in some corner cases?

if recovery worked the way you outline, we would indeed have a problem. 
fortunately, we specifically address this case.

the problem is in your first step. when b is elected leader, he will not 
propose 10; he will propose a zxid from a new epoch. the zxid is made up of two 
parts: the high order bits are an epoch number and the low order bits are a 
counter. whenever a new leader is elected, he will increment the epoch number 
and reset the counter.

when A restarts you have the opposite problem: you need to make sure that A 
forgets 10, because we have skipped it and committing it would mean that 10 is 
delivered out of order. we take advantage of the epoch number in that case as 
well to make sure that A forgets about 10.

there is some discussion about this in: 
http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperInternals.html#sc_atomicBroadcast

we have a presentation as well that i'll put up that may make it more clear.

ben

rag...@yahoo.com wrote:
 ZK gurus,
 
 I think the ZK transaction logs can diverge from one another in some corner 
 cases. I have one such corner case listed below, could you please confirm if 
 my understanding is correct?
 
 Imagine a 5 server ensemble (A,B,C,D,E). All the servers are @ zxid 9. A is
 the leader and it starts a new PROPOSAL (@zxid 10). A writes the proposal to 
 the log, so A moves to zxid 10. Others haven't received the PROPOSAL yet and 
 A crashes. Now the following happens:
 
 1. B is elected as the new leader. B bumps up its in-mem zxid to 10. Since
 other nodes are at the same zxid, it sends a SNAP so that the others can 
 rebuild their data tree. In-memory zxid of all other nodes moves to 10.  
 2. A comes back now; it accepts B as the leader as soon as B and N/2 other 
 nodes vouch for it. So A joins the ensemble. Every
 zookeeper node is at zxid 10.
 
 3. A new request is submitted to B. B runs PROPOSAL and COMMIT phases and the 
 cluster moves up to zxid 11. But the transaction log of A is different from 
 that of everyone else now. So the transaction logs have diverged.
 
 Could you confirm if this can happen? Or am I reading the code wrong?
 
 Thanks
 Raghu
 
 






Divergence in ZK transaction logs in some corner cases?

2009-03-27 Thread rag...@yahoo.com

ZK gurus,

I think the ZK transaction logs can diverge from one another in some corner 
cases. I have one such corner case listed below, could you please confirm if my 
understanding is correct?

Imagine a 5 server ensemble (A,B,C,D,E). All the servers are @ zxid 9. A is the
leader and it starts a new PROPOSAL (@zxid 10). A writes the proposal to the 
log, so A moves to zxid 10. Others haven't received the PROPOSAL yet and A 
crashes. Now the following happens:

1. B is elected as the new leader. B bumps up its in-mem zxid to 10. Since other
nodes are at the same zxid, it sends a SNAP so that the others can rebuild 
their data tree. In-memory zxid of all other nodes moves to 10.  

2. A comes back now; it accepts B as the leader as soon as B and N/2 other 
nodes vouch for it. So A joins the ensemble. Every
zookeeper node is at zxid 10.

3. A new request is submitted to B. B runs PROPOSAL and COMMIT phases and the 
cluster moves up to zxid 11. But the transaction log of A is different from 
that of everyone else now. So the transaction logs have diverged.
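
Just to spell out what I mean by diverged, the transaction histories after step 3 would look like this (illustrative only, using the plain zxids from the scenario above):

public class DivergedLogsSketch {
    public static void main(String[] args) {
        long[] logA      = {9, 10, 11};  // A kept the orphaned proposal 10 it logged before crashing
        long[] logOthers = {9, 11};      // B, C, D and E never received 10
        // A's log now contains a transaction that no quorum ever agreed on,
        // which is the divergence I am asking about.
        System.out.println(java.util.Arrays.equals(logA, logOthers));  // false
    }
}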

Could you confirm if this can happen? Or am I reading the code wrong?

Thanks
Raghu





Incorrect implementation of QuorumCnxManager.haveDelivered()?

2009-03-25 Thread rag...@yahoo.com

Hello,

I am a ZooKeeper newbie, so pardon me if I am repeating questions that have 
been raised before.

I believe the implementation of QuorumCnxManager.haveDelivered() is incorrect. 
If I understand correctly, queueSendMap contains a queue of messages for each 
peer to which the local peer is trying to send election messages. When 
FastLeaderElection notices a timeout while polling for inbound messages, it 
checks to see if all the messages have been delivered by calling this function. 
So shouldn't this function actually check each queue in the hash map and return 
true only if all of them are empty? As written, the method returns true if just 
one of the queues is empty.

/**
 * Check if all queues are empty, indicating that all messages have been
 * delivered.
 */
boolean haveDelivered() {
    for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
        LOG.debug("Queue size: " + queue.size());
        if (queue.size() == 0)
            return true;
    }

    return false;
}
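
For comparison, here is a sketch of the check I would have expected (illustrative only, reusing the same queueSendMap field as above): report delivered only when every per-peer send queue has been drained.

boolean allQueuesDrained() {
    for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
        if (!queue.isEmpty()) {
            return false;    // at least one peer still has undelivered election messages
        }
    }
    return true;
}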

Also, could someone explain the reason behind maintaining a queue of messages 
for each peer in queueSendMap? Why do we need a per-peer queue here? Since this 
is used during election, the local peer is not sending more than one message at 
a time to a remote peer, so wouldn't it be enough for the hash map to store just 
one message per remote peer?

-Raghu






Dynamic addition of servers to Zookeeper cluster

2009-03-13 Thread rag...@yahoo.com

ZooKeeper gurus,

Can I add servers dynamically to a ZooKeeper cluster? If I understand ZooKeeper 
clusters correctly, each server has to know about all the other servers in the 
cluster at server start-up. Does this mean that the cluster membership is static 
once the cluster is running, and that a new server can be added only by bringing 
down the cluster and restarting each server with the new server included in its 
configuration file?
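
For reference, this is the kind of static membership I mean: as far as I can tell, every participant lists the full ensemble in its own configuration file, so adding a server means editing every config and restarting. The hostnames below are made up:

# zoo.cfg on every server (hostnames are illustrative)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
# the full ensemble is listed on every member; adding a server.4 later
# means editing this file on each server and restarting
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888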

Thanks
Raghu