Re: locking/leader election and dealing with session loss
I've seen 40s+. Also, if combined with a network partition, the gc pause only needs 1/3 of the session timeout for the same effect to occur.

On Thu, 16 Jul 2015 15:58 Camille Fournier cami...@apache.org wrote: They can and have happened in prod to people. I started talking about it after hearing enough people complain about just this situation on twitter. If you are relying on very large jvm memory footprints, a 30s gc pause can and should be expected. In general I think most people don't need to worry about this most of the time, but it's one of those things that happens and the developers are almost always shocked. I'm a fan of being clear about edge cases, even rare ones, so that devs can make the right tradeoffs for their env.

Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.
If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee because the zk client doesn't have control over all the things that could go wrong. -Ivan
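Ivan's "1/3 of session timeout" point can be illustrated with a toy timeline. This sketch assumes the commonly described ZooKeeper client behavior (the server expires a session one full session timeout after the last heartbeat it received; the client declares itself disconnected after hearing nothing for roughly 2/3 of the timeout); the variable names are illustrative, not from any ZooKeeper API.

```python
# Toy timeline (units: seconds). A partition starts right after t=0,
# so the server never hears from the client again.
SESSION_TIMEOUT = 30.0

last_heartbeat_server_saw = 0.0
server_expires_at = last_heartbeat_server_saw + SESSION_TIMEOUT        # t=30

# The client only concludes it is disconnected after ~2/3 of the
# timeout has passed with no server contact.
client_notices_disconnect_at = SESSION_TIMEOUT * 2 / 3                 # t=20

# If a GC pause of just 1/3 of the timeout begins right then, the
# client cannot act on the disconnect until the pause ends...
gc_pause = SESSION_TIMEOUT / 3
client_reacts_at = client_notices_disconnect_at + gc_pause             # t=30

# ...by which time the server has already expired the session and a new
# leader may hold the lock: the safe reaction window has shrunk to zero.
unsafe_window = client_reacts_at - server_expires_at
print(server_expires_at, client_reacts_at, unsafe_window)  # → 30.0 30.0 0.0
```

So a pause far shorter than the full session timeout is enough, once a partition has already eaten the first 2/3 of the client's margin.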
Re: locking/leader election and dealing with session loss
In case there's still doubt around this issue, I've written a demo app that demonstrates the problem. https://github.com/ivankelly/hanging-chad -Ivan

On Wed, Jul 15, 2015 at 11:22 PM Alexander Shraer shra...@gmail.com wrote: I disagree, ZooKeeper itself actually doesn't rely on timing for safety - it won't get into an inconsistent state even if all timing assumptions fail (except for the sync operation, which is then not guaranteed to return the latest value, but that's a known issue that needs to be fixed).

On Wed, Jul 15, 2015 at 2:13 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: “This property may hold if you make a lot of timing/synchrony assumptions” These assumptions and timing are intrinsic to using ZooKeeper. So, of course I’m making these assumptions. -Jordan

On July 15, 2015 at 3:57:12 PM, Alexander Shraer (shra...@gmail.com) wrote: This property may hold if you make a lot of timing/synchrony assumptions -- agreeing on who holds the lock in an asynchronous distributed system with failures is impossible; this is the FLP impossibility result. But even if it holds, this property is not very useful if the ZK client itself doesn't have the application data. So one has to consider whether it is possible that the application sees messages from two clients that both think they are the leader, in an order which contradicts the lock acquisition order.

On Wed, Jul 15, 2015 at 1:26 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, at any snapshot in time “no two clients think they hold the same lock”. How your application acts on that fact is another thing. You might need sequence numbers, you might not.
-Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote: Jordan, as Camille suggested, please read Sec 2.4 in the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Ivan, I just read the blog and I still don’t see how this can happen. Sorry if I’m being dense. I’d appreciate a discussion on this. In your blog you state: “when ZooKeeper tells you that you are leader, there’s no guarantee that there isn’t another node that 'thinks' it's the leader.” However, given a long enough session timeout (I usually recommend 30–60 seconds), I don’t see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout. So, how then can there be two leaders? -Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote: I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a zk lock (HBase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.
-Ivan [1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215 [2] https://github.com/ivankelly/hbase-exclusive-writer On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee, because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30 seconds allowable pauses your session timeout would have to be much longer.

GC is just an example. There's other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
They can and have happened in prod to people. I started talking about it after hearing enough people complain about just this situation on twitter. If you are relying on very large jvm memory footprints, a 30s gc pause can and should be expected. In general I think most people don't need to worry about this most of the time, but it's one of those things that happens and the developers are almost always shocked. I'm a fan of being clear about edge cases, even rare ones, so that devs can make the right tradeoffs for their env.

Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee, because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
And a new Curator Tech Note to match: https://cwiki.apache.org/confluence/display/CURATOR/TN10 -JZ

On July 16, 2015 at 12:54:29 PM, Ivan Kelly (iv...@apache.org) wrote: I've seen 40s+. Also, if combined with a network partition, the gc pause only needs 1/3 of the session timeout for the same effect to occur.

On Thu, 16 Jul 2015 15:58 Camille Fournier cami...@apache.org wrote: They can and have happened in prod to people. I started talking about it after hearing enough people complain about just this situation on twitter. If you are relying on very large jvm memory footprints, a 30s gc pause can and should be expected. In general I think most people don't need to worry about this most of the time, but it's one of those things that happens and the developers are almost always shocked. I'm a fan of being clear about edge cases, even rare ones, so that devs can make the right tradeoffs for their env.

Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee, because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
On Thu, Jul 16, 2015 at 7:57 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: I think this conversation is converging. Summary: there is always an edge case where VM pauses might exceed your client heartbeat and cause a client misperception about its state for a short period of time once the VM unpauses. In practice, a tuned VM that has been running within known bounds for a reasonable period will not exhibit this behavior. The session timeout must match these known bounds in order to have consistent client state.

This is assuming that the conditions won't change, because if they do, the known bounds go out the window. Ultimately, if you rely on timeouts, you have to evaluate how much risk you are willing to take with your data and how much work you are willing to do to mitigate that risk. The risk goes away if the old leader cannot update any state once a new leader appears. For this you need some sort of fencing mechanism, such as sequence numbers. Personally, I would always go for the latter option. -Ivan
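The fencing mechanism Ivan mentions can be sketched with an in-memory stand-in for the lock service and the shared resource. The class names and the integer-token scheme here are illustrative, not from ZooKeeper or any particular store:

```python
# Minimal fencing sketch: the lock service hands out a monotonically
# increasing token with each lock acquisition, and the shared resource
# rejects any write carrying a token older than the newest one it has
# seen. Names (LockService, FencedResource) are hypothetical.

class LockService:
    def __init__(self):
        self._token = 0

    def acquire(self):
        """Grant the lock and return its fencing token."""
        self._token += 1
        return self._token

class FencedResource:
    def __init__(self):
        self._highest_seen = 0
        self.data = []

    def write(self, token, value):
        # Reject writes from a holder that has since been superseded.
        if token < self._highest_seen:
            return False
        self._highest_seen = token
        self.data.append(value)
        return True

locks, resource = LockService(), FencedResource()
t1 = locks.acquire()          # client A becomes leader with token 1
t2 = locks.acquire()          # A's session expires; B becomes leader, token 2
assert resource.write(t2, "from B")       # accepted
assert not resource.write(t1, "from A")   # A wakes from a GC pause: rejected
```

The point is exactly Ivan's: the timeout risk disappears only because the resource itself, not a client-side timer, enforces who may still write.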
Re: locking/leader election and dealing with session loss
Thanks for the quick response Camille. If client A owns the lock and gets disconnected due to a network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. The problem here is that the client cannot distinguish a network partition from the zookeeper ensemble being in a leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Jordan, I agree with your recommendation to guarantee a single lock owner/leader. I am trying to figure out how I can improve the availability of the leader in a specific corner case, i.e. when the zookeeper ensemble is undergoing leader election or is broken (this happens often due to software pushes, host replacement, etc.). When the ensemble is broken, it means that the entire application is DEAD (no leader for any shard/resource). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581281.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
You should not commit suicide unless it goes into the SESSION_EXPIRED state. The quorum won't delete the ephemeral node immediately; it will only delete it when the session expires. So if your client for whatever reason disconnects and reconnects before the session expires, it will be fine. C

On Wed, Jul 15, 2015 at 1:54 PM, Vikas Mehta vikasme...@gmail.com wrote: When using zookeeper for leader election or distributed locking, my assumption is that as soon as the lock owner sees the session transition to the 'CONNECTING' state, it should commit suicide to avoid the risk of multiple owners. Please correct my assumption if I am wrong or there is a better way to guarantee a single lock owner/leader.

If the above assumption is correct, I am trying to figure out how I can improve the availability of the application (leader/lock owner) when the zookeeper ensemble is broken (e.g. undergoing zookeeper leader election for a prolonged period of time). Options I have considered:
1/ use multiple ensembles for leader election/locking to avoid a SPOF (complex to implement)
2/ extend the zookeeper protocol to give the client more info on connection loss, e.g. "zookeeper leader election in progress", so that the client can decide when it is ok not to commit suicide and still guarantee a single application leader/lock owner (I haven't been able to prove that this will guarantee a single application leader/lock owner).

If this has been already answered or solved, please point me to the post/doc. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277.html Sent from the zookeeper-user mailing list archive at Nabble.com.
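Camille's distinction (pause on a disconnect, die only on SESSION_EXPIRED) can be sketched as a small state handler. The state names mirror the ZooKeeper client states being discussed, but the handler class itself is illustrative:

```python
# Illustrative handling of ZooKeeper client state changes: on a
# disconnect the lock holder must stop touching the shared resource but
# need not abandon the session; only SESSION_EXPIRED is terminal,
# because only then has the quorum deleted the ephemeral lock node.

class LockHolder:
    def __init__(self):
        self.may_work = False
        self.session_alive = True

    def on_state(self, state):
        if state == "CONNECTED":
            # The ephemeral node still exists: safe to resume work.
            if self.session_alive:
                self.may_work = True
        elif state == "DISCONNECTED":
            # Can't tell a partition from an ensemble outage: pause, don't die.
            self.may_work = False
        elif state == "SESSION_EXPIRED":
            # The ephemeral node is gone; another client may hold the lock.
            self.may_work = False
            self.session_alive = False

h = LockHolder()
h.on_state("CONNECTED");     assert h.may_work
h.on_state("DISCONNECTED");  assert not h.may_work
h.on_state("CONNECTED");     assert h.may_work      # reconnected in time
h.on_state("SESSION_EXPIRED")
h.on_state("CONNECTED");     assert not h.may_work  # must re-acquire first
```

This is the "reconnects before the session expires, it will be fine" case in code: a disconnect pauses work, and only expiry forces the client to start over.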
Re: locking/leader election and dealing with session loss
Once client A loses connection it must assume that it no longer has the lock (you could try to time the session but I think that’s a bad idea). Once you reconnect, you will know if your session is still active or not. When done correctly, there’s no chance that both A and B will think they own the lock at the same time. -Jordan On July 15, 2015 at 1:17:10 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Thanks for the quick response Camille. If client A owns the lock, gets disconnected due to network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. Problem here is that client cannot distinguish network partition from zookeeper ensemble in leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
To expand on my point: if you want to be able to continue to attempt to make progress when ZK is down, the act of getting a lock should also give the lock owner a sequence number that it can use to identify the period of operation it is in. Say you get sequence number 1 and tag all of your requests with 1. If for any reason you lose the lock and don't know it, and server #2 gets the lock, it should get sequence #2. The resource should then reject all requests with a sequence below 2, so if any remaining requests tagged 1 are lying around, they should be rejected by the resource. And there you have it: you can continue to make safe forward progress while in an uncertain state on the ZK side, so long as the original lock holder is available and the resource validates these things. If both ZK itself and the original lock holder go down, you're presumably still AWOL. C

On Wed, Jul 15, 2015 at 2:24 PM, Camille Fournier cami...@apache.org wrote: I thought that the client itself had a notion of the session timeout internally that would conservatively let the client know that it was dead? If not, then that's my faulty memory. That being said, if you really care about the client not sending messages when it does not have the lock, the resource under contention needs to validate the messages it is receiving. You cannot guarantee that, just because a client believes it is connected and sends a message to the locked resource, the message will be received while the sender still has the lock. If you don't care about this possibility, then just assuming you lose the lock when you are in any state other than connected is adequate; just be aware that events such as long GC pauses and network issues can cause you to access the resource improperly. C

On Wed, Jul 15, 2015 at 2:19 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Once client A loses connection it must assume that it no longer has the lock (you could try to time the session but I think that’s a bad idea). Once you reconnect, you will know if your session is still active or not. When done correctly, there’s no chance that both A and B will think they own the lock at the same time. -Jordan

On July 15, 2015 at 1:17:10 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Thanks for the quick response Camille. If client A owns the lock, gets disconnected due to network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. Problem here is that client cannot distinguish network partition from zookeeper ensemble in leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
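Camille's sequence-number scheme can be sketched end to end, including the late-arriving-request case she describes. The function and the request tags are illustrative, not any real store's API:

```python
# Sketch of the scheme above: every request is tagged with the sequence
# number its sender got when it acquired the lock, and the resource
# rejects any request tagged below the highest sequence it has seen.
# Even if a stale holder's request arrives late (delayed by a GC pause
# or the network), the tag lets the resource drop it.

def apply_requests(requests):
    """Apply (sequence, op) requests in arrival order; return accepted ops."""
    highest_seen = 0
    accepted = []
    for seq, op in requests:
        if seq < highest_seen:
            continue  # stale holder: fenced off by the resource
        highest_seen = seq
        accepted.append(op)
    return accepted

# Holder 1 sends two writes; the second is delayed and arrives only
# after holder 2 (the new lock owner) has already written.
arrival_order = [
    (1, "h1-write-a"),
    (2, "h2-write-a"),
    (1, "h1-write-b"),   # late request from the superseded holder
    (2, "h2-write-b"),
]
print(apply_requests(arrival_order))
# → ['h1-write-a', 'h2-write-a', 'h2-write-b']
```

Note that the validation lives at the resource, which is the whole argument: no client-side timer is consulted when the stale write is dropped.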
Re: locking/leader election and dealing with session loss
He’s talking about multiple writers. Given a reasonable session timeout, even a GC shouldn’t matter. If the GC causes a heartbeat miss, the client will get SysDisconnected. -Jordan

On July 15, 2015 at 2:05:41 PM, Camille Fournier (skami...@gmail.com) wrote: If client A does a full gc immediately before sending a message, one that is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers. Assuming a reasonable session timeout:

* Client A gets the lock
* Client B watches Client A’s lock node
* Client A gets a network partition
* Client A will get a SysDisconnected before the session times out
* Client A must immediately assume it no longer has the lock
* Client A’s session times out
* Client A’s ephemeral node is deleted
* Client B’s watch fires
* Client B takes the lock
* Client A reconnects and gets SESSION_EXPIRED

Where’s the problem? This is how everyone uses ZooKeeper. There is zero chance of multiple writers in this scenario.

On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee a single writer (if I had one, it would reduce the need/value of using zookeeper, though it would still be useful in reducing lock contention, etc.), and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve the CAP theorem) by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
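Jordan's step sequence is the standard ZooKeeper lock recipe: ephemeral sequential nodes under a lock path, lowest number wins, each contender watches its predecessor. Here is a minimal in-memory sketch of that bookkeeping, with no real ZooKeeper client involved:

```python
# In-memory sketch of the ephemeral-sequential lock recipe: each
# contender creates a numbered node; the lowest number holds the lock;
# when a session expires, its node is deleted and the next contender
# takes over. The LockDir class is a stand-in, not a ZK API.

class LockDir:
    def __init__(self):
        self._next = 0
        self.nodes = {}          # sequence number -> client (ephemeral nodes)

    def contend(self, client):
        """Create an ephemeral sequential node; return its sequence number."""
        seq = self._next
        self._next += 1
        self.nodes[seq] = client
        return seq

    def holder(self):
        """The client with the lowest surviving node holds the lock."""
        return self.nodes[min(self.nodes)] if self.nodes else None

    def expire(self, client):
        """Session expiry deletes that client's ephemeral nodes."""
        self.nodes = {s: c for s, c in self.nodes.items() if c != client}

d = LockDir()
d.contend("A")                  # A gets the lock
d.contend("B")                  # B queues behind A, watching A's node
assert d.holder() == "A"
d.expire("A")                   # A partitions; its session times out
assert d.holder() == "B"        # B's watch fires; B takes the lock
```

What the sketch cannot show, and what Camille's objection targets, is the gap between "A's node is deleted" and "A finds out": during that gap a paused A may still emit writes unless the resource verifies them.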
Re: locking/leader election and dealing with session loss
OK - then that’s my recommendation. Anything else is unsafe for reasons described by others. That’s always been my recommendation with Curator or any use of ZooKeeper. -Jordan On July 15, 2015 at 2:16:03 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Camille, I don't have a central message store/processor that can guarantee a single writer (if I had one, it would reduce the need/value of using zookeeper, though it would still be useful in reducing lock contention, etc.), and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve the CAP theorem) by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
If client a does a full gc immediately before sending a message that is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers. Assuming a reasonable session timeout:

* Client A gets the lock
* Client B watches Client A’s lock node
* Client A gets a network partition
* Client A will get a SysDisconnected before the session times out
* Client A must immediately assume it no longer has the lock
* Client A’s session times out
* Client A’s ephemeral node is deleted
* Client B’s watch fires
* Client B takes the lock
* Client A reconnects and gets SESSION_EXPIRED

Where’s the problem? This is how everyone uses ZooKeeper. There is 0 chance of multiple writers in this scenario.

On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee single writer (if I had one, it would reduce (still useful in reducing lock contention, etc) the need/value of using zookeeper) and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve CAP theorem), by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I didn’t mean to imply you don’t. But, I really need to understand this as it changes many assumptions I have. Can anyone explain? If not, I’ll wait for your explanation.

On July 15, 2015 at 2:23:33 PM, Camille Fournier (cami...@apache.org) wrote: I'm in transit so I don't have time to explain this in detail but I promise you I know what I'm talking about, so take some time to think through what I've said and work it out for yourself; it takes longer than five minutes to work through it. C

Even if messages go out of order, how does that change the algorithm? Can you explain? The session is maintained by means of a heartbeat. If a heartbeat is missed then the client goes to Disconnected. Message ordering in this case shouldn’t matter. Are you saying that message ordering can cause two clients to think they have the lowest sequence number? -Jordan

On July 15, 2015 at 2:12:26 PM, Camille Fournier (skami...@gmail.com) wrote: I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen. It's relatively unlikely and rare but not impossible. If you're interested in it I'd recommend reading the Chubby paper out of Google, where they discuss it in some detail. C

On Jul 15, 2015 3:09 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: He’s talking about multiple writers. Given a reasonable session timeout, even a GC shouldn’t matter. If the GC causes a heartbeat miss the client will get SysDisconnected. -Jordan

On July 15, 2015 at 2:05:41 PM, Camille Fournier (skami...@gmail.com) wrote: If client a does a full gc immediately before sending a message that is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers.
Assuming a reasonable session timeout: * Client A gets the lock * Client B watches Client A’s lock node * Client A gets a network partition * Client A will get a SysDisconnected before the session times out * Client A must immediately assume it no longer has the lock * Client A’s session times out * Client A’s ephemeral node is deleted * Client B’s watch fires * Client B takes the lock * Client A reconnects and gets SESSION_EXPIRED Where’s the problem? This is how everyone uses ZooKeeper. There is 0 chance of multiple writers in this scenario. On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee single writer (if I had one, it would reduce (still useful in reducing lock contention, etc) the need/value of using zookeeper) and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve CAP theorem), by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Ivan, I just read the blog and I still don’t see how this can happen. Sorry if I’m being dense. I’d appreciate a discussion on this. In your blog you state: “when ZooKeeper tells you that you are leader, there’s no guarantee that there isn’t another node that 'thinks' it's the leader.” However, given a long enough session timeout (I usually recommend 30–60 seconds), I don’t see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout. So, how then can there be two leaders? -Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote: I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a zk lock (HBase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like. -Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen. It's relatively unlikely and rare, but not impossible. If you're interested in it, I'd recommend reading the Chubby paper out of Google, where they discuss it in some detail. C

On Jul 15, 2015 3:09 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: He’s talking about multiple writers. Given a reasonable session timeout, even a GC pause shouldn’t matter. If the GC causes a heartbeat miss, the client will get SysDisconnected. -Jordan

On July 15, 2015 at 2:05:41 PM, Camille Fournier (skami...@gmail.com) wrote: If client A does a full GC immediately before sending a message, and the GC is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers.
Re: locking/leader election and dealing with session loss
I'm in transit so I don't have time to explain this in detail, but I promise you I know what I'm talking about. Take some time to think through what I've said and work it out for yourself; it takes longer than five minutes to work through. C

Even if messages go out of order, how does that change the algorithm? Can you explain? The session is maintained by means of a heartbeat. If a heartbeat is missed, the client goes to Disconnected. Message ordering in this case shouldn’t matter. Are you saying that message ordering can cause two clients to think they have the lowest sequence number? -Jordan

On July 15, 2015 at 2:12:26 PM, Camille Fournier (skami...@gmail.com) wrote: I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen.
Re: locking/leader election and dealing with session loss
I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like. -Ivan [1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215 [2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I don’t see how there’s a chance of multiple writers. Assuming a reasonable session timeout:

* Client A gets the lock
* Client B watches Client A’s lock node
* Client A gets a network partition
* Client A will get a SysDisconnected before the session times out
* Client A must immediately assume it no longer has the lock
* Client A’s session times out
* Client A’s ephemeral node is deleted
* Client B’s watch fires
* Client B takes the lock
* Client A reconnects and gets SESSION_EXPIRED

Where’s the problem? This is how everyone uses ZooKeeper. There is zero chance of multiple writers in this scenario.

On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee a single writer (if I had one, it would reduce the need/value of using ZooKeeper, though ZooKeeper would still be useful in reducing lock contention, etc.), and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve the CAP theorem) by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
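The client-side rule underlying the scenario above can be sketched as a tiny state machine: the lock is only usable while the session is in the connected state, and any departure from it means the lock must be assumed lost. This is a minimal simulation with invented names, not the ZooKeeper API.

```python
# Minimal sketch (invented names, not ZooKeeper API) of the client-side
# rule in the scenario above: treat the lock as lost the moment the
# session leaves the connected state, without waiting for SESSION_EXPIRED.

class LockHandle:
    """Tracks whether a ZK-style lock may still be acted upon."""

    def __init__(self):
        self.state = "CONNECTED"
        self.holds_lock = True

    def on_state_change(self, new_state):
        # Any departure from CONNECTED (disconnect, session loss, ...)
        # must be treated as losing the lock immediately.
        self.state = new_state
        if new_state != "CONNECTED":
            self.holds_lock = False

    def may_write(self):
        return self.state == "CONNECTED" and self.holds_lock


handle = LockHandle()
assert handle.may_write()

# Network partition: the client sees a disconnect before the session expires.
handle.on_state_change("DISCONNECTED")
assert not handle.may_write()

# Reconnecting later does not restore the lock; it must be re-acquired.
handle.on_state_change("CONNECTED")
assert not handle.may_write()
```

The catch the rest of the thread dwells on: `may_write()` is checked by the same process that can stall, so a write prepared before a pause can still leave the machine after the lock is gone.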
Re: locking/leader election and dealing with session loss
Jordan, We are using it like you describe and don't have a multi-writer problem. The issue is:

* The ZooKeeper ensemble goes into leader election
* All clients see their sessions transition to the CONNECTING state (SysDisconnected in your sequence)
* All clients give up all the locks they owned on the resources they were working with

The result is a GLOBAL outage of the application. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581287.html Sent from the zookeeper-user mailing list archive at Nabble.com.
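One possible mitigation for the global-outage problem described above (a sketch of a design choice, not something proposed in the thread) is to distinguish a brief disconnect from confirmed session loss: pause work on disconnect, but only release everything once the session is actually gone or a local, conservative deadline passes. The state names follow the SUSPENDED/LOST convention used by client libraries such as Curator and kazoo; everything else here is invented.

```python
# Sketch of one mitigation (a judgment call, not part of the thread's
# consensus): on SUSPENDED, pause work but keep the lock until a local,
# conservative session deadline passes; only on LOST (or deadline expiry)
# release everything. State names follow the Curator/kazoo convention.

import time

SESSION_TIMEOUT = 30.0

class Worker:
    def __init__(self, now=time.monotonic):
        self.now = now
        self.mode = "RUNNING"
        self.deadline = None  # local time after which the lock is assumed gone

    def on_connection_event(self, event):
        if event == "SUSPENDED":
            # Don't release yet: a brief ZK leader election would otherwise
            # make every client drop every lock at once.
            self.mode = "PAUSED"
            self.deadline = self.now() + SESSION_TIMEOUT
        elif event == "LOST":
            self.mode = "STOPPED"
            self.deadline = None
        elif event == "RECONNECTED":
            if self.mode == "PAUSED" and self.now() < self.deadline:
                self.mode = "RUNNING"   # session survived; resume
            else:
                self.mode = "STOPPED"   # too late; must re-acquire the lock

    def may_work(self):
        return self.mode == "RUNNING"


# Simulated timeline with a fake clock:
t = [0.0]
w = Worker(now=lambda: t[0])
w.on_connection_event("SUSPENDED")    # ensemble enters leader election
assert not w.may_work()               # paused, but lock not yet released
t[0] = 5.0
w.on_connection_event("RECONNECTED")  # election finished within the timeout
assert w.may_work()                   # resumes without a global outage
```

Because no writes happen while paused, safety is unchanged; what improves is availability across ZK interruptions shorter than the session timeout.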
Re: locking/leader election and dealing with session loss
+1 to what Camille is saying, and to her suggestion to use generations.

On Wed, Jul 15, 2015 at 12:04 PM, Camille Fournier skami...@gmail.com wrote: If client A does a full GC immediately before sending a message, and the GC is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C
Re: locking/leader election and dealing with session loss
Camille, Agreed, but fixing application-level bugs is a different set of problems. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581289.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Even if messages go out of order, how does that change the algorithm? Can you explain? The session is maintained by means of a heartbeat. If a heartbeat is missed, the client goes to Disconnected. Message ordering in this case shouldn’t matter. Are you saying that message ordering can cause two clients to think they have the lowest sequence number? -Jordan

On July 15, 2015 at 2:12:26 PM, Camille Fournier (skami...@gmail.com) wrote: I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen. It's relatively unlikely and rare, but not impossible. If you're interested in it, I'd recommend reading the Chubby paper out of Google, where they discuss it in some detail. C
Re: locking/leader election and dealing with session loss
Ivan, Thanks for the references. Great write-up describing the problem and the solution. I agree that for total ordering the storage layer should provide the guarantee. However, in my scenario I don't have a global/central storage system (I'm trying to build a globally distributed storage system, with servers in different continents). I am minimizing the chances of multiple writers with longer timeouts and eliminating downtime during planned maintenance, but I don't have a way to deal with the ZooKeeper ensemble being DEAD (or in leader election), which wipes out the entire application. Currently we twiddle our thumbs when ZooKeeper or the network has issues, and we would like to find a way to improve application availability during ZooKeeper service interruptions. Using multiple ensembles seems like a promising path, but I wanted to see if anyone has thought about extending the error handling between the ZooKeeper client/server to deal with this issue. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581299.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I thought that the client itself had a notion of the session timeout internally that would conservatively let it know that its session was dead? If not, then that's my faulty memory. That being said, if you really care about the client not sending messages when it does not have the lock, the resource under contention needs to validate the messages it is receiving. You cannot guarantee that just because a client believes it is connected and sends a message to the locked resource, the message will be received while the sender still has the lock. If you don't care about this possibility, then assuming you lose the lock when you are in any state other than connected is adequate; just be aware that events such as long GC pauses and network issues can cause you to access the resource improperly. C

On Wed, Jul 15, 2015 at 2:19 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Once client A loses its connection it must assume that it no longer has the lock (you could try to time the session locally, but I think that's a bad idea). Once you reconnect, you will know whether your session is still active. When done correctly, there's no chance that both A and B will think they own the lock at the same time. -Jordan

On July 15, 2015 at 1:17:10 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Thanks for the quick response Camille. If client A owns the lock and gets disconnected due to a network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. The problem here is that the client cannot distinguish a network partition from the ZooKeeper ensemble being in the leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
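Camille's "verification at the locked resource" and the lock-generation idea from the Chubby paper can be sketched as a fencing check: every write carries the generation under which the lock was acquired, and the resource rejects writes bearing a stale one. All names here are invented for illustration; the point is that the resource, not the lock service client, enforces exclusivity.

```python
# Minimal sketch of verification at the locked resource: each write
# carries the generation (fencing token) under which the lock was
# acquired, and the resource rejects stale generations. Invented names.

class FencedResource:
    def __init__(self):
        self.current_generation = 0
        self.writes = []

    def grant_lock(self):
        # Called on each lock handover; the returned generation is the
        # fencing token handed to the new holder.
        self.current_generation += 1
        return self.current_generation

    def write(self, generation, data):
        # The resource itself enforces exclusivity: a holder that lost
        # the lock (GC pause, partition) still carries an old generation.
        if generation != self.current_generation:
            raise PermissionError("stale lock generation")
        self.writes.append(data)


resource = FencedResource()
gen_a = resource.grant_lock()        # Client A acquires the lock
resource.write(gen_a, "a1")          # accepted

gen_b = resource.grant_lock()        # A's session expires; B acquires
resource.write(gen_b, "b1")          # accepted

try:
    resource.write(gen_a, "a2")      # A wakes from a pause and tries to write
except PermissionError:
    pass                             # rejected: A's generation is stale
assert resource.writes == ["a1", "b1"]
```

As Ivan notes elsewhere in the thread, this only works when the data store can be made aware of generations, which many stores cannot.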
Re: locking/leader election and dealing with session loss
This property may hold if you make a lot of timing/synchrony assumptions -- agreeing on who holds the lock in an asynchronous distributed system with failures is impossible; this is the FLP impossibility result. But even if it holds, this property is not very useful if the ZK client itself doesn't hold the application data. So one has to consider whether it is possible that the application sees messages from two clients that both think they are the leader, in an order which contradicts the lock acquisition order.

On Wed, Jul 15, 2015 at 1:26 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, “at any snapshot in time no two clients think they hold the same lock”. How your application acts on that fact is another thing. You might need sequence numbers, you might not. -Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote: Jordan, as Camille suggested, please read Sec 2.4 in the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.
Re: locking/leader election and dealing with session loss
“This property may hold if you make a lot of timing/synchrony assumptions” These assumptions and timing are intrinsic to using ZooKeeper. So, of course I’m making these assumptions. -Jordan

On July 15, 2015 at 3:57:12 PM, Alexander Shraer (shra...@gmail.com) wrote: This property may hold if you make a lot of timing/synchrony assumptions -- agreeing on who holds the lock in an asynchronous distributed system with failures is impossible; this is the FLP impossibility. But even if it holds, this property is not very useful if the ZK client itself doesn't have the application data. So one has to consider whether it is possible that the application sees messages from two clients that both think they are the leader, in an order which contradicts the lock acquisition order.
Re: locking/leader election and dealing with session loss
“at any snapshot in time no two clients think they hold the same lock” According to the ZK service, yes. But communication between the service and the client takes time. -Ivan

On Wed, Jul 15, 2015 at 10:54 PM Ivan Kelly iv...@apache.org wrote: Jordan, imagine you have a node which is the leader, using the HBase example. A client makes some request to the leader, which processes the request, lines up a write to the state in HBase, and promptly goes into a 30-second GC pause just before it flushes the socket. During the 30-second pause another node takes over as leader and starts writing to the state. Now, when the pause ends, what will stop the write from the first leader being flushed to the socket and then hitting HBase? -Ivan
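Ivan's timeline can be replayed against a store that does no generation checking. This is an illustrative simulation only (plain list appends standing in for HBase writes): the old leader's buffered write survives its pause and lands after the new leader's write, which is exactly the split brain he describes.

```python
# A replay of Ivan's timeline with a store that does NOT check lock
# generations: the first leader's buffered write survives its GC pause
# and lands after the new leader's write. Names are illustrative only.

store = []

# Leader 1 processes a request and lines up a write...
pending_write_leader1 = ("leader1", "update-x")
# ...then pauses for 30 seconds (GC) just before flushing the socket.

# Meanwhile leader 1's session expires and leader 2 takes over:
store.append(("leader2", "update-y"))

# The pause ends; nothing stops the old write from being flushed:
store.append(pending_write_leader1)

# The store now contains a write from a node that no longer held the
# lock, ordered AFTER the legitimate leader's write: split brain.
assert store == [("leader2", "update-y"), ("leader1", "update-x")]
```

With the generation check sketched earlier in the thread, the second append would be rejected instead of applied.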
Re: locking/leader election and dealing with session loss
It doesn’t matter that communication between the service and the client takes time. The client can determine that it is no longer the lock holder independently of the server. -Jordan

On July 15, 2015 at 3:55:40 PM, Ivan Kelly (iv...@apache.org) wrote: “at any snapshot in time no two clients think they hold the same lock” According to the ZK service. But communication between the service and the client takes time. -Ivan
Re: locking/leader election and dealing with session loss
I disagree; ZooKeeper itself actually doesn't rely on timing for safety. It won't get into an inconsistent state even if all timing assumptions fail (except for the sync operation, which is then not guaranteed to return the latest value, but that's a known issue that needs to be fixed).

On Wed, Jul 15, 2015 at 2:13 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: These assumptions and timing are intrinsic to using ZooKeeper. So, of course I’m making these assumptions. -Jordan
On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Ivan, I just read the blog and I still don’t see how this can happen. Sorry if I’m being dense. I’d appreciate a discussion on this. In your blog you state: when ZooKeeper tells you that you are leader, there’s no guarantee that there isn’t another node that 'thinks' its the leader.” However, given a long enough session time — I usually recommend 30–60 seconds, I don’t see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, session timeout. So, how then can there be two leaders? -Jordan On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote: I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a zk lock (Hbase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with hbase[2] if you want to see what it looks like. -Ivan [1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215 [2] https://github.com/ivankelly/hbase-exclusive-writer On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). 
-- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
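Jordan's heartbeat argument above can be sketched as a toy timeline. This is not ZooKeeper client code; the constants and function names are made up for illustration, and the numbers assume a 30-second session timeout with heartbeats at roughly a third of that. The point it shows: the reasoning holds only while the lock holder's own clock keeps running, and a stop-the-world pause longer than the safety margin breaks it.

```python
# Toy timeline (hypothetical names/values, not the real ZooKeeper client).
SESSION_TIMEOUT = 30.0                      # seconds, assumed configuration
HEARTBEAT_INTERVAL = SESSION_TIMEOUT / 3.0  # heartbeat is a fraction of it

def old_holder_gives_up_at(last_ack, pause):
    # The client only notices a missed heartbeat after it resumes running,
    # so a stop-the-world pause delays the moment it backs off.
    return last_ack + HEARTBEAT_INTERVAL + pause

def new_holder_acquires_at(last_ack):
    # The server expires the session a full timeout after the last heartbeat,
    # and only then can another client take the lock.
    return last_ack + SESSION_TIMEOUT

last_ack = 0.0
# No pause: the old holder backs off ~20s before the lock can move. Safe.
assert old_holder_gives_up_at(last_ack, pause=0.0) < new_holder_acquires_at(last_ack)
# A 30s GC pause: the old holder still believes it has the lock at t=40,
# ten seconds after the new holder acquired it at t=30. Unsafe window.
assert old_holder_gives_up_at(last_ack, pause=30.0) > new_holder_acquires_at(last_ack)
```

The same arithmetic is why later posts in the thread argue that raising the session timeout only moves the threshold rather than removing the window.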
Re: locking/leader election and dealing with session loss
Jordan, as Camille suggested, please read Sec 2.4 of the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf. It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote:

Ivan, I just read the blog and I still don't see how this can happen. Sorry if I'm being dense; I'd appreciate a discussion on this. In your blog you state: "when ZooKeeper tells you that you are leader, there's no guarantee that there isn't another node that 'thinks' it's the leader." However, given a long enough session timeout (I usually recommend 30-60 seconds), I don't see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat interval is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout has passed. So, how then can there be two leaders?

-Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:

I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers FTW. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.

-Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote:

Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper-dead scenario).
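The lock-generation idea from Chubby Sec 2.4 can be sketched with a hypothetical in-memory store (not a real database API): every write carries the generation number under which the lock was acquired, and the store rejects writes tagged with a generation older than the highest it has seen. With ZooKeeper, the sequence number of the lock znode (or its creation zxid) is a natural candidate for the generation.

```python
# Sketch only: a made-up store that enforces lock generations (fencing).
class FencedStore:
    def __init__(self):
        self.highest_generation = 0
        self.data = {}

    def write(self, key, value, generation):
        # Reject writes from a holder whose lock generation is stale.
        if generation < self.highest_generation:
            raise RuntimeError("stale lock generation; write fenced off")
        self.highest_generation = generation
        self.data[key] = value

store = FencedStore()
store.write("row", "from holder 1", generation=1)   # accepted
store.write("row", "from holder 2", generation=2)   # accepted; raises the bar
try:
    # Holder 1 wakes from a long pause and retries its buffered write.
    store.write("row", "late write from holder 1", generation=1)
except RuntimeError:
    pass  # fenced off: the stale write never lands
assert store.data["row"] == "from holder 2"
```

This is exactly the "storage has to support them" caveat from Ivan's post: the check lives in the store, so a store that cannot compare generations cannot fence.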
Re: locking/leader election and dealing with session loss
I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, "at any snapshot in time no two clients think they hold the same lock". How your application acts on that fact is another thing. You might need sequence numbers, you might not.

-Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote:

Jordan, as Camille suggested, please read Sec 2.4 of the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf. It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote:

Ivan, I just read the blog and I still don't see how this can happen. Sorry if I'm being dense; I'd appreciate a discussion on this. In your blog you state: "when ZooKeeper tells you that you are leader, there's no guarantee that there isn't another node that 'thinks' it's the leader." However, given a long enough session timeout (I usually recommend 30-60 seconds), I don't see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat interval is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout has passed. So, how then can there be two leaders?

-Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:

I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers FTW. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.

-Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote:

Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper-dead scenario).
Re: locking/leader election and dealing with session loss
To expand a bit on the ZooKeeper client/server handshake part: one possible option is to improve the situation for planned maintenance, i.e., before the leader is taken down (it knows it is going down), it tells the clients to continue until ZooKeeper service is restored. There are a lot of gotchas and edge conditions to think about here, and, being a ZooKeeper noob, I haven't been able to prove whether this will work or not.

-- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581302.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Jordan, imagine you have a node which is leader, using the HBase example. A client makes some request to the leader, which processes the request, lines up a write to the state in HBase, and promptly goes into a 30-second GC pause just before it flushes the socket. During the 30-second pause another node takes over as leader and starts writing to the state. Now, when the pause ends, what will stop the write from the first leader being flushed to the socket and then hitting HBase?

-Ivan

On Wed, Jul 15, 2015 at 10:26 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote:

I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, "at any snapshot in time no two clients think they hold the same lock". How your application acts on that fact is another thing. You might need sequence numbers, you might not.

-Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote:

Jordan, as Camille suggested, please read Sec 2.4 of the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf. It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote:

Ivan, I just read the blog and I still don't see how this can happen. Sorry if I'm being dense; I'd appreciate a discussion on this. In your blog you state: "when ZooKeeper tells you that you are leader, there's no guarantee that there isn't another node that 'thinks' it's the leader." However, given a long enough session timeout (I usually recommend 30-60 seconds), I don't see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat interval is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout has passed. So, how then can there be two leaders?

-Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:

I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers FTW. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.

-Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote:

Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper-dead scenario).
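Ivan's scenario above can be reduced to a toy event log (no real HBase or ZooKeeper involved; the leader names, values, and timestamps are invented for illustration). A store with no fencing applies writes in the order they arrive on the wire, so the paused old leader's delayed flush lands on top of the new leader's state.

```python
# Toy reconstruction of the GC-pause timeline against an unfenced store.
log = []

def unfenced_apply(writer, value):
    # The store cannot tell that "leader1" lost its lock long ago;
    # it simply applies whatever arrives, in arrival order.
    log.append((writer, value))

# t=0s:  leader1 buffers a write, then stalls in a 30-second GC pause.
# t=15s: leader1's session expires; leader2 acquires the lock and writes.
unfenced_apply("leader2", "v2")
# t=30s: leader1 wakes up and flushes the write it buffered before the pause.
unfenced_apply("leader1", "v1")

# Arrival order, not lock-acquisition order, decides the final state:
# the last entry in the log is the stale write from the old leader.
assert log[-1] == ("leader1", "v1")
```

Nothing on the wire stops that last flush, which is why the thread keeps coming back to generation/sequence-number checks in the store itself as the only robust answer.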
Re: locking/leader election and dealing with session loss
On Wed, Jul 15, 2015 at 11:14 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote:

> > This property may hold if you make a lot of timing/synchrony assumptions
>
> These assumptions and timing are intrinsic to using ZooKeeper. So, of course I'm making these assumptions.

ZooKeeper does not rely on timing for correctness. Sure, it uses timeouts to make sure it can make progress, but those timeouts can be arbitrary.

-Ivan