Re: locking/leader election and dealing with session loss
I've seen 40s+. Also, if combined with a network partition, the gc pause only needs 1/3 of the session timeout for the same effect to occur.

On Thu, 16 Jul 2015 15:58 Camille Fournier cami...@apache.org wrote: They can and have happened in prod to people. I started talking about it after hearing enough people complain about just this situation on twitter. If you are relying on very large jvm memory footprints, a 30s gc pause can and should be expected. In general I think most people don't need to worry about this most of the time, but it's one of those things that happens and the developers are almost always shocked. I'm a fan of being clear about edge cases, even rare ones, so that devs can make the right tradeoffs for their env.

Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.
If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee because the zk client doesn't have control over all the things that could go wrong. -Ivan
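Ivan's "1/3 of session timeout" point can be illustrated with a toy timeline. This sketch assumes the commonly described ZooKeeper client behavior (the server expires a session one full session timeout after the last heartbeat it received; the client declares itself disconnected after hearing nothing for roughly 2/3 of the timeout); the variable names are illustrative, not from any ZooKeeper API.

```python
# Toy timeline (units: seconds). A partition starts right after t=0,
# so the server never hears from the client again.
SESSION_TIMEOUT = 30.0

last_heartbeat_server_saw = 0.0
server_expires_at = last_heartbeat_server_saw + SESSION_TIMEOUT        # t=30

# The client only concludes it is disconnected after ~2/3 of the
# timeout has passed with no server contact.
client_notices_disconnect_at = SESSION_TIMEOUT * 2 / 3                 # t=20

# If a GC pause of just 1/3 of the timeout begins right then, the
# client cannot act on the disconnect until the pause ends...
gc_pause = SESSION_TIMEOUT / 3
client_reacts_at = client_notices_disconnect_at + gc_pause             # t=30

# ...by which time the server has already expired the session and a new
# leader may hold the lock: the safe reaction window has shrunk to zero.
unsafe_window = client_reacts_at - server_expires_at
print(server_expires_at, client_reacts_at, unsafe_window)  # → 30.0 30.0 0.0
```

So a pause far shorter than the full session timeout is enough, once a partition has already eaten the first 2/3 of the client's margin.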
Re: locking/leader election and dealing with session loss
In case there's still doubt around this issue, I've written a demo app that demonstrates the problem. https://github.com/ivankelly/hanging-chad -Ivan

On Wed, Jul 15, 2015 at 11:22 PM Alexander Shraer shra...@gmail.com wrote: I disagree, ZooKeeper itself actually doesn't rely on timing for safety - it won't get into an inconsistent state even if all timing assumptions fail (except for the sync operation, which is then not guaranteed to return the latest value, but that's a known issue that needs to be fixed).

On Wed, Jul 15, 2015 at 2:13 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: “This property may hold if you make a lot of timing/synchrony assumptions” These assumptions and timing are intrinsic to using ZooKeeper. So, of course I’m making these assumptions. -Jordan

On July 15, 2015 at 3:57:12 PM, Alexander Shraer (shra...@gmail.com) wrote: This property may hold if you make a lot of timing/synchrony assumptions -- agreeing on who holds the lock in an asynchronous distributed system with failures is impossible; this is the FLP impossibility result. But even if it holds, this property is not very useful if the ZK client itself doesn't have the application data. So one has to consider whether it is possible that the application sees messages from two clients that both think they are the leader, in an order which contradicts the lock acquisition order.

On Wed, Jul 15, 2015 at 1:26 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, at any snapshot in time “no two clients think they hold the same lock”. How your application acts on that fact is another thing. You might need sequence numbers, you might not.
-Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote: Jordan, as Camille suggested, please read Sec 2.4 in the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Ivan, I just read the blog and I still don’t see how this can happen. Sorry if I’m being dense. I’d appreciate a discussion on this. In your blog you state: “when ZooKeeper tells you that you are leader, there’s no guarantee that there isn’t another node that 'thinks' it's the leader.” However, given a long enough session timeout (I usually recommend 30–60 seconds), I don’t see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout. So, how then can there be two leaders? -Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote: I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a zk lock (HBase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.
-Ivan [1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215 [2] https://github.com/ivankelly/hbase-exclusive-writer On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee, because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30 seconds allowable pauses your session timeout would have to be much longer.

GC is just an example. There's other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
They can and have happened in prod to people. I started talking about it after hearing enough people complain about just this situation on twitter. If you are relying on very large jvm memory footprints, a 30s gc pause can and should be expected. In general I think most people don't need to worry about this most of the time, but it's one of those things that happens and the developers are almost always shocked. I'm a fan of being clear about edge cases, even rare ones, so that devs can make the right tradeoffs for their env.

Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee, because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
And a new Curator Tech Note to match: https://cwiki.apache.org/confluence/display/CURATOR/TN10 -JZ

On July 16, 2015 at 12:54:29 PM, Ivan Kelly (iv...@apache.org) wrote: I've seen 40s+. Also, if combined with a network partition, the gc pause only needs 1/3 of the session timeout for the same effect to occur.

On Thu, 16 Jul 2015 15:58 Camille Fournier cami...@apache.org wrote: They can and have happened in prod to people. I started talking about it after hearing enough people complain about just this situation on twitter. If you are relying on very large jvm memory footprints, a 30s gc pause can and should be expected. In general I think most people don't need to worry about this most of the time, but it's one of those things that happens and the developers are almost always shocked. I'm a fan of being clear about edge cases, even rare ones, so that devs can make the right tradeoffs for their env.

Of course there are a myriad theoretical possibilities. But I don’t believe any of what you’ve mentioned will happen in production. For any reasonable case, you can be guaranteed that no two processes will consider themselves lock holders at the same instant in time. -Jordan

On July 16, 2015 at 7:58:06 AM, Ivan Kelly (iv...@apache.org) wrote: On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: Are you really seeing 30s gc pauses in production? If so, then of course this could happen. However, if your application can tolerate a 30s pause (which is hard to believe) then your session timeout is too low. The point of the session timeout is to have enough coverage. So, if your app has 30-second allowable pauses, your session timeout would have to be much longer.

GC is just an example. There are other ways the same scenario could happen. The machine could swap out the process due to load. Someone could do something stupid in the zookeeper event thread and the session expired event is delayed. The state update could have hit the ip stack during a network partition, and the process then got wedged. The state update packet could have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the zk client to maintain a guarantee, then you really aren't giving any guarantee, because the zk client doesn't have control over all the things that could go wrong. -Ivan
Re: locking/leader election and dealing with session loss
On Thu, Jul 16, 2015 at 7:57 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote: I think this conversation is converging. Summary: there is always an edge case where VM pauses might exceed your client heartbeat and cause a client misperception about its state for a short period of time once the VM unpauses. In practice, a tuned VM that has been running within known bounds for a reasonable period will not exhibit this behavior. The session timeout must match these known bounds in order to have consistent client state.

This is assuming that the conditions won't change, because if they do, the known bounds go out the window. Ultimately, if you rely on timeouts, you have to evaluate how much risk you are willing to take with your data and how much work you are willing to do to mitigate that risk. The risk goes away if the old leader cannot update any state once a new leader appears. For this you need some sort of fencing mechanism, such as sequence numbers. Personally, I would always go for the latter option. -Ivan
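The fencing mechanism Ivan mentions can be sketched with an in-memory stand-in for the lock service and the shared resource. The class names and the integer-token scheme here are illustrative, not from ZooKeeper or any particular store:

```python
# Minimal fencing sketch: the lock service hands out a monotonically
# increasing token with each lock acquisition, and the shared resource
# rejects any write carrying a token older than the newest one it has
# seen. Names (LockService, FencedResource) are hypothetical.

class LockService:
    def __init__(self):
        self._token = 0

    def acquire(self):
        """Grant the lock and return its fencing token."""
        self._token += 1
        return self._token

class FencedResource:
    def __init__(self):
        self._highest_seen = 0
        self.data = []

    def write(self, token, value):
        # Reject writes from a holder that has since been superseded.
        if token < self._highest_seen:
            return False
        self._highest_seen = token
        self.data.append(value)
        return True

locks, resource = LockService(), FencedResource()
t1 = locks.acquire()          # client A becomes leader with token 1
t2 = locks.acquire()          # A's session expires; B becomes leader, token 2
assert resource.write(t2, "from B")       # accepted
assert not resource.write(t1, "from A")   # A wakes from a GC pause: rejected
```

The point is exactly Ivan's: the timeout risk disappears only because the resource itself, not a client-side timer, enforces who may still write.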
Re: locking/leader election and dealing with session loss
Thanks for the quick response Camille. If client A owns the lock and gets disconnected due to a network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. The problem here is that the client cannot distinguish a network partition from the zookeeper ensemble being in a leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Jordan, I agree with your recommendation to guarantee a single lock owner/leader. I am trying to figure out how I can improve the availability of the leader in a specific corner case, i.e. when the zookeeper ensemble is undergoing leader election or is broken (this happens often due to software pushes, host replacement, etc.). When the ensemble is broken, it means that the entire application is DEAD (no leader for any shard/resource). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581281.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
You should not commit suicide unless it goes into the SESSION_EXPIRED state. The quorum won't delete the ephemeral node immediately; it will only delete it when the session expires. So if your client for whatever reason disconnects and reconnects before the session expires, it will be fine. C

On Wed, Jul 15, 2015 at 1:54 PM, Vikas Mehta vikasme...@gmail.com wrote: When using zookeeper for leader election or distributed locking, my assumption is that as soon as the lock owner sees the session transition to the 'CONNECTING' state, it should commit suicide to avoid the risk of multiple owners. Please correct my assumption if I am wrong or there is a better way to guarantee a single lock owner/leader.

If the above assumption is correct, I am trying to figure out how I can improve the availability of the application (leader/lock owner) when the zookeeper ensemble is broken (e.g. undergoing zookeeper leader election for a prolonged period of time). Options I have considered:
1/ use multiple ensembles for leader election/locking to avoid a SPOF (complex to implement)
2/ extend the zookeeper protocol to give the client more info on connection loss, e.g. "zookeeper leader election in progress", so that the client can decide when it is ok not to commit suicide and still guarantee a single application leader/lock owner (I haven't been able to prove that this will guarantee a single application leader/lock owner).

If this has been already answered or solved, please point me to the post/doc. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277.html Sent from the zookeeper-user mailing list archive at Nabble.com.
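Camille's distinction (pause on a disconnect, die only on SESSION_EXPIRED) can be sketched as a small state handler. The state names mirror the ZooKeeper client states being discussed, but the handler class itself is illustrative:

```python
# Illustrative handling of ZooKeeper client state changes: on a
# disconnect the lock holder must stop touching the shared resource but
# need not abandon the session; only SESSION_EXPIRED is terminal,
# because only then has the quorum deleted the ephemeral lock node.

class LockHolder:
    def __init__(self):
        self.may_work = False
        self.session_alive = True

    def on_state(self, state):
        if state == "CONNECTED":
            # The ephemeral node still exists: safe to resume work.
            if self.session_alive:
                self.may_work = True
        elif state == "DISCONNECTED":
            # Can't tell a partition from an ensemble outage: pause, don't die.
            self.may_work = False
        elif state == "SESSION_EXPIRED":
            # The ephemeral node is gone; another client may hold the lock.
            self.may_work = False
            self.session_alive = False

h = LockHolder()
h.on_state("CONNECTED");     assert h.may_work
h.on_state("DISCONNECTED");  assert not h.may_work
h.on_state("CONNECTED");     assert h.may_work      # reconnected in time
h.on_state("SESSION_EXPIRED")
h.on_state("CONNECTED");     assert not h.may_work  # must re-acquire first
```

This is the "reconnects before the session expires, it will be fine" case in code: a disconnect pauses work, and only expiry forces the client to start over.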
Re: locking/leader election and dealing with session loss
Once client A loses connection it must assume that it no longer has the lock (you could try to time the session but I think that’s a bad idea). Once you reconnect, you will know if your session is still active or not. When done correctly, there’s no chance that both A and B will think they own the lock at the same time. -Jordan On July 15, 2015 at 1:17:10 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Thanks for the quick response Camille. If client A owns the lock, gets disconnected due to network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. Problem here is that client cannot distinguish network partition from zookeeper ensemble in leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
To expand on my point: if you want to be able to continue to attempt to make progress when ZK is down, the act of getting a lock should also give the lock owner a sequence number that it can use to identify the period of operation it is in. Say you get sequence number 1 and tag all of your requests with 1. If for any reason you lose the lock and don't know it, and server #2 gets the lock, it should get sequence #2. The resource should then reject all requests with a sequence below 2, so if any remaining requests tagged 1 are lying around, they should be rejected by the resource. And there you have it: you can continue to make safe forward progress while in an uncertain state on the ZK side, so long as the original lock holder is available and the resource validates these things. If both ZK itself and the original lock holder go down, you're presumably still AWOL. C

On Wed, Jul 15, 2015 at 2:24 PM, Camille Fournier cami...@apache.org wrote: I thought that the client itself had a notion of the session timeout internally that would conservatively let the client know that it was dead? If not, then that's my faulty memory. That being said, if you really care about the client not sending messages when it does not have the lock, the resource under contention needs to validate the messages it is receiving. You cannot guarantee that, just because a client believes it is connected and sends a message to the locked resource, the message will be received while the sender still has the lock. If you don't care about this possibility, then just assuming you lose the lock when you are in any state other than connected is adequate; just be aware that events such as long GC pauses and network issues can cause you to access the resource improperly. C

On Wed, Jul 15, 2015 at 2:19 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Once client A loses connection it must assume that it no longer has the lock (you could try to time the session but I think that’s a bad idea). Once you reconnect, you will know if your session is still active or not. When done correctly, there’s no chance that both A and B will think they own the lock at the same time. -Jordan

On July 15, 2015 at 1:17:10 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Thanks for the quick response Camille. If client A owns the lock, gets disconnected due to network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. Problem here is that client cannot distinguish network partition from zookeeper ensemble in leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
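Camille's sequence-number scheme can be sketched end to end, including the late-arriving-request case she describes. The function and the request tags are illustrative, not any real store's API:

```python
# Sketch of the scheme above: every request is tagged with the sequence
# number its sender got when it acquired the lock, and the resource
# rejects any request tagged below the highest sequence it has seen.
# Even if a stale holder's request arrives late (delayed by a GC pause
# or the network), the tag lets the resource drop it.

def apply_requests(requests):
    """Apply (sequence, op) requests in arrival order; return accepted ops."""
    highest_seen = 0
    accepted = []
    for seq, op in requests:
        if seq < highest_seen:
            continue  # stale holder: fenced off by the resource
        highest_seen = seq
        accepted.append(op)
    return accepted

# Holder 1 sends two writes; the second is delayed and arrives only
# after holder 2 (the new lock owner) has already written.
arrival_order = [
    (1, "h1-write-a"),
    (2, "h2-write-a"),
    (1, "h1-write-b"),   # late request from the superseded holder
    (2, "h2-write-b"),
]
print(apply_requests(arrival_order))
# → ['h1-write-a', 'h2-write-a', 'h2-write-b']
```

Note that the validation lives at the resource, which is the whole argument: no client-side timer is consulted when the stale write is dropped.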
Re: locking/leader election and dealing with session loss
He’s talking about multiple writers. Given a reasonable session timeout, even a GC shouldn’t matter. If the GC causes a heartbeat miss, the client will get SysDisconnected. -Jordan

On July 15, 2015 at 2:05:41 PM, Camille Fournier (skami...@gmail.com) wrote: If client A does a full gc immediately before sending a message, one that is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers. Assuming a reasonable session timeout:

* Client A gets the lock
* Client B watches Client A’s lock node
* Client A gets a network partition
* Client A will get a SysDisconnected before the session times out
* Client A must immediately assume it no longer has the lock
* Client A’s session times out
* Client A’s ephemeral node is deleted
* Client B’s watch fires
* Client B takes the lock
* Client A reconnects and gets SESSION_EXPIRED

Where’s the problem? This is how everyone uses ZooKeeper. There is zero chance of multiple writers in this scenario.

On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee a single writer (if I had one, it would reduce the need/value of using zookeeper, though it would still be useful in reducing lock contention, etc.), and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve the CAP theorem) by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
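Jordan's step sequence is the standard ZooKeeper lock recipe: ephemeral sequential nodes under a lock path, lowest number wins, each contender watches its predecessor. Here is a minimal in-memory sketch of that bookkeeping, with no real ZooKeeper client involved:

```python
# In-memory sketch of the ephemeral-sequential lock recipe: each
# contender creates a numbered node; the lowest number holds the lock;
# when a session expires, its node is deleted and the next contender
# takes over. The LockDir class is a stand-in, not a ZK API.

class LockDir:
    def __init__(self):
        self._next = 0
        self.nodes = {}          # sequence number -> client (ephemeral nodes)

    def contend(self, client):
        """Create an ephemeral sequential node; return its sequence number."""
        seq = self._next
        self._next += 1
        self.nodes[seq] = client
        return seq

    def holder(self):
        """The client with the lowest surviving node holds the lock."""
        return self.nodes[min(self.nodes)] if self.nodes else None

    def expire(self, client):
        """Session expiry deletes that client's ephemeral nodes."""
        self.nodes = {s: c for s, c in self.nodes.items() if c != client}

d = LockDir()
d.contend("A")                  # A gets the lock
d.contend("B")                  # B queues behind A, watching A's node
assert d.holder() == "A"
d.expire("A")                   # A partitions; its session times out
assert d.holder() == "B"        # B's watch fires; B takes the lock
```

What the sketch cannot show, and what Camille's objection targets, is the gap between "A's node is deleted" and "A finds out": during that gap a paused A may still emit writes unless the resource verifies them.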
Re: locking/leader election and dealing with session loss
OK - then that’s my recommendation. Anything else is unsafe for reasons described by others. That’s always been my recommendation with Curator or any use of ZooKeeper. -Jordan On July 15, 2015 at 2:16:03 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Camille, I don't have a central message store/processor that can guarantee a single writer (if I had one, it would reduce the need/value of using zookeeper, though it would still be useful in reducing lock contention, etc.), and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve the CAP theorem) by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
If client a does a full gc immediately before sending a message that is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers. Assuming a reasonable session timeout:

* Client A gets the lock
* Client B watches Client A’s lock node
* Client A gets a network partition
* Client A will get a SysDisconnected before the session times out
* Client A must immediately assume it no longer has the lock
* Client A’s session times out
* Client A’s ephemeral node is deleted
* Client B’s watch fires
* Client B takes the lock
* Client A reconnects and gets SESSION_EXPIRED

Where’s the problem? This is how everyone uses ZooKeeper. There is 0 chance of multiple writers in this scenario.

On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee single writer (if I had one, it would reduce (still useful in reducing lock contention, etc) the need/value of using zookeeper) and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve CAP theorem), by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I didn’t mean to imply you don’t. But, I really need to understand this as it changes many assumptions I have. Can anyone explain? If not, I’ll wait for your explanation.

On July 15, 2015 at 2:23:33 PM, Camille Fournier (cami...@apache.org) wrote: I'm in transit so I don't have time to explain this in detail but I promise you I know what I'm talking about, so take some time to think through what I've said and work it out for yourself; it takes longer than five minutes to work through it. C

Even if messages go out of order, how does that change the algorithm? Can you explain? The session is maintained by means of a heartbeat. If a heartbeat is missed then the client goes to Disconnected. Message ordering in this case shouldn’t matter. Are you saying that message ordering can cause two clients to think they have the lowest sequence number? -Jordan

On July 15, 2015 at 2:12:26 PM, Camille Fournier (skami...@gmail.com) wrote: I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen. It's relatively unlikely and rare but not impossible. If you're interested in it I'd recommend reading the Chubby paper out of Google, where they discuss it in some detail. C

On Jul 15, 2015 3:09 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: He’s talking about multiple writers. Given a reasonable session timeout, even a GC shouldn’t matter. If the GC causes a heartbeat miss the client will get SysDisconnected. -Jordan

On July 15, 2015 at 2:05:41 PM, Camille Fournier (skami...@gmail.com) wrote: If client a does a full gc immediately before sending a message that is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers.
Assuming a reasonable session timeout: * Client A gets the lock * Client B watches Client A’s lock node * Client A gets a network partition * Client A will get a SysDisconnected before the session times out * Client A must immediately assume it no longer has the lock * Client A’s session times out * Client A’s ephemeral node is deleted * Client B’s watch fires * Client B takes the lock * Client A reconnects and gets SESSION_EXPIRED Where’s the problem? This is how everyone uses ZooKeeper. There is 0 chance of multiple writers in this scenario. On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee single writer (if I had one, it would reduce (still useful in reducing lock contention, etc) the need/value of using zookeeper) and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve CAP theorem), by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Ivan, I just read the blog and I still don’t see how this can happen. Sorry if I’m being dense. I’d appreciate a discussion on this. In your blog you state: “when ZooKeeper tells you that you are leader, there’s no guarantee that there isn’t another node that 'thinks' it's the leader.” However, given a long enough session timeout (I usually recommend 30–60 seconds), I don’t see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout. So, how then can there be two leaders? -Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote: I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a zk lock (HBase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like. -Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a zookeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen. It's relatively unlikely and rare, but not impossible. If you're interested in it, I'd recommend reading the Chubby paper out of Google, where they discuss it in some detail. C

On Jul 15, 2015 3:09 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: He’s talking about multiple writers. Given a reasonable session timeout, even a GC pause shouldn’t matter. If the GC causes a heartbeat miss, the client will get SysDisconnected. -Jordan

On July 15, 2015 at 2:05:41 PM, Camille Fournier (skami...@gmail.com) wrote: If client A does a full GC immediately before sending a message, and the GC is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C

On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I don’t see how there’s a chance of multiple writers.
Re: locking/leader election and dealing with session loss
I'm in transit so I don't have time to explain this in detail, but I promise you I know what I'm talking about. Take some time to think through what I've said and work it out for yourself; it takes longer than five minutes to work through. C

Even if messages go out of order, how does that change the algorithm? Can you explain? The session is maintained by means of a heartbeat. If a heartbeat is missed, the client goes to Disconnected. Message ordering in this case shouldn’t matter. Are you saying that message ordering can cause two clients to think they have the lowest sequence number? -Jordan

On July 15, 2015 at 2:12:26 PM, Camille Fournier (skami...@gmail.com) wrote: I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen.
Re: locking/leader election and dealing with session loss
I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like. -Ivan [1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215 [2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper DEAD scenario). -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I don’t see how there’s a chance of multiple writers. Assuming a reasonable session timeout:

* Client A gets the lock
* Client B watches Client A’s lock node
* Client A gets a network partition
* Client A will get a SysDisconnected before the session times out
* Client A must immediately assume it no longer has the lock
* Client A’s session times out
* Client A’s ephemeral node is deleted
* Client B’s watch fires
* Client B takes the lock
* Client A reconnects and gets SESSION_EXPIRED

Where’s the problem? This is how everyone uses ZooKeeper. There is zero chance of multiple writers in this scenario.

On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Camille, I don't have a central message store/processor that can guarantee a single writer (if I had one, it would reduce the need/value of using ZooKeeper, though ZooKeeper would still be useful in reducing lock contention, etc.), and hence I am trying to minimize the chances of multiple writers (more or less trying to guarantee this) while maximizing availability (not trying to solve the CAP theorem) by solving some specific issues that affect availability. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html Sent from the zookeeper-user mailing list archive at Nabble.com.
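The client-side rule underlying the scenario above can be sketched as a tiny state machine: the lock is only usable while the session is in the connected state, and any departure from it means the lock must be assumed lost. This is a minimal simulation with invented names, not the ZooKeeper API.

```python
# Minimal sketch (invented names, not ZooKeeper API) of the client-side
# rule in the scenario above: treat the lock as lost the moment the
# session leaves the connected state, without waiting for SESSION_EXPIRED.

class LockHandle:
    """Tracks whether a ZK-style lock may still be acted upon."""

    def __init__(self):
        self.state = "CONNECTED"
        self.holds_lock = True

    def on_state_change(self, new_state):
        # Any departure from CONNECTED (disconnect, session loss, ...)
        # must be treated as losing the lock immediately.
        self.state = new_state
        if new_state != "CONNECTED":
            self.holds_lock = False

    def may_write(self):
        return self.state == "CONNECTED" and self.holds_lock


handle = LockHandle()
assert handle.may_write()

# Network partition: the client sees a disconnect before the session expires.
handle.on_state_change("DISCONNECTED")
assert not handle.may_write()

# Reconnecting later does not restore the lock; it must be re-acquired.
handle.on_state_change("CONNECTED")
assert not handle.may_write()
```

The catch the rest of the thread dwells on: `may_write()` is checked by the same process that can stall, so a write prepared before a pause can still leave the machine after the lock is gone.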
Re: locking/leader election and dealing with session loss
Jordan, We are using it like you describe and don't have a multi-writer problem. The issue is:

* The ZooKeeper ensemble goes into leader election
* All clients see their sessions transition to the CONNECTING state (SysDisconnected in your sequence)
* All clients give up all the locks they owned on the resources they were working with

The result is a GLOBAL outage of the application. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581287.html Sent from the zookeeper-user mailing list archive at Nabble.com.
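One possible mitigation for the global-outage problem described above (a sketch of a design choice, not something proposed in the thread) is to distinguish a brief disconnect from confirmed session loss: pause work on disconnect, but only release everything once the session is actually gone or a local, conservative deadline passes. The state names follow the SUSPENDED/LOST convention used by client libraries such as Curator and kazoo; everything else here is invented.

```python
# Sketch of one mitigation (a judgment call, not part of the thread's
# consensus): on SUSPENDED, pause work but keep the lock until a local,
# conservative session deadline passes; only on LOST (or deadline expiry)
# release everything. State names follow the Curator/kazoo convention.

import time

SESSION_TIMEOUT = 30.0

class Worker:
    def __init__(self, now=time.monotonic):
        self.now = now
        self.mode = "RUNNING"
        self.deadline = None  # local time after which the lock is assumed gone

    def on_connection_event(self, event):
        if event == "SUSPENDED":
            # Don't release yet: a brief ZK leader election would otherwise
            # make every client drop every lock at once.
            self.mode = "PAUSED"
            self.deadline = self.now() + SESSION_TIMEOUT
        elif event == "LOST":
            self.mode = "STOPPED"
            self.deadline = None
        elif event == "RECONNECTED":
            if self.mode == "PAUSED" and self.now() < self.deadline:
                self.mode = "RUNNING"   # session survived; resume
            else:
                self.mode = "STOPPED"   # too late; must re-acquire the lock

    def may_work(self):
        return self.mode == "RUNNING"


# Simulated timeline with a fake clock:
t = [0.0]
w = Worker(now=lambda: t[0])
w.on_connection_event("SUSPENDED")    # ensemble enters leader election
assert not w.may_work()               # paused, but lock not yet released
t[0] = 5.0
w.on_connection_event("RECONNECTED")  # election finished within the timeout
assert w.may_work()                   # resumes without a global outage
```

Because no writes happen while paused, safety is unchanged; what improves is availability across ZK interruptions shorter than the session timeout.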
Re: locking/leader election and dealing with session loss
+1 to what Camille is saying, and to her suggestion to use generations.

On Wed, Jul 15, 2015 at 12:04 PM, Camille Fournier skami...@gmail.com wrote: If client A does a full GC immediately before sending a message, and the GC is long enough to lose the lock, it will send the message out of order. You cannot guarantee exclusive access without verification at the locked resource. C
Re: locking/leader election and dealing with session loss
Camille, Agreed, but fixing application-level bugs is a different set of problems. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581289.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Even if messages go out of order, how does that change the algorithm? Can you explain? The session is maintained by means of a heartbeat. If a heartbeat is missed, the client goes to Disconnected. Message ordering in this case shouldn’t matter. Are you saying that message ordering can cause two clients to think they have the lowest sequence number? -Jordan

On July 15, 2015 at 2:12:26 PM, Camille Fournier (skami...@gmail.com) wrote: I don't know what to tell you Jordan, but this is an observable phenomenon and it can happen. It's relatively unlikely and rare, but not impossible. If you're interested in it, I'd recommend reading the Chubby paper out of Google, where they discuss it in some detail. C
Re: locking/leader election and dealing with session loss
Ivan, Thanks for the references. Great write-up describing the problem and the solution. I agree that for total ordering the storage layer should provide the guarantee. However, in my scenario I don't have a global/central storage system (I'm trying to build a globally distributed storage system, with servers in different continents). I am minimizing the chances of multiple writers with longer timeouts and eliminating downtime during planned maintenance, but I don't have a way to deal with the ZooKeeper ensemble being DEAD (or in leader election), which wipes out the entire application. Currently we twiddle our thumbs when ZooKeeper or the network has issues, and we would like to find a way to improve application availability during ZooKeeper service interruptions. Using multiple ensembles seems like a promising path, but I wanted to see if anyone has thought about extending the error handling between the ZooKeeper client/server to deal with this issue. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581299.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
I thought that the client itself had a notion of the session timeout internally that would conservatively let it know that its session was dead? If not, then that's my faulty memory. That being said, if you really care about the client not sending messages when it does not have the lock, the resource under contention needs to validate the messages it is receiving. You cannot guarantee that just because a client believes it is connected and sends a message to the locked resource, the message will be received while the sender still has the lock. If you don't care about this possibility, then assuming you lose the lock when you are in any state other than connected is adequate; just be aware that events such as long GC pauses and network issues can cause you to access the resource improperly. C

On Wed, Jul 15, 2015 at 2:19 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Once client A loses its connection it must assume that it no longer has the lock (you could try to time the session locally, but I think that's a bad idea). Once you reconnect, you will know whether your session is still active. When done correctly, there's no chance that both A and B will think they own the lock at the same time. -Jordan

On July 15, 2015 at 1:17:10 PM, Vikas Mehta (vikasme...@gmail.com) wrote: Thanks for the quick response Camille. If client A owns the lock and gets disconnected due to a network partition, it will not see the SESSION_EXPIRED event until it is too late, i.e. client B has acquired the lock and done the damage. The problem here is that the client cannot distinguish a network partition from the ZooKeeper ensemble being in the leader election state. -- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581279.html Sent from the zookeeper-user mailing list archive at Nabble.com.
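Camille's "verification at the locked resource" and the lock-generation idea from the Chubby paper can be sketched as a fencing check: every write carries the generation under which the lock was acquired, and the resource rejects writes bearing a stale one. All names here are invented for illustration; the point is that the resource, not the lock service client, enforces exclusivity.

```python
# Minimal sketch of verification at the locked resource: each write
# carries the generation (fencing token) under which the lock was
# acquired, and the resource rejects stale generations. Invented names.

class FencedResource:
    def __init__(self):
        self.current_generation = 0
        self.writes = []

    def grant_lock(self):
        # Called on each lock handover; the returned generation is the
        # fencing token handed to the new holder.
        self.current_generation += 1
        return self.current_generation

    def write(self, generation, data):
        # The resource itself enforces exclusivity: a holder that lost
        # the lock (GC pause, partition) still carries an old generation.
        if generation != self.current_generation:
            raise PermissionError("stale lock generation")
        self.writes.append(data)


resource = FencedResource()
gen_a = resource.grant_lock()        # Client A acquires the lock
resource.write(gen_a, "a1")          # accepted

gen_b = resource.grant_lock()        # A's session expires; B acquires
resource.write(gen_b, "b1")          # accepted

try:
    resource.write(gen_a, "a2")      # A wakes from a pause and tries to write
except PermissionError:
    pass                             # rejected: A's generation is stale
assert resource.writes == ["a1", "b1"]
```

As Ivan notes elsewhere in the thread, this only works when the data store can be made aware of generations, which many stores cannot.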
Re: locking/leader election and dealing with session loss
This property may hold if you make a lot of timing/synchrony assumptions -- agreeing on who holds the lock in an asynchronous distributed system with failures is impossible; this is the FLP impossibility result. But even if it holds, this property is not very useful if the ZK client itself doesn't hold the application data. So one has to consider whether it is possible that the application sees messages from two clients that both think they are the leader, in an order which contradicts the lock acquisition order.

On Wed, Jul 15, 2015 at 1:26 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, “at any snapshot in time no two clients think they hold the same lock”. How your application acts on that fact is another thing. You might need sequence numbers, you might not. -Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote: Jordan, as Camille suggested, please read Sec 2.4 in the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.
Re: locking/leader election and dealing with session loss
“This property may hold if you make a lot of timing/synchrony assumptions” These assumptions and timing are intrinsic to using ZooKeeper. So, of course I’m making these assumptions. -Jordan

On July 15, 2015 at 3:57:12 PM, Alexander Shraer (shra...@gmail.com) wrote: This property may hold if you make a lot of timing/synchrony assumptions -- agreeing on who holds the lock in an asynchronous distributed system with failures is impossible; this is the FLP impossibility. But even if it holds, this property is not very useful if the ZK client itself doesn't have the application data. So one has to consider whether it is possible that the application sees messages from two clients that both think they are the leader, in an order which contradicts the lock acquisition order.
Re: locking/leader election and dealing with session loss
“at any snapshot in time no two clients think they hold the same lock” According to the ZK service, yes. But communication between the service and the client takes time. -Ivan

On Wed, Jul 15, 2015 at 10:54 PM Ivan Kelly iv...@apache.org wrote: Jordan, imagine you have a node which is the leader, using the HBase example. A client makes some request to the leader, which processes the request, lines up a write to the state in HBase, and promptly goes into a 30-second GC pause just before it flushes the socket. During the 30-second pause another node takes over as leader and starts writing to the state. Now, when the pause ends, what will stop the write from the first leader being flushed to the socket and then hitting HBase? -Ivan
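Ivan's timeline can be replayed against a store that does no generation checking. This is an illustrative simulation only (plain list appends standing in for HBase writes): the old leader's buffered write survives its pause and lands after the new leader's write, which is exactly the split brain he describes.

```python
# A replay of Ivan's timeline with a store that does NOT check lock
# generations: the first leader's buffered write survives its GC pause
# and lands after the new leader's write. Names are illustrative only.

store = []

# Leader 1 processes a request and lines up a write...
pending_write_leader1 = ("leader1", "update-x")
# ...then pauses for 30 seconds (GC) just before flushing the socket.

# Meanwhile leader 1's session expires and leader 2 takes over:
store.append(("leader2", "update-y"))

# The pause ends; nothing stops the old write from being flushed:
store.append(pending_write_leader1)

# The store now contains a write from a node that no longer held the
# lock, ordered AFTER the legitimate leader's write: split brain.
assert store == [("leader2", "update-y"), ("leader1", "update-x")]
```

With the generation check sketched earlier in the thread, the second append would be rejected instead of applied.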
Re: locking/leader election and dealing with session loss
It doesn’t matter that communication between the service and the client takes time. The client can determine that it is no longer the lock holder independently of the server. -Jordan

On July 15, 2015 at 3:55:40 PM, Ivan Kelly (iv...@apache.org) wrote: “at any snapshot in time no two clients think they hold the same lock” According to the ZK service. But communication between the service and the client takes time. -Ivan
Re: locking/leader election and dealing with session loss
I disagree; ZooKeeper itself actually doesn't rely on timing for safety. It won't get into an inconsistent state even if all timing assumptions fail (except for the sync operation, which is then not guaranteed to return the latest value, but that's a known issue that needs to be fixed).

On Wed, Jul 15, 2015 at 2:13 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: These assumptions and timing are intrinsic to using ZooKeeper. So, of course I’m making these assumptions. -Jordan
On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote: Ivan, I just read the blog and I still don’t see how this can happen. Sorry if I’m being dense. I’d appreciate a discussion on this. In your blog you state: when ZooKeeper tells you that you are leader, there’s no guarantee that there isn’t another node that 'thinks' its the leader.” However, given a long enough session time — I usually recommend 30–60 seconds, I don’t see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, session timeout. So, how then can there be two leaders? -Jordan On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote: I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a zk lock (Hbase in this case). As Camille says, sequence numbers ftw. I'll add that the data store has to support them though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with hbase[2] if you want to see what it looks like. -Ivan [1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215 [2] https://github.com/ivankelly/hbase-exclusive-writer On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote: Jordan, I mean the client gives up the lock and stops working on the shared resource. So when zookeeper is unavailable, no one is working on any shared resource (because they cannot distinguish network partition from zookeeper DEAD scenario). 
-- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html Sent from the zookeeper-user mailing list archive at Nabble.com.
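Jordan's heartbeat argument above can be sketched as a toy timeline. This is not ZooKeeper client code; the constants and function names are made up for illustration, and the numbers assume a 30-second session timeout with heartbeats at roughly a third of that. The point it shows: the reasoning holds only while the lock holder's own clock keeps running, and a stop-the-world pause longer than the safety margin breaks it.

```python
# Toy timeline (hypothetical names/values, not the real ZooKeeper client).
SESSION_TIMEOUT = 30.0                      # seconds, assumed configuration
HEARTBEAT_INTERVAL = SESSION_TIMEOUT / 3.0  # heartbeat is a fraction of it

def old_holder_gives_up_at(last_ack, pause):
    # The client only notices a missed heartbeat after it resumes running,
    # so a stop-the-world pause delays the moment it backs off.
    return last_ack + HEARTBEAT_INTERVAL + pause

def new_holder_acquires_at(last_ack):
    # The server expires the session a full timeout after the last heartbeat,
    # and only then can another client take the lock.
    return last_ack + SESSION_TIMEOUT

last_ack = 0.0
# No pause: the old holder backs off ~20s before the lock can move. Safe.
assert old_holder_gives_up_at(last_ack, pause=0.0) < new_holder_acquires_at(last_ack)
# A 30s GC pause: the old holder still believes it has the lock at t=40,
# ten seconds after the new holder acquired it at t=30. Unsafe window.
assert old_holder_gives_up_at(last_ack, pause=30.0) > new_holder_acquires_at(last_ack)
```

The same arithmetic is why later posts in the thread argue that raising the session timeout only moves the threshold rather than removing the window.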
Re: locking/leader election and dealing with session loss
Jordan, as Camille suggested, please read Sec 2.4 of the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf. It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote:

Ivan, I just read the blog and I still don't see how this can happen. Sorry if I'm being dense; I'd appreciate a discussion on this. In your blog you state: "when ZooKeeper tells you that you are leader, there's no guarantee that there isn't another node that 'thinks' it's the leader." However, given a long enough session timeout (I usually recommend 30-60 seconds), I don't see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat interval is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout has passed. So, how then can there be two leaders?

-Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:

I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers FTW. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.

-Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote:

Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper-dead scenario).
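The lock-generation idea from Chubby Sec 2.4 can be sketched with a hypothetical in-memory store (not a real database API): every write carries the generation number under which the lock was acquired, and the store rejects writes tagged with a generation older than the highest it has seen. With ZooKeeper, the sequence number of the lock znode (or its creation zxid) is a natural candidate for the generation.

```python
# Sketch only: a made-up store that enforces lock generations (fencing).
class FencedStore:
    def __init__(self):
        self.highest_generation = 0
        self.data = {}

    def write(self, key, value, generation):
        # Reject writes from a holder whose lock generation is stale.
        if generation < self.highest_generation:
            raise RuntimeError("stale lock generation; write fenced off")
        self.highest_generation = generation
        self.data[key] = value

store = FencedStore()
store.write("row", "from holder 1", generation=1)   # accepted
store.write("row", "from holder 2", generation=2)   # accepted; raises the bar
try:
    # Holder 1 wakes from a long pause and retries its buffered write.
    store.write("row", "late write from holder 1", generation=1)
except RuntimeError:
    pass  # fenced off: the stale write never lands
assert store.data["row"] == "from holder 2"
```

This is exactly the "storage has to support them" caveat from Ivan's post: the check lives in the store, so a store that cannot compare generations cannot fence.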
Re: locking/leader election and dealing with session loss
I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, "at any snapshot in time no two clients think they hold the same lock". How your application acts on that fact is another thing. You might need sequence numbers, you might not.

-Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote:

Jordan, as Camille suggested, please read Sec 2.4 of the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf. It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote:

Ivan, I just read the blog and I still don't see how this can happen. Sorry if I'm being dense; I'd appreciate a discussion on this. In your blog you state: "when ZooKeeper tells you that you are leader, there's no guarantee that there isn't another node that 'thinks' it's the leader." However, given a long enough session timeout (I usually recommend 30-60 seconds), I don't see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat interval is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout has passed. So, how then can there be two leaders?

-Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:

I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers FTW. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.

-Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote:

Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper-dead scenario).
Re: locking/leader election and dealing with session loss
To expand a bit on the ZooKeeper client/server handshake part: one possible option is to improve the situation for planned maintenance, i.e., before the leader is taken down (it knows it is going down), it tells the clients to continue until ZooKeeper service is restored. There are a lot of gotchas and edge conditions to think about here, and, being a ZooKeeper noob, I haven't been able to prove whether this will work or not.

-- View this message in context: http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581302.html Sent from the zookeeper-user mailing list archive at Nabble.com.
Re: locking/leader election and dealing with session loss
Jordan, imagine you have a node which is leader, using the HBase example. A client makes some request to the leader, which processes the request, lines up a write to the state in HBase, and promptly goes into a 30-second GC pause just before it flushes the socket. During the 30-second pause another node takes over as leader and starts writing to the state. Now, when the pause ends, what will stop the write from the first leader being flushed to the socket and then hitting HBase?

-Ivan

On Wed, Jul 15, 2015 at 10:26 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote:

I think we may be talking past each other here. My contention (and the ZK docs agree, BTW) is that, properly written and configured, "at any snapshot in time no two clients think they hold the same lock". How your application acts on that fact is another thing. You might need sequence numbers, you might not.

-Jordan

On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com) wrote:

Jordan, as Camille suggested, please read Sec 2.4 of the Chubby paper: http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf. It suggests two ways in which the storage can support lock generations and proposes an alternative for the case where the storage can't be made aware of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote:

Ivan, I just read the blog and I still don't see how this can happen. Sorry if I'm being dense; I'd appreciate a discussion on this. In your blog you state: "when ZooKeeper tells you that you are leader, there's no guarantee that there isn't another node that 'thinks' it's the leader." However, given a long enough session timeout (I usually recommend 30-60 seconds), I don't see how this can happen. The client itself determines that there is a network partition when there is no heartbeat success. The heartbeat interval is a fraction of the session timeout. Once the heartbeat fails, the client must assume it no longer has the lock. Another client cannot take over the lock until, at minimum, the session timeout has passed. So, how then can there be two leaders?

-Jordan

On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:

I blogged about this exact problem a couple of weeks ago [1]. I give an example of how split brain can happen in a resource under a ZK lock (HBase in this case). As Camille says, sequence numbers FTW. I'll add that the data store has to support them, though, which not all do (in fact I've yet to see one in the wild that does). I've implemented a prototype that works with HBase [2] if you want to see what it looks like.

-Ivan

[1] https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
[2] https://github.com/ivankelly/hbase-exclusive-writer

On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote:

Jordan, I mean the client gives up the lock and stops working on the shared resource. So when ZooKeeper is unavailable, no one is working on any shared resource (because they cannot distinguish a network partition from a ZooKeeper-dead scenario).
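Ivan's scenario above can be reduced to a toy event log (no real HBase or ZooKeeper involved; the leader names, values, and timestamps are invented for illustration). A store with no fencing applies writes in the order they arrive on the wire, so the paused old leader's delayed flush lands on top of the new leader's state.

```python
# Toy reconstruction of the GC-pause timeline against an unfenced store.
log = []

def unfenced_apply(writer, value):
    # The store cannot tell that "leader1" lost its lock long ago;
    # it simply applies whatever arrives, in arrival order.
    log.append((writer, value))

# t=0s:  leader1 buffers a write, then stalls in a 30-second GC pause.
# t=15s: leader1's session expires; leader2 acquires the lock and writes.
unfenced_apply("leader2", "v2")
# t=30s: leader1 wakes up and flushes the write it buffered before the pause.
unfenced_apply("leader1", "v1")

# Arrival order, not lock-acquisition order, decides the final state:
# the last entry in the log is the stale write from the old leader.
assert log[-1] == ("leader1", "v1")
```

Nothing on the wire stops that last flush, which is why the thread keeps coming back to generation/sequence-number checks in the store itself as the only robust answer.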
Re: locking/leader election and dealing with session loss
On Wed, Jul 15, 2015 at 11:14 PM Jordan Zimmerman jor...@jordanzimmerman.com wrote:

> > This property may hold if you make a lot of timing/synchrony assumptions
>
> These assumptions and timing are intrinsic to using ZooKeeper. So, of course I'm making these assumptions.

ZooKeeper does not rely on timing for correctness. Sure, it uses timeouts to make sure it can make progress, but those timeouts can be arbitrary.

-Ivan