Re: zookeeper on ec2
What is your client timeout? It may be too low. Also see this section on handling recoverable errors: http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling Connection loss in particular needs special care, since: "When a ZooKeeper client loses a connection to the ZooKeeper server there may be some requests in flight; we don't know where they were in their flight at the time of the connection loss."

Patrick

Satish Bhatti wrote: I have recently started running on EC2 and am seeing quite a few ConnectionLoss exceptions. Should I just catch these and retry? Since I assume that eventually, if the shit truly hits the fan, I will get a SessionExpired? Satish

On Mon, Jul 6, 2009 at 11:35 AM, Ted Dunning ted.dunn...@gmail.com wrote: We have used EC2 quite a bit for ZK. The basic lessons that I have learned include:

a) EC2's biggest advantage after scaling and elasticity was conformity of configuration. Since you are bringing machines up and down all the time, they begin to act more like programs and you wind up with boot scripts that give you a very predictable environment. Nice.

b) The EC2 interconnect has a lot more going on than a dedicated VLAN. That can make the ZK servers appear a bit less connected. You have to plan for ConnectionLoss events.

c) For highest reliability, I switched to large instances. On reflection, I think that was helpful, but less important than I thought at the time.

d) Increasing and decreasing cluster size is nearly painless and is easily scriptable. To decrease, do a rolling update on the survivors to update their configuration, then take down the instance you want to lose. To increase, do a rolling update starting with the new instances to update the configuration to include all of the machines. The rolling update should bounce each ZK server with several seconds between each bounce. Rescaling the cluster takes less than a minute, which makes it comparable to EC2 instance boot time (about 30 seconds for the Alestic ubuntu instance that we used, plus about 20 seconds for additional configuration).

On Mon, Jul 6, 2009 at 4:45 AM, David Graf david.g...@28msec.com wrote: Hello, I want to set up a ZooKeeper ensemble on Amazon's EC2 service. In my system, ZooKeeper is used to run a locking service and to generate unique ids. Currently, for testing purposes, I am only running one instance. Now I need to set up an ensemble to protect my system against crashes. The EC2 service has some differences from a normal server farm, e.g. the data saved on the file system of an EC2 instance is lost if the instance crashes. In the ZooKeeper documentation I have read that ZooKeeper saves snapshots of the in-memory data to the file system. Is that needed for recovery? Logically, it would be much easier for me if this were not the case. Additionally, EC2 brings the advantage that servers can be switched on and off dynamically depending on load, traffic, etc. Can this advantage be utilized for a ZooKeeper ensemble? Is it possible to add a ZooKeeper server dynamically to an ensemble, e.g. depending on the in-memory load? David
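As a concrete illustration of the client timeout Patrick is asking about, here is a minimal sketch (not from the thread) of creating a ZooKeeper handle with an explicit session timeout and a default watcher that distinguishes a transient Disconnected from a fatal Expired. The connect string and the 30 second timeout are placeholder values.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionWatcherExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical connect string; a 30 second session timeout lets the
            // client ride out pauses shorter than that without losing its session.
            ZooKeeper zk = new ZooKeeper("ec2-zk1:2181", 30000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    switch (event.getState()) {
                        case SyncConnected:
                            System.out.println("connected/reconnected; session still valid");
                            break;
                        case Disconnected:
                            // Transient: in-flight requests may or may not have reached the server.
                            System.out.println("disconnected; expect ConnectionLoss on pending calls");
                            break;
                        case Expired:
                            // Fatal for this handle: a new ZooKeeper instance must be created.
                            System.out.println("session expired; recreate handle and ephemeral state");
                            break;
                        default:
                            break;
                    }
                }
            });
            // ... use zk ...
            zk.close();
        }
    }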
Re: zookeeper on ec2
Hi Satish, ConnectionLoss is a little trickier than just retrying blindly. Please read the following section on this, http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling, and the programmer's guide, http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html, to learn more about how to handle CONNECTIONLOSS. The idea is that blindly retrying can create problems, since a CONNECTIONLOSS does NOT necessarily mean that the ZooKeeper operation you were executing failed to execute: it is possible that the operation went through on the servers. Since this has been a constant source of confusion for everyone who starts using ZooKeeper, we are working on a fix (ZOOKEEPER-22) that will take care of this problem, so programmers will not have to worry about CONNECTIONLOSS handling.

Thanks
mahadev

On 9/1/09 4:13 PM, Satish Bhatti cthd2...@gmail.com wrote: I have recently started running on EC2 and am seeing quite a few ConnectionLoss exceptions. Should I just catch these and retry? Since I assume that eventually, if the shit truly hits the fan, I will get a SessionExpired? Satish (...)
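To make the retry pitfall concrete, below is a hedged sketch of one way to handle it: an idempotent read can simply be retried on ConnectionLoss, while a create has to allow for the possibility that the lost attempt already succeeded and therefore treat NodeExists as a likely success. The helper names, retry count, and backoff are illustrative assumptions, not code from the thread.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class RetryExample {
        // Idempotent read: safe to retry on ConnectionLoss.
        static byte[] readWithRetry(ZooKeeper zk, String path, int maxRetries)
                throws KeeperException, InterruptedException {
            for (int i = 0; ; i++) {
                try {
                    return zk.getData(path, false, null);
                } catch (KeeperException.ConnectionLossException e) {
                    if (i >= maxRetries) throw e;
                    Thread.sleep(1000); // simple backoff; tune as needed
                }
            }
        }

        // Non-idempotent create: the lost attempt may already have succeeded on the
        // servers, so NodeExists on retry may simply mean "already done".
        static void createWithRetry(ZooKeeper zk, String path, byte[] data, int maxRetries)
                throws KeeperException, InterruptedException {
            for (int i = 0; ; i++) {
                try {
                    zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                    return;
                } catch (KeeperException.NodeExistsException e) {
                    return; // likely created by our own lost attempt (or by someone else)
                } catch (KeeperException.ConnectionLossException e) {
                    if (i >= maxRetries) throw e;
                    Thread.sleep(1000);
                }
            }
        }
    }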
Re: zookeeper on ec2
I'm not very familiar with the EC2 environment; are you doing any monitoring? In particular of network connectivity between nodes? This sounds like networking issues between nodes. (I'm assuming you've also looked at stuff like http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting and verified that you are not swapping, checked GC pressure, etc.)

Patrick

Satish Bhatti wrote: Session timeout is 30 seconds.

On Tue, Sep 1, 2009 at 4:26 PM, Patrick Hunt ph...@apache.org wrote: What is your client timeout? It may be too low. (...)
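For the kind of basic monitoring Patrick mentions, one simple check is to poll each server with ZooKeeper's four-letter-word command "ruok" over a plain socket and expect "imok" back. The sketch below assumes host, port, and timeout values for illustration; on newer servers the four-letter words may need to be whitelisted.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class RuokCheck {
        // Returns true if the server at host:port answers "imok" to the "ruok" command.
        static boolean isServerOk(String host, int port) {
            try (Socket sock = new Socket()) {
                sock.connect(new InetSocketAddress(host, port), 3000);
                OutputStream out = sock.getOutputStream();
                out.write("ruok".getBytes("US-ASCII"));
                out.flush();
                InputStream in = sock.getInputStream();
                byte[] buf = new byte[4];
                int n = in.read(buf);
                return n == 4 && "imok".equals(new String(buf, "US-ASCII"));
            } catch (Exception e) {
                return false; // treat any connection problem as "not ok"
            }
        }

        public static void main(String[] args) {
            System.out.println(isServerOk("localhost", 2181) ? "imok" : "no response");
        }
    }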
Re: zookeeper on ec2
For my initial testing I am running with a single ZooKeeper server, i.e. the ensemble only has one server. Not sure if this is exacerbating the problem? I will check out the troubleshooting link you sent me.

On Tue, Sep 1, 2009 at 5:01 PM, Patrick Hunt ph...@apache.org wrote: I'm not very familiar with the EC2 environment; are you doing any monitoring? In particular of network connectivity between nodes? (...)
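For comparison, here is a hedged sketch of what the client side looks like against a three-server ensemble rather than a single server: the client lists every member in its connect string so it can fail over when one server goes away. The hostnames are placeholders, and the zoo.cfg server lines shown in the comment are only the standard membership format, not configuration from the thread.

    import org.apache.zookeeper.ZooKeeper;

    public class EnsembleClientExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical 3-server ensemble. Each server's zoo.cfg would carry the
            // same membership, e.g.:
            //   server.1=zk1.example.internal:2888:3888
            //   server.2=zk2.example.internal:2888:3888
            //   server.3=zk3.example.internal:2888:3888
            // The client lists all members so it can fail over to a live server.
            String connectString =
                    "zk1.example.internal:2181,zk2.example.internal:2181,zk3.example.internal:2181";
            ZooKeeper zk = new ZooKeeper(connectString, 30000,
                    event -> System.out.println("state: " + event.getState()));
            // ... use zk ...
            zk.close();
        }
    }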
Re: zookeeper on ec2
Depends on what your tests are. Are they pretty simple/light? Then it's probably a network issue. Heavy load testing? Then it might be the server/client, might be the network. The easiest thing is to run a ping test while running your ZK test and see if pings are getting through (and at what latency). You should also review your client/server logs for any information logged during the ConnectionLoss. Ted Dunning would be a good resource - he runs ZK inside EC2 and has a lot of experience with it.

Patrick

Satish Bhatti wrote: For my initial testing I am running with a single ZooKeeper server, i.e. the ensemble only has one server. Not sure if this is exacerbating the problem? I will check out the troubleshooting link you sent me. (...)
Re: zookeeper on ec2
Can you enable verboseGC and look at the tenuring distribution and times for GC?

On Tue, Sep 1, 2009 at 5:54 PM, Satish Bhatti cthd2...@gmail.com wrote: Parallel/Serial.

inf...@domu-12-31-39-06-3d-d1:/opt/ir/agent/infact-installs/aaa/infact$ iostat
Linux 2.6.18-xenU-ec2-v1.0 (domU-12-31-39-06-3D-D1) 09/01/2009 _x86_64_

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.11   0.00     1.54     2.96   20.30   9.08

Device:  tps      Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda      2460.83  410.02      12458.18    40499322   1230554928
sdc      0.00     0.00        0.00        96         0
sda1     0.53     5.01        4.89        495338     482592

On Tue, Sep 1, 2009 at 5:46 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Satish, what GC are you using? Is it ConcurrentMarkSweep or Parallel/Serial? Also, how is your disk usage on this machine? Can you check your iostat numbers? Thanks mahadev

On 9/1/09 5:15 PM, Satish Bhatti cthd2...@gmail.com wrote: GC Time: 11.628 seconds on PS MarkSweep (389 collections), 5 minutes on PS Scavenge (7,636 collections). It's been running for about 48 hours.

On Tue, Sep 1, 2009 at 5:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Do you have long GC delays?

On Tue, Sep 1, 2009 at 4:51 PM, Satish Bhatti cthd2...@gmail.com wrote: Session timeout is 30 seconds. (...)
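The GC numbers Satish quotes are the kind of figures exposed by the JVM's GarbageCollectorMXBeans; a small sketch that polls them is below, complementing flags such as -verbose:gc and -XX:+PrintTenuringDistribution that Ted is asking about. The polling interval and output format are assumptions.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcReport {
        public static void main(String[] args) throws InterruptedException {
            // Print cumulative GC counts and times every 10 seconds. Large jumps in
            // collection time between samples point at pauses long enough to cost a
            // ZooKeeper client its connection (or, if long enough, its session).
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("%-15s collections=%d totalTimeMs=%d%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(10_000);
            }
        }
    }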
Re: zookeeper on ec2
Henry Robinson wrote: Effectively, EC2 does not introduce any new failure modes but potentially exacerbates some existing ones. If a majority of EC2 nodes fail (in the sense that their hard drive images cannot be recovered), there is no way to restart the cluster, and persistence is lost. As you say, this is highly unlikely. If, for some reason, the quorums are set such that only a single node failure could bring down the quorum (bad design, but plausible), this failure is more likely.

This is not strictly true. The cluster cannot recover _automatically_ if more than n servers fail, where the ensemble size is 2n+1. However, you can recover manually as long as at least one snapshot and its trailing logs can be recovered. We can even recover if the latest snapshots are corrupted, as long as we can recover a snapshot from some previous time t and all logs subsequent to t. (...)
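For reference, the arithmetic behind "cannot recover automatically once more than n of 2n+1 servers fail" is just the majority-quorum rule; the tiny helper below (illustrative only, not ZooKeeper code) states the liveness condition explicitly.

    public class QuorumMath {
        // An ensemble of size 2n+1 can keep serving (elect a leader, commit writes)
        // only while a strict majority of its servers is up, i.e. it tolerates at
        // most n simultaneous failures before manual recovery is needed.
        static boolean hasQuorum(int ensembleSize, int serversUp) {
            return serversUp > ensembleSize / 2;
        }

        public static void main(String[] args) {
            System.out.println(hasQuorum(5, 3)); // true: a 5-server ensemble tolerates 2 failures
            System.out.println(hasQuorum(5, 2)); // false: 3 failures, no automatic recovery
            System.out.println(hasQuorum(3, 1)); // false: a 3-server ensemble needs 2 servers up
        }
    }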
Re: zookeeper on ec2
Hi Ted,

b) EC2 interconnect has a lot more going on than in a dedicated VLAN. That can make the ZK servers appear a bit less connected. You have to plan for ConnectionLoss events.

Interesting.

c) for highest reliability, I switched to large instances. On reflection, I think that was helpful, but less important than I thought at the time.

Besides the fact that there are more resources for ZooKeeper, this likely helps as well because it reduces the number of systems competing for the real hardware.

d) increasing and decreasing cluster size is nearly painless and is easily scriptable. To decrease, do a rolling update on the survivors to update (...)

Quite interesting indeed. I guess the work that Henry is pushing on these couple of JIRA tickets will greatly facilitate this.

Do you mind if I ask you a couple of questions on this: Do you have any kind of performance data about how much load ZK can take under this environment? Have you tried to put the log and snapshot files under EBS?

-- Gustavo Niemeyer http://niemeyer.net
Re: zookeeper on ec2
On Jul 6, 2009, at 15:40, Henry Robinson wrote: This is an interesting way of doing things. It seems like there is a correctness issue: if a majority of servers fail, with the remaining minority lagging the leader for some reason, won't the ensemble's current state be forever lost? This is akin to a majority of servers failing and never recovering. ZK relies on the eventual liveness of a majority of its servers; with EC2 it seems possible that that property might not be satisfied.

I think you are absolutely correct. However, my understanding of EC2 failure modes is that even though there is no guarantee that a particular instance's disk will survive a failure, it is very possible to observe EC2 nodes that fail temporarily (such as by rebooting). In these cases the instance's disk typically does survive, and when it comes back it will have the same contents. It is only permanent EC2 failures where the disk is gone (e.g. hardware failure, or Amazon decides to pull it for some other reason).

Thus, this looks a lot like running your own machines in your own data center to me. Soft failures will recover; hardware failures won't. The only difference is that if you were running the machines yourself and you ran into some weird issue where you had hardware failures across a majority of your ZooKeeper ensemble, you could physically move the disks to recover the state. If this happens in EC2, you will have to do some sort of manual repair where you forcibly restart ZooKeeper using the state of one of the surviving members. Some ZooKeeper operations may be lost in this case. However, we are talking about a situation that seems exceedingly rare. No matter what kind of system you are running, serious non-recoverable failures will happen, so I don't see this as an impediment to running ZooKeeper or other quorum systems in EC2.

That said, I haven't run enough EC2 instances for a long enough period of time to observe any serious failures or recoveries. If anyone has more detailed information, I would love to hear about it.

Evan Jones

-- Evan Jones http://evanjones.ca/
Re: zookeeper on ec2
On Mon, Jul 6, 2009 at 12:58 PM, Gustavo Niemeyer gust...@niemeyer.net wrote: can make the ZK servers appear a bit less connected. You have to plan for ConnectionLoss events. Interesting.

Note that most of these seem to be related to client issues, especially GC. If you configure in such a way as to get long pauses, you will see connection loss. The default configuration for ZK is a pretty short timeout (5 seconds) that is pretty easy to exceed with out-of-the-box GC params on the client side.

c) for highest reliability, I switched to large instances. On reflection, I think that was helpful, but less important than I thought at the time. Besides the fact that there are more resources for ZooKeeper, this likely helps as well because it reduces the number of systems competing for the real hardware.

Yes, but I think that this is less significant than I expected. Small instances have pretty dedicated access to their core. Disk contention is a bit of an issue, but not much.

d) increasing and decreasing cluster size is nearly painless and is easily scriptable. To decrease, do a rolling update on the survivors to update (...) Quite interesting indeed. I guess the work that Henry is pushing on these couple of JIRA tickets will greatly facilitate this.

Absolutely. I was still very surprised at how small the pain is in the current world.

Do you have any kind of performance data about how much load ZK can take under this environment?

Only barely. Partly with an eye toward system diagnostics, and partly to give ZK something to do, I reported a wide swath of data available from /proc into ZK every few seconds for all of my servers. This led to a few dozen transactions per second and ultimately helped me discover and understand some of the connection issues for clients. ZK seemed pretty darned stable through all of this. The only instability that I saw was caused by excessive amounts of data in ZK itself. As I neared the (small) amount of memory I had allocated for ZK use, I would see servers go into paroxysms of GC, but the cluster functionality was impaired to a surprisingly small degree.

Have you tried to put the log and snapshot files under EBS?

No. I considered it, but I wanted fewer moving parts rather than more. Doing that would make the intricate and unlikely failure mode that Henry asked about even less likely, but I don't know if it would increase or decrease the probability of any kind of failure. The observed failure modes for ZK in EC2 were completely dominated by our (my) own failings (such as letting too much data accumulate).
Re: zookeeper on ec2
Hi again,

(...) ZK seemed pretty darned stable through all of this.

Sounds like a nice test, and it's great to hear that ZooKeeper works well there.

The only instability that I saw was caused by excessive amounts of data in ZK itself. As I neared the (small) amount of memory I had allocated for ZK use, I would see servers go into paroxysms of GC, but the cluster functionality was impaired to a surprisingly small degree.

Cool, makes sense.

No. I considered it, but I wanted fewer moving parts rather than more. Doing that would make the intricate and unlikely failure mode that Henry asked about even less likely, but I don't know if it would increase or decrease the probability of any kind of failure.

Yeah, I guess it depends a bit on the system architecture too. If the system is designed in such a way that ZK is keeping track of coordination data which must be resumed after a full stop of the system, having it stored persistently would prevent the loss of important information. If ZK is really just coordinating ephemeral data (e.g. locks), then if the whole system goes down it's ok to just allow it to start up again in an empty state.

The observed failure modes for ZK in EC2 were completely dominated by our (my) own failings (such as letting too much data accumulate).

Details always take a few iterations to get really right. Thanks for this data, Ted.

-- Gustavo Niemeyer http://niemeyer.net
Re: zookeeper on ec2
On Mon, Jul 6, 2009 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com wrote: No. This should not cause data loss. As soon as ZK cannot replicate changes to a majority of machines, it refuses to take any more changes. This is old ground and is required for correctness in the face of network partition. It is conceivable (barely) that *exactly* the minority that were behind were the survivors, but this is almost equivalent to a complete failure of the cluster choreographed in such a way that a few nodes come back from the dead just afterwards. That could cause some completed transactions to disappear from the state, but at this level of massive failure, we have the same issues with any cluster.

Effectively, EC2 does not introduce any new failure modes but potentially exacerbates some existing ones. If a majority of EC2 nodes fail (in the sense that their hard drive images cannot be recovered), there is no way to restart the cluster, and persistence is lost. As you say, this is highly unlikely. If, for some reason, the quorums are set such that only a single node failure could bring down the quorum (bad design, but plausible), this failure is more likely.

EC2 just ups the stakes - crash failures are now potentially more dangerous (bugs, packet corruption, rack-local hardware failures etc. could all cause crash failures). It is common to assume that, notwithstanding a significant physical event that wipes a number of hard drives, writes that are written stay written. This assumption is sometimes false given certain choices of filesystem. EC2 just gives us a few more ways for that not to be true.

I think it's more possible than one might expect to have a lagging minority left behind - say they are partitioned from the majority by a malfunctioning switch. They might all be lagging already as a result. Care must be taken not to bring up another follower on the minority side to make it a majority, else there are split-brain issues as well as the possibility of lost transactions. Again, not *too* likely to happen in the wild, but these permanently running services have a nasty habit of exploring the edge cases...

To be explicit, you can cause any ZK cluster to back-track in time by doing the following: ... f) add new members of the cluster

Which is why care needs to be taken that the ensemble can't be expanded with a current quorum. Dynamic membership doesn't save us when a majority fails - the existence of a quorum is a liveness condition for ZK. To help with the liveness issue we can sacrifice a little safety (see, e.g., vector clock ordered timestamps in Dynamo), but I think that ZK is aimed at safety first, liveness second. Not that you were advocating changing that, I'm just articulating why correctness is extremely important from my perspective.

Henry

At this point, you will have lost the transactions from (b), but I really, really am not going to worry about this happening either by plan or by accident. Without steps (e) and (f), the cluster will tell you that it knows something is wrong and that it cannot elect a leader. If you don't have *exact* coincidence of the survivor set and the set of laggards, then you won't have any data loss at all. You have to decide if this is too much risk for you. My feeling is that it is an OK level of correctness for conventional weapon fire control, but not for nuclear weapons safeguards. Since my apps are considerably less sensitive than either of those, I am not much worried.
On Mon, Jul 6, 2009 at 12:40 PM, Henry Robinson he...@cloudera.com wrote: It seems like there is a correctness issue: if a majority of servers fail, with the remaining minority lagging the leader for some reason, won't the ensemble's current state be forever lost?