Hi, I would like to discuss deployment scenarios for apps and ZooKeeper in AWS. I have been trying to find information about this, but haven't found much yet.
We have been working on better redundancy for our apps - hundreds of VMs running 24x7 - and ZooKeeper is one component we introduced last year. While it is working fine, there are some little tricks and missing pieces in the current setup. I would really like to hear how others are configuring their apps and ZooKeeper in AWS.

Trying to summarise our current setup:

- One autoscaling group (ASG) for the 5 ZooKeeper servers per application; the ASG will replace an instance with a fresh one if it goes bad.
- The ZooKeeper servers each have their assigned elastic IP, and their zoo.cfg lists these elastic IPs in the server.N lines - the IP addresses directly, not the names. We can swap a ZooKeeper VM by terminating it and letting the ASG create a new one; once it is assigned the freed elastic IP, it joins the ZooKeeper cluster.
- The ZooKeeper security group explicitly allows those 5 elastic IPs on ports 2888 and 3888, plus the SGs of our app servers.
- The image we use for the ZooKeeper ASG contains a small extra service which takes care of automatically assigning the configured elastic IPs to its ASG members. So when a new server boots up, the remaining ASG members will assign the missing elastic IP to the new instance, and it will start ZooKeeper and join the cluster. The same image is used for all app deployments, with the user data of the ASG specifying the elastic IPs and some other details. One image gives us a redundant, self-healing ZooKeeper cluster per application.
- The application servers are spread across different SGs depending on their needs and roles, and the connect string is configured with logical names like zookeeperX.app.domain.org for X=1,2,3,4,5. We manually added mappings from these to the EC2 public hostname of the elastic IPs - like ec2-A-B-C-D.compute-1.amazonaws.com, with A.B.C.D being the corresponding elastic IP.
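To make the zoo.cfg bullet above concrete, here is a minimal sketch of such a config. The elastic IPs are placeholders from the documentation range (203.0.113.x), and the timing values are common defaults, not our exact settings:

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.N lines list the elastic IPs directly, not DNS names
server.1=203.0.113.11:2888:3888
server.2=203.0.113.12:2888:3888
server.3=203.0.113.13:2888:3888
server.4=203.0.113.14:2888:3888
server.5=203.0.113.15:2888:3888
```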
This has the great benefit that our application VMs, when looking up these logical zookeeperX.app.domain.org names, resolve them to the current private IP of that ZooKeeper server instance, and when connecting, the SG will let them through. (If we used A.B.C.D directly, we would need to provision each application VM explicitly in the SG of the ZooKeeper cluster - hundreds of servers which change somewhat from week to week.)

- We use Curator for leader election to pick which server is doing which role, and we run some 5-10% more servers than the roles we need. Each server holds on to its role until it loses its session, and a spare server jumps in to take over. So if an app server goes bad (e.g. EBS, networking, or it just disappears), one of the others takes over.
- We changed Curator's LeaderLatch somewhat to hang on to the leadership during suspend events, waiting for the reconnect or lost events. A leadership role is an expensive thing due to the large amount of state and data caching in each server, which is needed for performance. This means that when one of the ZooKeeper servers goes bad, it is not the case that about one fifth of our servers lose their role - they have some 30 seconds to reconnect to the remaining servers and continue their session there.

The current issues we have are the following:

- A while back there was a networking issue in AWS which caused traffic between the ZooKeeper servers to be partially blocked for some minutes. The ZooKeeper cluster lost its leader, and re-election failed. The app came to a grinding halt. Not good. We have been working on adding keep-alive packets on the election ports between the servers, which we identified as a working solution for that issue. We simulate the problems via iptables. We hope to get that patch submitted in the near future for consideration. This has been reported a while back, with discussions on the best way forward, e.g.
https://issues.apache.org/jira/browse/ZOOKEEPER-1748 (we would prefer application-level keepalive packets instead of the lower-level TCP keepalive socket options).
- While the replacement of a ZooKeeper VM instance works great, there is one remaining issue: how do the application VMs learn about the changed name-to-IP mapping? zookeeperX.app.domain.org no longer maps to the same private IP - the replacement VM has a different one. We tried to work around this by changing the connect string to a shuffled replacement, but that expired the sessions and thus caused the leader latches to close, and in some cases servers could not get their old role back because some spares got there first. We now have a prototype working where we use a special HostProvider implementation which resolves from name to IP when next() is called, instead of at construction time as the default StaticHostProvider does. This means that after the mapping changes, the ZooKeeper client has the new private IP address to connect to. In addition, this does not end the ZooKeeper session, so the leader latches remain (we use a session timeout of about 1-2 minutes). This solution requires a small fix and addition to the ZooKeeper class to enable passing a custom HostProvider; see: https://issues.apache.org/jira/browse/ZOOKEEPER-2107

Hope this helps others running on AWS, and please share your experiences!

thanks
Robert
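PS: for anyone curious, the core idea of that HostProvider can be sketched in plain Java as below. This is only an illustration with made-up names (ReResolvingHostList, etc.), not the actual patch attached to ZOOKEEPER-2107 - the point is just that the DNS lookup happens on every next() call rather than once at construction:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.List;

// Illustrative sketch: resolve hostnames on every next() call instead of
// once at construction (as StaticHostProvider does), so a replacement
// ZooKeeper VM behind the same DNS name gets picked up automatically.
// Class and method names here are ours, not the real ZOOKEEPER-2107 patch.
class ReResolvingHostList {
    private final List<String> hosts;  // e.g. "zookeeper1.app.domain.org"
    private final int port;            // client port, e.g. 2181
    private int index = -1;

    ReResolvingHostList(List<String> hosts, int port) {
        this.hosts = hosts;
        this.port = port;
    }

    // Round-robin over the configured servers, doing a fresh DNS lookup
    // each time so a changed name-to-IP mapping is seen on the next attempt.
    InetSocketAddress next() throws UnknownHostException {
        index = (index + 1) % hosts.size();
        InetAddress current = InetAddress.getByName(hosts.get(index));
        return new InetSocketAddress(current, port);
    }
}
```

Because the session is kept alive by the client, swapping the underlying IP this way does not expire the leader latches.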
