Hi all,

I'd like to get some feedback on running ZooKeeper on VMs in public clouds. If you have experience with this, could you please share it? What issues did you run into? Were you able to overcome them, and how? At the end of the day, were you able to get this to work reliably?

Some of the issues we know we need to worry about:

1. Making sure replicas are in different 'availability zones'. Without this, your VMs might even be running on the same physical machine.

2. Lack of fixed IPs. I believe that in clouds every VM is typically allocated a new IP, so if you're e.g. upgrading a cluster, you can't keep the existing IPs for the new VMs. Our solution is to use our cloud provider's support for getting a set of fixed IPs which can be dynamically bound to whichever VMs we want (aka "portable IP" on SoftLayer; I believe there is similar support on other providers). Dynamic reconfig probably opens up new options, but it will be a while before it is supported in a stable version. We prefer to use a stable ZooKeeper, unless there is feedback that the pros of using the more recent ZK versions outweigh the cons.

3. Isolation from other VMs on the same physical machine. This seems especially important for getting decent performance from the log disk. It can be partially dealt with by placing the log on a non-local disk with guaranteed IOPS, as some providers support.

4. Write caching of disk I/O. Making sure there are no layers which cache disk writes, so that writes which have been acknowledged have really reached the disk. Perhaps it's not that big of an issue, given that the provider might have backup power? What are your thoughts here?

5. Clock-related issues on VMs. It seems people have seen VM clocks skipping ahead or even going backwards, which caused e.g. ZooKeeper session disconnections. We're not entirely clear on what exactly we need to do to avoid this. Any help or pointers are appreciated. This might be less of an issue in the more recent ZK versions but, again, these are not yet stable. Cf. https://issues.apache.org/jira/browse/ZOOKEEPER-1616

Any additional issues to look out for?

Thanks,
Guy
