Hello all,

It appears that ZooKeeper is subject to the Linux leap second bug that has caused problems with Cassandra and other services. At least, that is what I discovered after 6 hours of trying to figure out why my cluster wasn't giving me a quorum.
A link to the kernel bug report is at
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

As for what you might see in your logs: I saw a lost quorum and insanely high load on my servers. When I shut down ZooKeeper to bring it back up, one machine would report a read timeout during leader election, then report that the server told it to shut down. After that, it would stay stuck in the LOOKING phase forever, while another machine might be stuck in any other phase of the election.

The fix is simple, though. Just stop ZooKeeper, execute

date -s "`date`"

or restart your NTP daemon, then start ZooKeeper back up. You MUST restart ZooKeeper; otherwise the election state doesn't recover (or, at least, it didn't recover for me).

Hope this helps save someone else the 7 hours of agony I just went through.

Scott Fines
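
P.S. For anyone who wants the stop / reset clock / start sequence as a script, here is a minimal sketch. It is not something I ran in this exact form. It assumes zkServer.sh (the control script shipped in ZooKeeper's bin directory) is on your PATH; the ZK_STOP, ZK_START and DRY_RUN variables and both function names are made up for illustration, so substitute whatever your installation uses. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch of the recovery steps: stop ZooKeeper, re-set the clock, start it.
# ASSUMPTION: zkServer.sh is on PATH. Override ZK_STOP / ZK_START if your
# installation uses an init script or service manager instead.
ZK_STOP="${ZK_STOP:-zkServer.sh stop}"
ZK_START="${ZK_START:-zkServer.sh start}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_after_leap_second() {
    # $ZK_STOP and $ZK_START are left unquoted on purpose so that
    # "zkServer.sh stop" splits into a command and its argument.
    run $ZK_STOP
    # Same effect as date -s "`date`": set the clock to its current
    # value. Needs root when actually executed.
    run date -s "$(date)"
    run $ZK_START
}

reset_after_leap_second
```

Run it once as-is on each server to check the printed commands, then run it as root with DRY_RUN=0 to apply them.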
