I have a related question: what's the behavior of a cluster of 3 when one is down? I've tried it and a leader is elected, but are there any other caveats for this situation?
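For context, here is the majority arithmetic as I understand it, as a rough sketch. It assumes the default majority quorum (no weighted or hierarchical quorums, no observers), and the helper names are my own, not anything from the ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the configured ensemble."""
    return ensemble_size // 2 + 1


def failures_tolerated(ensemble_size: int, servers_down: int = 0) -> int:
    """How many MORE servers can fail before quorum is lost.

    A negative result means quorum is already lost.
    """
    servers_up = ensemble_size - servers_down
    return servers_up - quorum_size(ensemble_size)


# (ensemble size, servers currently down) for the cases in this thread
for n, down in [(3, 1), (5, 1), (12, 5)]:
    print(f"ensemble={n} down={down} quorum={quorum_size(n)} "
          f"more_failures_ok={failures_tolerated(n, down)}")
```

If that is right, a 3-server ensemble with one down still has its quorum of 2 but cannot lose another server, and the same holds for 12 servers with 5 down (7 up, quorum 7). With 5 servers and one down for maintenance, one further failure is still survivable.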
.. Adam

On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt <ph...@apache.org> wrote:
> 12 servers? That's a lot; if you don't mind my asking, why so many? Typically
> we recommend 5 - that way you can have one down for maintenance and still
> have a failure that doesn't bring down the cluster.
>
> The "electing a leader" is probably the restarted machine attempting to
> re-join the ensemble (it should join as a follower if you have a leader
> already elected, given that its xid is behind the existing leader.) Hard to
> tell though without the logs.
>
> You might also be seeing the initLimit exceeded, is the data you are storing
> in ZK large? Or perhaps network connectivity is slow?
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
> again the logs would give some insight on this.
>
>
> Patrick
>
> Nick Bailey wrote:
>>
>> We are running zookeeper 3.1.0
>>
>> Recently we noticed the cpu usage on our machines becoming
>> increasingly high and we believe the cause is
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>
>> However our solution when we noticed the problem was to kill the
>> zookeeper process and restart it.
>>
>> After doing that though it looks like the newly restarted zookeeper
>> server is continually attempting to elect a leader even though one
>> already exists.
>>
>> The process responds with 'imok' when asked, but the stat command
>> returns 'ZooKeeperServer not running'.
>>
>> I believe that killing the current leader should trigger all servers
>> to do an election and solve the problem, but I'm not sure. Should
>> that be the course of action in this situation?
>>
>> Also we have 12 servers, but 5 are currently not running according to
>> stat. So I guess this isn't a problem unless we lose another one.
>> We have plans to upgrade zookeeper to solve the cpu issue but haven't
>> been able to do that yet.
>>
>> Any help appreciated, Nick Bailey
>>
>
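On the 'imok' versus 'ZooKeeperServer not running' mismatch quoted above: a rough sketch of how one might poll both commands from a script. It assumes the four-letter commands are sent as plain text to the client port (2181 by default) and that the server closes the connection after replying. The function names are hypothetical, and the "not running" check simply matches the message Nick reported, so other versions may word it differently.

```python
import socket


def four_letter_word(host: str, port: int, word: str, timeout: float = 5.0) -> str:
    """Send a four-letter command such as 'ruok' or 'stat' and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_serving(host: str, port: int) -> bool:
    """True only if the process answers 'imok' AND 'stat' does not report
    the 'not running' message seen in this thread (an assumed heuristic)."""
    alive = four_letter_word(host, port, "ruok").strip() == "imok"
    stat = four_letter_word(host, port, "stat")
    return alive and "not running" not in stat
```

The point of checking both is that 'ruok' only shows the process is up, while 'stat' shows whether the server has actually joined the ensemble and is serving requests.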