Doh - that makes total sense. For whatever reason I thought with 2 servers you couldn't get a majority :P
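
For the archives, the majority arithmetic behind Henry's answer can be sketched in a few lines of Python. This is only an illustration of strict-majority quorums; quorum_size, tolerated_failures and has_quorum are made-up helper names, not anything from the ZooKeeper code base.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if the running servers still form a strict majority."""
    return running >= quorum_size(ensemble_size)


# Ensemble sizes mentioned in this thread: 3, 5 and 12 servers.
for n in (3, 5, 12):
    print(n, "servers: quorum", quorum_size(n),
          "tolerates", tolerated_failures(n), "down")
```

So 2 of 3 is a majority, and with 12 servers the quorum is 7, which is why 5 down still works but a 6th failure would not.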

On Tue, Jan 12, 2010 at 3:17 PM, Henry Robinson <he...@cloudera.com> wrote:
> Hi Adam -
>
> As long as a quorum of servers is running, ZK will be live. With majority
> quorums, 2/3 is enough to keep going. In general, if fewer than half your
> nodes have failed, ZK will keep on keeping on.
>
> The main concern with a cluster of 2/3 machines is that a single further
> failure will bring down the whole cluster.
>
> Henry
>
> 2010/1/12 Adam Rosien <a...@rosien.net>
>
>> I have a related question: what's the behavior of a cluster of 3 when
>> one is down? I've tried it and a leader is elected, but are there any
>> other caveats for this situation?
>>
>> .. Adam
>>
>> On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt <ph...@apache.org> wrote:
>> > 12 servers? That's a lot, if you don't mind my asking why so many?
>> > Typically we recommend 5 - that way you can have one down for
>> > maintenance and still have a failure that doesn't bring down the
>> > cluster.
>> >
>> > The "electing a leader" is probably the restarted machine attempting to
>> > re-join the ensemble (it should join as a follower if you have a leader
>> > already elected, given that its xid is behind the existing leader.)
>> > Hard to tell though without the logs.
>> >
>> > You might also be seeing the initLimit exceeded, is the data you are
>> > storing in ZK large? Or perhaps network connectivity is slow?
>> >
>> > http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
>> > again the logs would give some insight on this.
>> >
>> > Patrick
>> >
>> > Nick Bailey wrote:
>> >>
>> >> We are running zookeeper 3.1.0
>> >>
>> >> Recently we noticed the cpu usage on our machines becoming
>> >> increasingly high and we believe the cause is
>> >>
>> >> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>> >>
>> >> However our solution when we noticed the problem was to kill the
>> >> zookeeper process and restart it.
>> >>
>> >> After doing that though it looks like the newly restarted zookeeper
>> >> server is continually attempting to elect a leader even though one
>> >> already exists.
>> >>
>> >> The process responds with 'imok' when asked, but the stat command
>> >> returns 'ZooKeeperServer not running'.
>> >>
>> >> I believe that killing the current leader should trigger all servers
>> >> to do an election and solve the problem, but I'm not sure. Should
>> >> that be the course of action in this situation?
>> >>
>> >> Also we have 12 servers, but 5 are currently not running according to
>> >> stat. So I guess this isn't a problem unless we lose another one.
>> >> We have plans to upgrade zookeeper to solve the cpu issue but haven't
>> >> been able to do that yet.
>> >>
>> >> Any help appreciated, Nick Bailey
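
As a footnote to Nick's 'imok' and stat checks: those are ZooKeeper's four-letter commands, sent as plain text to the client port, after which the server replies and closes the connection. A rough Python sketch of issuing one is below. four_letter_word is a hypothetical helper name, and the host and port in the usage comment are assumptions about the deployment (2181 is only the conventional default clientPort).

```python
import socket


def four_letter_word(host: str, port: int, cmd: str, timeout: float = 5.0) -> str:
    """Send a four-letter command (e.g. 'ruok' or 'stat') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once it has answered,
            # so read until recv() returns an empty byte string.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example usage (assumes a server listening on localhost:2181):
#   print(four_letter_word("localhost", 2181, "ruok"))
#   print(four_letter_word("localhost", 2181, "stat"))
```

Running both commands against each server in turn is a quick way to see which ones answer 'imok' but are not actually serving, as in the situation described above.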