Hi Cameron,

Which version of ZK are you using? Also, if you can share logs, then it might be easier for us to help you out.

-Flavio

> -----Original Message-----
> From: Cameron McKenzie [mailto:[email protected]]
> Sent: 30 April 2014 08:44
> To: [email protected]
> Subject: ZOOKEEPER-900 / 901 / 1678
>
> ZooKeeper users,
> Does anyone know the status of these issues? They don't seem to have had
> anything done to them since late 2010?
>
> I think that we're experiencing the same issue currently. If we have a 3 node
> cluster for example, and 1 of these nodes is completely dead (i.e the entire
> host is not contactable due to a power outage), I would expect that a
> quorum could still be formed, but this does not appear to be the case.
>
> I haven't delved into the code too much, but it appears that blocking IO is
> being used for the connect. This doesn't respect the socket SO timeout being
> set, so it means that the connect() call can block for some arbitrary amount of
> time (based on the OS level TCP settings?). This in turn means that leader
> election will fail because it times out before the socket connect does, even
> though there are enough live hosts present to form a quorum.
>
> This seems like a fairly fundamental problem, unless I'm missing something.
> If a single host goes down due to a power failure for example, it can prevent
> any further hosts joining the cluster. In addition, if after a power failure,
> enough hosts come back online to form a quorum, but some don't, that a
> quorum may still not be able to be formed.
> cheers
> Cam
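
On the connect() point in the quoted message: in plain java.net, SO_TIMEOUT (setSoTimeout) only bounds blocking reads, while the connection attempt is bounded only when connect() is given an explicit timeout. The sketch below illustrates that distinction with standard JDK calls. It is not taken from the ZooKeeper source; the class name, the canConnect helper, and the default host and port (3888, the election port used in typical example configs) are assumptions for illustration only.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimeoutSketch {
    // Hypothetical helper: returns true if a TCP connection to host:port
    // is established within timeoutMs, false on timeout or any other I/O error.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // SO_TIMEOUT bounds blocking reads only; it does not affect connect().
            socket.setSoTimeout(timeoutMs);
            // The two-argument connect() is what bounds the connection attempt.
            // Without it, an unreachable host can block until the OS gives up.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // SocketTimeoutException is an IOException, so a timed-out
            // connect ends up here as well as a refused one.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults for illustration; pass host and port as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 3888;
        System.out.println(canConnect(host, port, 5000)
                ? "connected"
                : "not reachable within timeout");
    }
}
```

Running this against a host that is powered off should return after roughly the given timeout rather than after the OS-level TCP connect timeout, which is the behaviour the quoted message expects from leader election.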
