ZooKeeper users,

Does anyone know the status of these issues? They don't seem to have had any activity since late 2010.
I think we're currently experiencing the same issue. If we have a 3-node cluster, for example, and one of the nodes is completely dead (i.e. the entire host is uncontactable due to a power outage), I would expect that a quorum could still be formed, but this does not appear to be the case.

I haven't delved into the code too much, but it appears that blocking I/O is used for the connect. This doesn't respect the socket SO timeout that is set, which means the connect() call can block for some arbitrary amount of time (based on the OS-level TCP settings?). This in turn means that leader election will fail, because it times out before the socket connect does, even though enough live hosts are present to form a quorum.

This seems like a fairly fundamental problem, unless I'm missing something. If a single host goes down due to a power failure, for example, it can prevent any further hosts from joining the cluster. In addition, if enough hosts come back online after a power failure to form a quorum but some don't, a quorum may still fail to form.

cheers
Cam
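
P.S. To illustrate the distinction I mean, here is a minimal Java sketch. It is not taken from the ZooKeeper source: the class name, the canConnect helper, the peer address and the 2000 ms value are all hypothetical. In java.net.Socket, SO_TIMEOUT (setSoTimeout) only bounds reads on an already connected socket, while the connection attempt itself is bounded only when a timeout is passed to connect():

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimeoutSketch {

    // Hypothetical helper, not ZooKeeper code: returns true if a TCP connection
    // to addr is established within connectTimeoutMs, false otherwise.
    public static boolean canConnect(InetSocketAddress addr, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            // SO_TIMEOUT only bounds blocking reads on an established connection;
            // it has no effect on how long connect() itself may block.
            socket.setSoTimeout(connectTimeoutMs);
            // The two-argument connect() is what bounds the connection attempt.
            // On expiry it throws SocketTimeoutException, which is an IOException.
            socket.connect(addr, connectTimeoutMs);
            return true;
        } catch (IOException e) {
            // Covers timeouts, refused connections and unreachable hosts alike.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical peer address and timeout, chosen for illustration only.
        InetSocketAddress peer = new InetSocketAddress("192.0.2.1", 3888);
        long start = System.nanoTime();
        boolean ok = canConnect(peer, 2000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("connected=" + ok + " after " + elapsedMs + " ms");
    }
}
```

Against a powered-off host, a connect() call without the timeout argument would typically block until the OS gives up on the TCP handshake, which is the behaviour I described above.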
