Just as a supporting note: from what I have read, tolerating n simultaneous failures requires 2n+1 nodes. In this scenario there are two simultaneous failures (n = 2), so we would need 5 nodes to operate correctly. It might be a good idea to capture this formula in the code and, if more than n failures occur, write the appropriate flags, which can then be used to choose the right recovery state.
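For what it's worth, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the 2n+1 rule; the function names are made up for this note and are not part of ZooKeeper.

```python
def min_ensemble_size(failures: int) -> int:
    """Nodes needed to tolerate `failures` simultaneous failures (2n + 1)."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return 2 * failures + 1


def max_tolerated_failures(ensemble_size: int) -> int:
    """Largest n such that 2n + 1 <= ensemble_size."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


def exceeds_tolerance(ensemble_size: int, failed: int) -> bool:
    """True when more servers have failed than the ensemble can tolerate.

    A caller could use this result to decide whether to write a recovery
    flag (hypothetical; no such flag exists in ZooKeeper as far as I know).
    """
    return failed > max_tolerated_failures(ensemble_size)
```

With 3 nodes, `max_tolerated_failures(3)` is 1, so the two failures in Thomas's scenario (A down, B having lost its data) are one more than the ensemble can absorb, while `min_ensemble_size(2)` gives the 5 nodes mentioned above.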
Cheers
<k/>

|-----Original Message-----
|From: Benjamin Reed [mailto:br...@yahoo-inc.com]
|Sent: Wednesday, December 17, 2008 11:48 AM
|To: firstname.lastname@example.org
|Subject: RE: What happens when a server loses all its state?
|
|Thomas,
|
|in the scenario you give you have two simultaneous failures with 3
|nodes, so it will not recover correctly. A is failed because it is not
|up. B has failed because it lost all its data.
|
|it would be good for ZooKeeper to not come up in that scenario. perhaps
|what we need is something similar to your safe state proposal. basically
|a server that has forgotten everything should not be allowed to vote in
|the leader election. that would avoid your scenario. we just need to put
|a flag file in the data directory to say that the data is valid and thus
|can vote.
|
|ben
|________________________________________
|From: thomas.john...@sun.com [thomas.john...@sun.com]
|Sent: Tuesday, December 16, 2008 4:02 PM
|To: email@example.com
|Subject: Re: What happens when a server loses all its state?
|
|Mahadev Konar wrote:
|> Hi Thomas,
|>
|>> More generally, is it a safe assumption to make that the ZooKeeper
|>> service will maintain all its guarantees if a minority of servers lose
|>> persistent state (due to bad disks, etc) and restart at some point in
|>> the future?
|>>
|> Yes that is true.
|>
|Great - thanks Mahadev.
|
|Not to drag this on more than necessary, please bear with me for one
|more example of 'amnesia' that comes to mind. I have a set of ZooKeeper
|servers A, B, C.
|- C is currently not running, A is the leader, B is the follower.
|- A proposes zxid1 to A and B, both acknowledge.
|- A asks A to commit (which it persists), but before the same commit
|request reaches B, all servers go down (say a power failure).
|- Later, B and C come up (A is slow to reboot), but B has lost all state
|due to disk failure.
|- C becomes the new leader and perhaps continues with some more new
|transactions.
|
|Likely I'm misunderstanding the protocol, but have I effectively lost
|zxid1 at this point? What would happen when A comes back up?
|
|Thanks.
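
To make Ben's flag-file idea a little more concrete, here is a rough Python sketch of how a server might decide whether it is allowed to vote. Everything in it is an assumption for illustration: the marker name `data_valid` and both function names are invented, and this is not how ZooKeeper behaves today.

```python
from pathlib import Path

# Hypothetical marker file name; not an actual ZooKeeper file.
VALID_MARKER = "data_valid"


def mark_data_valid(data_dir: Path) -> None:
    """Record that this data directory holds state that can be trusted."""
    data_dir.mkdir(parents=True, exist_ok=True)
    (data_dir / VALID_MARKER).touch()


def may_vote(data_dir: Path) -> bool:
    """A server whose data directory lacks the marker (for example after a
    disk replacement wiped it) is treated as having forgotten everything
    and is kept out of leader election."""
    return (data_dir / VALID_MARKER).is_file()
```

In the A, B, C example, B would come back with an empty data directory, `may_vote` would return False, and B and C alone could not elect C as leader. How a brand-new or wiped server legitimately gets the marker (presumably only after syncing from an established leader) is left open here.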