Currently you need an odd number of masters, because the replicated log has a built-in assumption that the number of masters = 2*quorum - 1. This assumption comes into play when bootstrapping the log from no data.
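As a quick sanity check on that relation, here is a minimal Python sketch. The function names are made up for illustration and are not part of Mesos; it only does the arithmetic described above and does not talk to a cluster.

```python
def majority_quorum(num_masters: int) -> int:
    """Smallest quorum that is a strict majority of num_masters."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    return num_masters // 2 + 1


def satisfies_bootstrap_assumption(num_masters: int, quorum: int) -> bool:
    """True when num_masters == 2 * quorum - 1 (the relation stated above)."""
    return num_masters == 2 * quorum - 1


if __name__ == "__main__":
    # Print, for small cluster sizes, whether a majority quorum also
    # satisfies the 2*quorum - 1 relation.
    for n in range(1, 8):
        q = majority_quorum(n)
        ok = satisfies_bootstrap_assumption(n, q)
        print(f"masters={n} quorum={q} assumption-holds={ok}")
```

With a majority quorum, the relation holds only for odd master counts; 4 masters with quorum 3 does not satisfy it.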
To recover from this, you need to run an odd number of masters, and set your quorum correctly. For example, 3 masters with quorum 2, or 5 masters with quorum 3. It is safe to wipe the replica logs before doing this.

There are some outstanding tickets to clean this up:
https://issues.apache.org/jira/browse/MESOS-1465
https://issues.apache.org/jira/browse/MESOS-1546

We'd like to have the configuration be explicit about the total number of masters, so that the assumption need not be made.

On Tue, Jul 22, 2014 at 2:40 AM, Tomas Barton <[email protected]> wrote:

> Hi,
>
> what is the best way to upgrade Mesos cluster from 0.18 to 0.19? I've
> tried to read all documentation before doing actual upgrade, but I still
> don't understand a few things.
>
> What should be the quorum size?
>
> The --help says that "It is imperative to set this value to be a majority
> of masters i.e., quorum > (number of masters)/2"
>
> I have 4 Mesos masters, which would mean that quorum > 2 -> quorum=3,
> right?
>
> The recover.cpp says that: "we allow a replica in EMPTY status to become
> VOTING immediately if it finds ALL (i.e., 2 * quorum - 1) replicas are in
> EMPTY status"
> So, with quorum = 3 I would need 5 Mesos masters (that's just not clear
> from the mesos-master --help).
>
> quorum=1, mesos-masters=1
> quorum=2, mesos-masters=3
> quorum=3, mesos-masters=5
> quorum=4, mesos-masters=7
>
> Is is possible to have non-even number of Mesos masters? or is it just a
> bad idea?
>
> With 4 masters I got into a situation when:
>
> master 1:
> I0722 11:35:40.708562 12689 replica.cpp:638] Replica in VOTING status
> received a broadcasted recover request
>
> master 2:
> I0722 11:36:37.593647 7754 replica.cpp:638] Replica in EMPTY status
> received a broadcasted recover request
>
> master 3:
> I0722 11:35:14.102762 26701 recover.cpp:188] Received a recover response
> from a replica in STARTING status
>
> master 4:
> I0722 11:35:54.284169 32056 replica.cpp:638] Replica in STARTING status
> received a broadcasted recover request
> I0722 11:35:54.284425 32050 recover.cpp:188] Received a recover response
> from a replica in STARTING status
> I0722 11:35:54.284788 32057 recover.cpp:188] Received a recover response
> from a replica in VOTING status
> I0722 11:35:54.285127 32050 recover.cpp:188] Received a recover response
> from a replica in EMPTY status
>
> And the election algorithm ends up in an endless loop. How can I recover
> from this? Delete all replica logs from master disk? Start with quorum=1
> and increment number of masters?
>
> Thanks,
> Tomas
>

