Golden rule: don't use an even number of members in a quorum system. You need a quorum to function, so with 2 masters and quorum=2 you can never take a member down, and with 2 masters and quorum=1 you're asking for "split brain".
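The arithmetic behind that rule can be sketched in a few lines of Python. This is only an illustration of majority quorums, not Mesos code: the function names `majority_quorum`, `tolerated_failures` and `can_make_progress` are made up for this example, and it assumes that a quorum-based component needs at least `quorum` members to respond before it can proceed.

```python
def majority_quorum(masters: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    if masters < 1:
        raise ValueError("need at least one master")
    return masters // 2 + 1


def tolerated_failures(masters: int, quorum: int) -> int:
    """How many masters can be down while a quorum is still reachable."""
    return max(masters - quorum, 0)


def can_make_progress(live_masters: int, quorum: int) -> bool:
    """Assumption: progress requires at least `quorum` live members."""
    return live_masters >= quorum


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        q = majority_quorum(n)
        print(f"masters={n} quorum={q} "
              f"tolerated_failures={tolerated_failures(n, q)}")
    # One master started with quorum=2 can never reach a quorum.
    print(can_make_progress(live_masters=1, quorum=2))
```

Under these assumptions, 2 masters tolerate no more failures than 1 master does, which is why the even count buys nothing, and a lone master configured with quorum=2 never gets enough responses.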
(This is exactly the same with ZooKeeper, by the way; it's also a quorum system.)

If you have 1 master, quorum=1
If you have 3 masters, quorum=2
If you have 5 masters, quorum=3
and so on.

Try that and see if it helps.

On 7 November 2014 09:42, sujinzhao <[email protected]> wrote:
> In fact, I also tried launching 2 masters on two separate machines. At
> first, one of them was successfully elected as leader, and both of them
> printed several lines of messages:
>
> Replica in EMPTY status received a broadcasted recover request
> Received a recover response from a replica in EMPTY status
>
> then the leader master aborted after outputting errors:
>
> Recovery failed: Failed to recover registrar: Failed to perform fetch within
> 1mins
> *** Check failure stack trace: ***
> @ 0x7f3c1ea105cd google::LogMessage::Fail()
> ..............................
>
> Next, the second master became the new leader. It also tried to recover
> from the registrar, but it also failed and printed errors before aborting:
>
> Recovery failed: Failed to recover registrar: Failed to perform fetch within
> 1mins
> *** Check failure stack trace: ***
> @ 0x7f3c1ea105cd google::LogMessage::Fail()
> ...............................
>
> So I guess the problem is not with zookeeper; rather, the elected leader
> cannot recover from the registrar. Could somebody kindly explain some
> principles of the mesos registry, or give me some suggestions?
>
> THANKS.
>
> "david.j.palaitis" <[email protected]> wrote:
>
>
> With a single master, you should not set quorum=2
>
>
> -------- Original message --------
> From: sujinzhao <[email protected]>
> Date: 11/06/2014 4:01 PM (GMT-05:00)
> To: [email protected]
> Cc:
> Subject: Problems of running mesos-0.20.0 with zookeeper
>
> Hi all,
>
> I set up a zookeeper service with three machines zoo1, zoo2, zoo3, and also
> installed 1 mesos master and 2 slaves on another three nodes. I tried to run
> the master and slaves with:
>
> ./mesos-master.sh --ip=master-ip --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=2
>
> ./mesos-slave.sh --ip=slave-ip --master=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos
>
> I also created the /mesos znode before running the above commands, but I got
> the following error:
>
> Recovering from registrar
> Recovering registrar
> Recovery failed: Failed to recover registrar: Failed to perform fetch within
> 1mins
> *** Check failure stack trace: ***
> @ 0x7f3c1ea105cd google::LogMessage::Fail()
> ...............................
>
> After reading the master log, I found that before the error occurred, the
> master had already been elected successfully, but the leader failed to
> recover from the registrar, so I guess this error has little to do with
> zookeeper.
>
> After googling, I found that other people have also encountered this problem,
> but with no solution. I also excluded the possible reason of ssh between
> master/slave and zookeeper servers with no password.
>
> So, could somebody kindly tell me how to solve this error? Any
> suggestions will be appreciated.
>
> THANKS.

