In addition to what Dick said, you need to make sure that you have a quorum
of masters *online* in order for a master to recover correctly. This means
you'll want to run the master under a tool (e.g. Monit) that restarts it
promptly upon failure.
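
If you don't have Monit or similar set up yet, the idea can be sketched as a plain shell wrapper. This is only an illustration of "restart on failure", not a replacement for a real supervisor; restart_on_failure and RESTART_DELAY are hypothetical names made up for this example:

```shell
#!/bin/sh
# restart_on_failure MAX_RESTARTS CMD [ARGS...]
# Hypothetical helper: runs CMD and, if it exits non-zero, starts it again,
# up to MAX_RESTARTS restarts. Returns 0 as soon as CMD exits cleanly and
# 1 once the restart budget is used up.
restart_on_failure() {
    max_restarts=$1
    shift
    restarts=0
    until "$@"; do
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts" >&2
            return 1
        fi
        restarts=$((restarts + 1))
        echo "process exited abnormally, restart #$restarts" >&2
        # RESTART_DELAY is an assumed knob for this sketch (seconds, default 1).
        sleep "${RESTART_DELAY:-1}"
    done
}
```

You would call it with the same command line you already use, e.g. restart_on_failure 100 ./mesos-master.sh followed by your usual --ip, --zk and --quorum flags. Note that restarting only helps once the quorum value itself is sane; a lone master started with --quorum=2 will keep failing no matter how often it is restarted.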

You'll want to do this for the slaves as well.
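
As for picking the --quorum value, Dick's rule quoted below amounts to a strict majority of the masters, i.e. floor(N/2) + 1. A minimal sketch of that arithmetic (quorum_for is just an illustrative helper name, not part of Mesos):

```shell
#!/bin/sh
# quorum_for N: print the smallest strict majority of N masters,
# i.e. floor(N / 2) + 1 (shell integer division already rounds down).
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

# Print the quorum for the master counts mentioned in the thread.
for masters in 1 3 5; do
    echo "masters=$masters quorum=$(quorum_for "$masters")"
done
```

For the single-master setup in the original post this gives 1, so that master should be started with --quorum=1 rather than --quorum=2.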

On Thu, Nov 6, 2014 at 11:36 PM, Dick Davies <d...@hellooperator.net> wrote:

> Golden rule: don't use even numbers of members with quorum systems.
>
> You need a quorum to function so with 2 masters and quorum=2, you can't
> ever take a member down. With 2 masters and quorum=1, you're asking
> for "split brain".
>
> (this is exactly the same with zookeeper by the way, it's also a quorum
> system)
>
> If you have 1 master, quorum=1
> if you have 3 masters, quorum=2
> if you have 5 masters, quorum=3
>
> and so on. Try that and see if it helps.
>
>
> On 7 November 2014 09:42, sujinzhao <sujinz...@gmail.com> wrote:
> > In fact, I also tried launching 2 masters on two separate machines. At
> > first, one of them was successfully elected as leader, and both of them
> > printed several lines of messages:
> >
> > Replica in EMPTY status received a broadcasted recover request
> > Received a recover response from a replica in EMPTY status
> >
> > Then the leader master aborted after outputting errors:
> >
> > Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> > @ 0x7f3c1ea105cd google::LogMessage::Fail()
> > ..............................
> >
> > Next, the second master became the new leader. It also tried to recover
> > from the registrar, but it also failed and printed errors before aborting:
> >
> > Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> > @ 0x7f3c1ea105cd google::LogMessage::Fail()
> > ...............................
> >
> > So I guess this is not a zookeeper problem; rather, the elected leader
> > cannot recover from the registrar. Could somebody kindly explain some of
> > the principles of the mesos registry, or give me some suggestions?
> >
> > THANKS.
> >
> > "david.j.palaitis" <david.j.palai...@gmail.com> wrote:
> >
> >
> > With a single master, you should not set quorum=2
> >
> >
> > -------- Original message --------
> > From: sujinzhao <sujinz...@gmail.com>
> > Date: 11/06/2014 4:01 PM (GMT-05:00)
> > To: user@mesos.apache.org
> > Cc:
> > Subject: Problems of running mesos-0.20.0 with zookeeper
> >
> > Hi all,
> >
> > I set up a zookeeper service on three machines (zoo1, zoo2, zoo3) and
> > installed 1 mesos master and 2 slaves on another three nodes. I tried to
> > run the master and slaves with:
> > ./mesos-master.sh --ip=master-ip
> > --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=2
> >
> > ./mesos-slave.sh --ip=slave-ip
> > --master=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos
> >
> > I also created the /mesos znode before running the above commands, but I
> > got the following error:
> >
> > Recovering from registrar
> > Recovering registrar
> > Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> >     @  0x7f3c1ea105cd google::LogMessage::Fail()
> > ...............................
> >
> > After reading the master log, I found that the master had already been
> > elected successfully before the error occurred, but the leader then
> > failed to recover from the registrar, so I guess this error has little
> > to do with zookeeper.
> >
> > After googling, I found that other people have also encountered this
> > problem, but with no solution. I also ruled out password-less ssh between
> > master/slave and the zookeeper servers as a possible cause.
> >
> > So, could somebody kindly tell me how to solve this error? Any
> > suggestions will be appreciated.
> >
> > THANKS.
>
