Hi Jakub,

quickly from the logs:

ZK 10.0.1.25 and 10.0.1.28 go down, then after a while 10.0.1.213 is
also reported as down -> ZK loses availability
mesos master keeps trying to reconnect to ZK but has no luck -> after a
while the zookeepers come up and then go down again
then a stack trace from the mesos registrar - cannot recover - cannot reach quorum

your problem looks related to ZK - it seems that after the node
shutdown the ZK cluster is unable to reach quorum and kills itself -
the zookeeper/exhibitor logs should give you the answer
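For reference, the quorum arithmetic behind this: a ZooKeeper ensemble only serves requests while a strict majority of its servers is up, and the mesos-master registrar (replicated log) likewise needs its configured --quorum of masters (with 3 masters that is usually --quorum=2). A minimal Python sketch of the math; the function names are just illustrative, not part of any Mesos or ZooKeeper API:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many members can be lost while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, members_down: int) -> bool:
    """True if the surviving members still form a strict majority."""
    return ensemble_size - members_down >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # 3 ZooKeeper servers, as in the setup described in this thread
    print(quorum_size(3), tolerated_failures(3))  # 2 1
    # 10.0.1.25 and 10.0.1.28 down -> only 1 of 3 servers left
    print(has_quorum(3, 2))  # False
    # a 4-server ensemble tolerates no more failures than a 3-server one
    print(tolerated_failures(4))  # 1
```

So with 3 ZK servers, losing 2 of them at once (as the logs show) is enough to take the whole ensemble down, no matter what happens to the masters.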

what does exhibitor report?




2015-07-08 22:17 GMT+02:00 Jakub Veverka <[email protected]>:
> Hi Guys,
>
> We have a mesos stack up and running and I've started testing what happens
> when I shut down one node of the cluster. It took the cluster 10~20
> minutes to heal, while we were hoping for an instant recovery, or at most
> one that takes 1 minute.
>
> Here is a summary of our setup:
>
> We are running 4 CoreOS hosts.
> Each host is capable of running every mesos component, but always only one
> instance of each per node.
> Every mesos component runs as a docker container:
> - Each host runs a mesos slave.
> - 3 instances of zookeeper (3.4.6) - managed by exhibitor
> - 3 instances of mesos-master (0.22.1)
> - 2 instances of marathon (0.8.2)
>
> The behavior after one node is removed:
> - mesos masters start failing; sometimes a master is elected but it doesn't
> have any slaves or tasks, and later this master fails as well.
> - mesos slave - once a task kept hanging in marathon even though the slave
> had been dead for a long time and the task was unhealthy - probably related
> to this issue - https://github.com/mesosphere/marathon/issues/1279
> - mesos masters keep failing and re-electing a leader for about 10 minutes.
>
> I've googled for a while and it seems that the recommended approach is to
> run separate master and slave nodes
> (http://open.mesosphere.com/getting-started/datacenter/install/).
> Would this solve our issue?
>
> I am also attaching the mesos-master logs from all hosts running a mesos master.
>
> Thanks for any advice,
> Jakub
