Hi Jakub, quickly from the logs:

ZK 10.0.1.25 and 10.0.1.28 go down, then after a while 10.0.1.213 is also
reported as down
-> ZK lost availability
mesos master is trying to reconnect to ZK but has no luck
-> after a while the zookeepers go up and then down again
then stack trace from mesos registrar - cannot recover - cannot reach quorum

Your problem should be related to ZK - it seems like after the node shutdown
the ZK cluster is unable to reach quorum and kills itself -
zookeeper/exhibitor logs should give you the answer.

What does exhibitor report?

2015-07-08 22:17 GMT+02:00 Jakub Veverka <[email protected]>:

> Hi Guys,
>
> We have mesos stack up and running and I've started testing what happens
> when I shut down one node from the cluster. The result was that healing of
> the cluster took 10~20 minutes and we were hoping for an instant or at most
> 1 minute long recovery.
>
> Here is a summary of our setup:
>
> We are running 4 CoreOS hosts.
> Each host is capable of running every mesos component but always only once
> per node.
> Every mesos component is running as a docker container:
> - Each host is running mesos slave.
> - 3 instances of zookeeper (3.4.6) - managed by exhibitor
> - 3 instances of mesos-master (0.22.1)
> - 2 instances of marathon (0.8.2)
>
> The behavior after one node is removed is:
> - mesos masters start failing, sometimes a master is elected but it doesn't
>   have any slaves or tasks, later this master fails as well.
> - mesos slave - once there was a task hanging in marathon even though the
>   slave was dead for a long time and the task was unhealthy - probably
>   related to this issue - https://github.com/mesosphere/marathon/issues/1279
> - mesos master keeps failing and re-electing leader for about 10 minutes.
>
> I've googled a while and it seems that the recommended concept is to run
> separate master and slave nodes
> (http://open.mesosphere.com/getting-started/datacenter/install/).
> Should this solve our issue?
>
> I am also attaching mesos-master logs from all hosts running mesos master.
>
> Thanks for any advice,
> Jakub
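
P.S. To make the quorum point concrete, here is a rough Python sketch, not a definitive tool. It assumes the three ZK servers listen on the default client port 2181 (not confirmed anywhere in this thread) and that the `ruok` four-letter command is enabled, which is the default on 3.4.6. The helper names (`majority`, `has_quorum`, `zk_ruok`, `report`) are made up for illustration. With a 3-node ensemble the majority is 2, so with two servers down, as the logs seem to show, the remaining one cannot serve requests.

```python
import socket


def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(alive: int, ensemble_size: int) -> bool:
    """True if the servers still alive can form a quorum."""
    return alive >= majority(ensemble_size)


def zk_ruok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Send ZooKeeper's 'ruok' four-letter command; a running server answers 'imok'.

    Port 2181 is the ZooKeeper default and an assumption about this setup.
    Note that 'imok' only means the process is up, not that it is in quorum.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False


def report(hosts: list) -> bool:
    """Probe every host and say whether enough of them answer to form a quorum."""
    alive = sum(1 for host in hosts if zk_ruok(host))
    ok = has_quorum(alive, len(hosts))
    print(f"{alive}/{len(hosts)} ZK servers answered, "
          f"need {majority(len(hosts))}: quorum {'possible' if ok else 'LOST'}")
    return ok
```

Running `report(["10.0.1.25", "10.0.1.28", "10.0.1.213"])` from one of the hosts while you shut a node down should show when the ensemble drops below 2 responding servers. The same majority rule applies to the mesos registrar: with 3 masters, the `--quorum` flag should be 2.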

