Geoffroy, could you please provide the master logs (both from the killed master and from the one taking over)?
On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:

> Hello,
>
> we are facing some unexpected issues when testing the high availability
> behavior of our Mesos cluster.
>
> *Our use case:*
>
> *State*: the Mesos cluster is up (3 machines), 1 Docker task is running
> on each slave (started from Marathon).
>
> *Action*: stop the Mesos master leader process.
>
> *Expected*: the Mesos master leader changes, *active tasks remain
> unchanged*.
>
> *Seen*: the Mesos master leader changes, *all active tasks are now FAILED,
> but the Docker containers are still running*; Marathon detects the FAILED
> tasks and starts new tasks. We end up with 2 Docker containers running on
> each machine, but only one is linked to a RUNNING Mesos task.
>
> Is the seen behavior correct?
>
> Have we misunderstood the high availability concept? We thought this use
> case would have no impact on the current cluster state (except for the
> leader re-election).
>
> Thanks in advance for your help.
> Regards
>
> ---------------------------------------------------
>
> Our setup is the following: 3 identical Mesos nodes, each with:
> + zookeeper
> + docker 1.5
> + mesos master 0.21.1 configured in HA mode
> + mesos slave 0.21.1 configured with checkpointing, strict recovery, and
>   reconnect
> + marathon 0.8.0 configured in HA mode with checkpointing
>
> ---------------------------------------------------
>
> Command lines:
>
> *mesos-master*
> /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
> --port=5050 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19
> --ip=10.195.30.19 --quorum=2 --slave_reregister_timeout=1days
> --work_dir=/var/lib/mesos
>
> *mesos-slave*
> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
> --checkpoint --containerizers=docker,mesos
> --executor_registration_timeout=5mins --hostname=10.195.30.19
> --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>
> *marathon*
> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/marathon
> --master zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
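As a side note for anyone reproducing this setup: the masters, slaves, and Marathon in the command lines above all point at the same three-node ZooKeeper ensemble and differ only in the znode path (/mesos vs. /marathon), which is where each component registers and discovers its leader. A minimal shell sketch of how such a zk:// connection string breaks down (variable names are just for illustration):

```shell
# Split a zk:// connection string, as used in the --zk/--master flags
# above, into its ZooKeeper ensemble hosts and its znode path.
zk="zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos"
rest=${zk#zk://}     # strip the scheme
hosts=${rest%%/*}    # comma-separated host:port ensemble list
path=/${rest#*/}     # znode path where the leader is registered
echo "$hosts"        # 10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181
echo "$path"         # /mesos
```

All three ensemble members must be listed so that leader election keeps working when any single node (ZooKeeper or Mesos master) goes down, which is exactly the failover scenario being tested here.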