Hello,
we are facing some unexpected issues when testing the high-availability
behavior of our Mesos cluster.
*Our use case:*
*State*: the Mesos cluster is up (3 machines), 1 Docker task is running on
each slave (started from Marathon)
*Action*: stop the leading Mesos master process
*Expected*: a new Mesos master leader is elected, *active tasks remain unchanged*
*Seen*: a new Mesos master leader is elected, *all active tasks are now FAILED
but the Docker containers are still running*; Marathon detects the FAILED tasks
and starts new tasks. We end up with 2 Docker containers running on each machine,
but only one of them is linked to a RUNNING Mesos task.
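To make the mismatch concrete, here is a minimal sketch of the per-host comparison between RUNNING Mesos tasks and running Docker containers. It does not query Mesos or Docker; the function name `orphan_containers`, the input shapes, and the sample data are assumptions chosen only to mirror the situation described above.

```python
from collections import Counter


def orphan_containers(tasks, containers):
    """Count, per host, running containers that have no RUNNING Mesos task.

    tasks: iterable of (host, task_state) tuples (hypothetical shape)
    containers: iterable of host names, one entry per running container
    """
    running_tasks = Counter(host for host, state in tasks if state == "TASK_RUNNING")
    running_containers = Counter(containers)
    return {
        host: count - running_tasks.get(host, 0)
        for host, count in running_containers.items()
        if count > running_tasks.get(host, 0)
    }


# Illustrative data only: after the failover each host has one FAILED task,
# one new RUNNING task, and two containers still running.
hosts = ["10.195.30.19", "10.195.30.20", "10.195.30.21"]
tasks = [(h, "TASK_FAILED") for h in hosts] + [(h, "TASK_RUNNING") for h in hosts]
containers = hosts + hosts

orphans = orphan_containers(tasks, containers)
print(orphans)
```

With this sample data, each host ends up with one container that no RUNNING task accounts for.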
Is the observed behavior correct?
Have we misunderstood the high-availability concept? We thought that running
this scenario would not have any impact on the current cluster state
(except for the leader re-election).
Thanks in advance for your help
Regards
---------------------------------------------------
our setup is the following:
3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing
---------------------------------------------------
Command lines:
*mesos-master*
/usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
--port=5050 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
--quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
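For reference, the --quorum=2 value corresponds to a strict majority of the 3 masters. A minimal sketch of that arithmetic follows; `majority_quorum` is a hypothetical helper name used only for illustration and is not part of Mesos.

```python
def majority_quorum(master_count):
    """Smallest number of masters that forms a strict majority."""
    if master_count < 1:
        raise ValueError("master_count must be at least 1")
    return master_count // 2 + 1


# The setup in this post has 3 masters.
quorum = majority_quorum(3)
print(quorum)
```

For 3 masters this gives 2, which matches the flag used here.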
*mesos-slave*
/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
--checkpoint --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.30.19
--ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
--recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
*marathon*
java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
-Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
/usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
--local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
--hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port 8443
--checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/marathon
--master zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos