Geoffroy, could you please provide the master logs (both from the killed master and from the one taking over)?
On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:

> Hello,
>
> we are facing some unexpected issues when testing the high availability
> behavior of our Mesos cluster.
>
> *Our use case:*
>
> *State*: the Mesos cluster is up (3 machines), 1 Docker task is running
> on each slave (started from Marathon).
>
> *Action*: stop the Mesos master leader process.
>
> *Expected*: the Mesos master leader changes, *active tasks remain
> unchanged*.
>
> *Seen*: the Mesos master leader changes, *all active tasks are now FAILED,
> but the Docker containers are still running*; Marathon detects the FAILED
> tasks and starts new tasks. We end up with 2 Docker containers running on
> each machine, but only one is linked to a RUNNING Mesos task.
>
> Is the seen behavior correct?
>
> Have we misunderstood the high availability concept? We thought this use
> case would have no impact on the current cluster state (except for the
> leader re-election).
>
> Thanks in advance for your help.
> Regards
>
> ---------------------------------------------------
>
> Our setup is the following: 3 identical Mesos nodes, each with:
> + zookeeper
> + docker 1.5
> + mesos master 0.21.1 configured in HA mode
> + mesos slave 0.21.1 configured with checkpointing, strict recovery, and
>   reconnect
> + marathon 0.8.0 configured in HA mode with checkpointing
>
> ---------------------------------------------------
>
> Command lines:
>
> *mesos-master*
> /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
> --port=5050 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19
> --ip=10.195.30.19 --quorum=2 --slave_reregister_timeout=1days
> --work_dir=/var/lib/mesos
>
> *mesos-slave*
> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
> --checkpoint --containerizers=docker,mesos
> --executor_registration_timeout=5mins --hostname=10.195.30.19
> --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>
> *marathon*
> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/marathon
> --master zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
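As a side note for anyone reproducing this setup: the masters, slaves, and Marathon in the command lines above all point at the same three-node ZooKeeper ensemble and differ only in the znode path (/mesos vs. /marathon), which is where each component registers and discovers its leader. A minimal shell sketch of how such a zk:// connection string breaks down (variable names are just for illustration):

```shell
# Split a zk:// connection string, as used in the --zk/--master flags
# above, into its ZooKeeper ensemble hosts and its znode path.
zk="zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos"
rest=${zk#zk://}     # strip the scheme
hosts=${rest%%/*}    # comma-separated host:port ensemble list
path=/${rest#*/}     # znode path where the leader is registered
echo "$hosts"        # 10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181
echo "$path"         # /mesos
```

All three ensemble members must be listed so that leader election keeps working when any single node (ZooKeeper or Mesos master) goes down, which is exactly the failover scenario being tested here.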