This is certainly not the expected/desired behavior when failing over a
mesos master in HA mode. In addition to the master logs Alex requested, can
you also provide relevant portions of the slave logs for these tasks? If
the slave processes themselves never failed over, checkpointing and slave
recovery should be irrelevant. Are you running the mesos-slave itself
inside a Docker container, or any other non-traditional setup?
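
If it helps to narrow things down, something along these lines should pull
out the interesting portions (this assumes the stock packaging where glog
output ends up under /var/log/mesos; adjust for your --log_dir, and the
grep patterns are only illustrative):

# Slave-side view around the failover (re-registration, status updates):
grep -iE 'master|register|task' /var/log/mesos/mesos-slave.INFO | tail -n 200

# Master-side view from the node that took over leadership:
grep -iE 'elected|register|task' /var/log/mesos/mesos-master.INFO | tail -n 200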

FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
defaults to "reconnect", and --strict defaults to true, so none of those
are necessary.
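
For instance, the slave invocation you pasted below should behave the same
with those three flags dropped (just a sketch, same values otherwise):

/usr/sbin/mesos-slave \
  --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins \
  --hostname=10.195.30.19 --ip=10.195.30.19 \
  --isolation=cgroups/cpu,cgroups/mem \
  --recovery_timeout=120mins \
  --resources='ports:[31000-32000,80,443]'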

On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <[email protected]> wrote:

> Geoffroy,
>
> could you please provide master logs (both from the killed master and
> from the one taking over)?
>
> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
> [email protected]> wrote:
>
>> Hello
>>
>> we are facing some unexpected issues when testing the high availability
>> behavior of our mesos cluster.
>>
>> *Our use case:*
>>
>> *State*: the mesos cluster is up (3 machines), 1 docker task is running
>> on each slave (started from marathon)
>>
>> *Action*: stop the mesos master leader process
>>
>> *Expected*: mesos master leader has changed, *active tasks remain
>> unchanged*
>>
>> *Seen*: mesos master leader has changed, *all active tasks are now
>> FAILED but the docker containers are still running*; marathon detects the
>> FAILED tasks and starts new ones. We end up with 2 docker containers
>> running on each machine, but only one is linked to a RUNNING mesos task.
>>
>>
>> Is the behavior we are seeing expected?
>>
>> Have we misunderstood the high availability concept? We thought that
>> this scenario would have no impact on the current cluster state (apart
>> from the leader re-election).
>>
>> Thanks in advance for your help
>> Regards
>>
>> ---------------------------------------------------
>>
>> our setup is the following:
>> 3 identical mesos nodes with:
>>     + zookeeper
>>     + docker 1.5
>>     + mesos master 0.21.1 configured in HA mode
>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>> reconnect
>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>
>> ---------------------------------------------------
>>
>> Command lines:
>>
>>
>> *mesos-master*
>> /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>
>> *mesos-slave*
>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
>> 10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
>> --executor_registration_timeout=5mins --hostname=10.195.30.19
>> --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>> --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
>>
>> *marathon*
>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>> --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>
>
>
