Re: Task Checkpointing with Mesos, Marathon and Docker containers

Connor Doyle Tue, 25 Nov 2014 09:32:29 -0800

Hi Geoffroy,

In all released versions of Marathon, you must supply the --checkpoint flag 
to turn on task checkpointing for the framework.  We've changed the default 
to true starting with the next release.


There is a bug in Mesos where the FrameworkInfo does not get updated when a 
framework re-registers.  This means that if you shut down Marathon and restart 
it with --checkpoint, it re-registers with the same FrameworkId (which 
Marathon picks up from ZK), and the Mesos master ignores the new setting.  
For reference, here is the design doc to address that: 
https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info

Fortunately, there is an easy workaround.

1) Shut down Marathon (tasks keep running)
2) Restart the leading Mesos master (tasks keep running)
3) Start Marathon with --checkpoint enabled

This works because restarting the master clears its in-memory state, 
including the stale FrameworkInfo.  The state is rebuilt as the slave nodes 
and frameworks re-register, so Marathon's new setting is picked up.
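
To make the reasoning concrete, here is a toy model of the behavior described 
above.  It is only an illustration and not Mesos code: the ToyMaster class, 
its methods, and the "marathon-id" value are all made up for this sketch.  
It assumes nothing beyond what is said above, namely that a re-registration 
under a known FrameworkId leaves the stored info untouched, and that a master 
restart empties the in-memory state.

```python
# Toy model only: NOT Mesos code. All names here are hypothetical.
class ToyMaster:
    def __init__(self):
        # framework_id -> info dict, held only in memory
        self.frameworks = {}

    def register(self, framework_id, info):
        # Models the bug: a known framework id keeps its first-seen info,
        # so a changed setting on re-registration is ignored.
        if framework_id not in self.frameworks:
            self.frameworks[framework_id] = dict(info)
        return self.frameworks[framework_id]

    def restart(self):
        # Models a master restart: the in-memory state is lost.
        self.frameworks.clear()


master = ToyMaster()

# Marathon first registered without checkpointing.
master.register("marathon-id", {"checkpoint": False})

# Restarting only Marathon with --checkpoint: the new setting is ignored.
before = master.register("marathon-id", {"checkpoint": True})["checkpoint"]

# Workaround: restart the master, then let Marathon re-register.
master.restart()
after = master.register("marathon-id", {"checkpoint": True})["checkpoint"]

print(before, after)  # prints: False True
```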

Please report back if this doesn't solve the issue for you.
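
If you want to check whether the master picked up the new setting, one option 
is to inspect the master's state output.  The sketch below is an assumption 
on my part rather than a documented procedure: it assumes the state JSON has 
a "frameworks" list whose entries carry "name" and "checkpoint" fields, so 
please verify that against your Mesos version.  The sample data is 
hand-written, not real output.

```python
import json


def frameworks_without_checkpoint(state):
    """Return the names of frameworks whose 'checkpoint' flag is not true.

    `state` is the parsed state JSON of the master (field names assumed).
    """
    return [
        fw.get("name", fw.get("id", "?"))
        for fw in state.get("frameworks", [])
        if not fw.get("checkpoint", False)
    ]


# Hypothetical, hand-written sample; replace with the JSON from your master.
sample = json.loads("""
{"frameworks": [
  {"id": "fw-1", "name": "marathon", "checkpoint": false},
  {"id": "fw-2", "name": "other", "checkpoint": true}
]}
""")

stale = frameworks_without_checkpoint(sample)
print(stale)  # prints: ['marathon']
```

An empty list after the workaround would suggest that every registered 
framework, Marathon included, is now recorded with checkpointing enabled.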
--
Connor


> On Nov 25, 2014, at 07:43, Geoffroy Jabouley wrote:
> 
> Hello
> 
> I am currently trying to activate checkpointing for my Mesos cloud.
> 
> Starting from an application running in a Docker container on the cluster, 
> launched from Marathon, my use cases are the following:
> 
> UC1: kill the Marathon service, then restart it after 2 minutes.
> Expected: the Mesos task is still active and the Docker container is 
> running. When the Marathon service restarts, it gets its tasks back.
> 
> Result: OK
> 
> 
> UC2: kill the Mesos slave, then restart it after 2 minutes.
> Expected: the Mesos task remains active and the Docker container is 
> running. When the Mesos slave service restarts, it gets its tasks back. 
> Marathon does not show an error.
> 
> Result: the task gets status LOST when the slave is killed. The Docker 
> container is still running. Marathon detects that the application went down 
> and spawns a new one on another available Mesos slave. When the slave 
> restarts, it kills the previously running container and starts a new one. 
> So I end up with 2 applications on my cluster, one spawned by Marathon and 
> another orphaned one.
> 
> 
> Is this behavior normal? Can you please explain what I am doing wrong?
> 
> -----------------------------------------------------------------------------------------------------------
> 
> Here is the configuration I have come up with so far:
> Mesos 0.19.1 (not dockerized)
> Marathon 0.6.1 (not dockerized)
> Docker 1.3 + Deimos 0.4.2
> 
> Mesos master is started:
> /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050 
> --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=... 
> --quorum=1 --work_dir=/var/lib/mesos
> 
> Mesos slave is started:
> /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos 
> --log_dir=/var/log/mesos --checkpoint=true 
> --containerizer_path=/usr/local/bin/deimos 
> --executor_registration_timeout=5mins --hostname=... --ip=... 
> --isolation=external --recover=reconnect --recovery_timeout=120mins 
> --strict=true
> 
> Marathon is started:
> java -Xmx512m -Djava.library.path=/usr/local/lib 
> -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp 
> /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon 
> --master zk://...:2181/mesos --local_port_min 30000 --hostname ... 
> --event_subscriber http_callback --http_port 8080 --task_launch_timeout 
> 300000 --local_port_max 40000 --ha --checkpoint
> 
> 
> 
>
