Hello all, thanks for your answers.
Is there a way of configuring this 75s timeout for slave reconnection? I think my problem is that, once the task status is lost:
- the Marathon framework detects the loss and starts another instance
- the mesos-slave, when restarting, detects the lost task and starts a new one
==> 2 tasks on the Mesos cluster, 2 running Docker containers, but only 1 app instance in Marathon.

So a solution would be to extend the 75s timeout (see the note after the quoted mail below). I thought my command lines for starting the cluster were fine, but they seem to be incomplete... I would like to be able to shut down a mesos-slave for maintenance without disturbing its running tasks.

2014-11-25 18:30 GMT+01:00 Connor Doyle <[email protected]>:

> Hi Geoffroy,
>
> For the Marathon instances, in all released versions of Marathon you must
> supply the --checkpoint flag to turn on task checkpointing for the
> framework. We've changed the default to true starting with the next
> release.
>
> There is a bug in Mesos where the FrameworkInfo does not get updated when
> a framework re-registers. This means that if you shut down Marathon and
> restart it with --checkpoint, the Mesos master (with the same FrameworkId,
> which Marathon picks up from ZK) will ignore the new setting. For
> reference, here is the design doc to address that:
> https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info
>
> Fortunately, there is an easy workaround.
>
> 1) Shut down Marathon (tasks keep running)
> 2) Restart the leading Mesos master (tasks keep running)
> 3) Start Marathon with --checkpoint enabled
>
> This works by clearing the Mesos master's in-memory state. It is rebuilt
> as the slave nodes and frameworks re-register.
>
> Please report back if this doesn't solve the issue for you.
> --
> Connor
>
>
> On Nov 25, 2014, at 07:43, Geoffroy Jabouley <[email protected]> wrote:
> >
> > Hello
> >
> > I am currently trying to activate checkpointing for my Mesos cloud.
> >
> > Starting from an application running in a Docker container on the
> > cluster, launched from Marathon, my use cases are the following:
> >
> > UC1: kill the Marathon service, then restart it after 2 minutes.
> > Expected: the Mesos task is still active, the Docker container is
> > running. When the Marathon service restarts, it gets back its tasks.
> >
> > Result: OK
> >
> >
> > UC2: kill the Mesos slave, then restart it after 2 minutes.
> > Expected: the Mesos task remains active, the Docker container is
> > running. When the mesos-slave service restarts, it gets back its tasks.
> > Marathon does not show an error.
> >
> > Result: the task gets status LOST when the slave is killed. The Docker
> > container is still running. Marathon detects that the application went
> > down and spawns a new one on another available Mesos slave. When the
> > slave restarts, it kills the previously running container and starts a
> > new one. So I end up with 2 applications on my cluster: one spawned by
> > Marathon, and another orphaned one.
> >
> >
> > Is this behavior normal? Can you please explain what I am doing wrong?
> >
> > -----------------------------------------------------------------------------------------------------------
> >
> > Here is the configuration I have come up with so far:
> > Mesos 0.19.1 (not dockerized)
> > Marathon 0.6.1 (not dockerized)
> > Docker 1.3 + Deimos 0.4.2
> >
> > Mesos master is started:
> > /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
> >   --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=...
> >   --quorum=1 --work_dir=/var/lib/mesos
> >
> > Mesos slave is started:
> > /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos
> >   --log_dir=/var/log/mesos --checkpoint=true
> >   --containerizer_path=/usr/local/bin/deimos
> >   --executor_registration_timeout=5mins --hostname=... --ip=...
> >   --isolation=external --recover=reconnect --recovery_timeout=120mins
> >   --strict=true
> >
> > Marathon is started:
> > java -Xmx512m -Djava.library.path=/usr/local/lib
> >   -Djava.util.logging.SimpleFormatter.format='%2$s %5$s%6$s%n' -cp
> >   /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon
> >   --master zk://...:2181/mesos --local_port_min 30000 --hostname ...
> >   --event_subscriber http_callback --http_port 8080
> >   --task_launch_timeout 300000 --local_port_max 40000 --ha --checkpoint
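PS: regarding the 75s window itself: as far as I can tell, in Mesos 0.19.1 the master-side slave health check (a ping every 15s, slave removed and its tasks reported LOST after 5 missed pings, hence 75s) is hardcoded. Newer Mesos releases expose these values as master flags, so after an upgrade something like the sketch below might extend the window (the flag names and values are from a newer master and are not available on 0.19.x, so please treat this as an assumption to verify, not a tested recipe):

/usr/local/sbin/mesos-master ... --slave_ping_timeout=30secs --max_slave_ping_timeouts=10

That would give a slave roughly 5 minutes to come back before its tasks are declared LOST. If I understand correctly, the slave-side --recovery_timeout only controls how long waiting executors keep trying to reconnect to a restarted slave, not this master-side check.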

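And for completeness, a rough sketch of how Connor's workaround could be applied with the stock init scripts (the service names marathon and mesos-master are assumptions; adjust to however your nodes are managed):

# 1) stop Marathon -- the Mesos tasks keep running
sudo service marathon stop
# 2) restart the leading master to clear its in-memory state (rebuilt as slaves and frameworks re-register)
sudo service mesos-master restart
# 3) add --checkpoint to Marathon's command line, then start it again
sudo service marathon start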
