Hello all, thanks for your answers.
Is there a way of configuring this 75s timeout for slave reconnection? I think my problem is that, once the task status is lost:
- the Marathon framework detects the loss and starts another instance
- the mesos-slave, when restarting, detects the lost task and starts a new one
==> 2 tasks on the Mesos cluster, 2 running Docker containers, but only 1 app instance in Marathon.

So a solution would be to extend the 75s timeout (see the note after the quoted mail below). I thought my command lines for starting the cluster were fine, but they seem to be incomplete... I would like to be able to shut down a mesos-slave for maintenance without disturbing its running tasks.

2014-11-25 18:30 GMT+01:00 Connor Doyle <[email protected]>:

> Hi Geoffroy,
>
> For the Marathon instances, in all released versions of Marathon you must
> supply the --checkpoint flag to turn on task checkpointing for the
> framework. We've changed the default to true starting with the next
> release.
>
> There is a bug in Mesos where the FrameworkInfo does not get updated when
> a framework re-registers. This means that if you shut down Marathon and
> restart it with --checkpoint, the Mesos master (with the same FrameworkId,
> which Marathon picks up from ZK) will ignore the new setting. For
> reference, here is the design doc to address that:
> https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info
>
> Fortunately, there is an easy workaround.
>
> 1) Shut down Marathon (tasks keep running)
> 2) Restart the leading Mesos master (tasks keep running)
> 3) Start Marathon with --checkpoint enabled
>
> This works by clearing the Mesos master's in-memory state. It is rebuilt
> as the slave nodes and frameworks re-register.
>
> Please report back if this doesn't solve the issue for you.
> --
> Connor
>
>
> On Nov 25, 2014, at 07:43, Geoffroy Jabouley <[email protected]> wrote:
> >
> > Hello
> >
> > I am currently trying to activate checkpointing for my Mesos cloud.
> >
> > Starting from an application running in a Docker container on the
> > cluster, launched from Marathon, my use cases are the following:
> >
> > UC1: kill the Marathon service, then restart it after 2 minutes.
> > Expected: the Mesos task is still active, the Docker container is
> > running. When the Marathon service restarts, it gets back its tasks.
> >
> > Result: OK
> >
> >
> > UC2: kill the Mesos slave, then restart it after 2 minutes.
> > Expected: the Mesos task remains active, the Docker container is
> > running. When the mesos-slave service restarts, it gets back its tasks.
> > Marathon does not show an error.
> >
> > Result: the task gets status LOST when the slave is killed. The Docker
> > container is still running. Marathon detects that the application went
> > down and spawns a new one on another available Mesos slave. When the
> > slave restarts, it kills the previously running container and starts a
> > new one. So I end up with 2 applications on my cluster: one spawned by
> > Marathon, and another orphaned one.
> >
> >
> > Is this behavior normal? Can you please explain what I am doing wrong?
> >
> > -----------------------------------------------------------------------------------------------------------
> >
> > Here is the configuration I have come up with so far:
> > Mesos 0.19.1 (not dockerized)
> > Marathon 0.6.1 (not dockerized)
> > Docker 1.3 + Deimos 0.4.2
> >
> > Mesos master is started:
> > /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
> >   --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=...
> >   --quorum=1 --work_dir=/var/lib/mesos
> >
> > Mesos slave is started:
> > /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos
> >   --log_dir=/var/log/mesos --checkpoint=true
> >   --containerizer_path=/usr/local/bin/deimos
> >   --executor_registration_timeout=5mins --hostname=... --ip=...
> >   --isolation=external --recover=reconnect --recovery_timeout=120mins
> >   --strict=true
> >
> > Marathon is started:
> > java -Xmx512m -Djava.library.path=/usr/local/lib
> >   -Djava.util.logging.SimpleFormatter.format='%2$s %5$s%6$s%n' -cp
> >   /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon
> >   --master zk://...:2181/mesos --local_port_min 30000 --hostname ...
> >   --event_subscriber http_callback --http_port 8080
> >   --task_launch_timeout 300000 --local_port_max 40000 --ha --checkpoint
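PS: regarding the 75s window itself: as far as I can tell, in Mesos 0.19.1 the master-side slave health check (a ping every 15s, slave removed and its tasks reported LOST after 5 missed pings, hence 75s) is hardcoded. Newer Mesos releases expose these values as master flags, so after an upgrade something like the sketch below might extend the window (the flag names and values are from a newer master and are not available on 0.19.x, so please treat this as an assumption to verify, not a tested recipe):

/usr/local/sbin/mesos-master ... --slave_ping_timeout=30secs --max_slave_ping_timeouts=10

That would give a slave roughly 5 minutes to come back before its tasks are declared LOST. If I understand correctly, the slave-side --recovery_timeout only controls how long waiting executors keep trying to reconnect to a restarted slave, not this master-side check.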

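And for completeness, a rough sketch of how Connor's workaround could be applied with the stock init scripts (the service names marathon and mesos-master are assumptions; adjust to however your nodes are managed):

# 1) stop Marathon -- the Mesos tasks keep running
sudo service marathon stop
# 2) restart the leading master to clear its in-memory state (rebuilt as slaves and frameworks re-register)
sudo service mesos-master restart
# 3) add --checkpoint to Marathon's command line, then start it again
sudo service marathon start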
