Problem:
If I restart the current framework leader for Marathon (the host shown in the
Active Frameworks tab of the Mesos UI), a new leader is elected after a moment,
but any new deployments hang indefinitely in the 'deploying' state (empty black
bar, 0/1). Even at debug log level I don't see any errors in the Marathon or
Mesos logs.
The old tasks are also untouchable during that time: they keep running, but I
can't kill, restart, or scale them.
When that happens I can:
stop Marathon on all masters
remove the framework via a curl call to the Mesos API /shutdown endpoint
purge /marathon using the ZooKeeper CLI
restart the docker service on all slaves (that kills the zombie containers)
restart the mesos-slave service on all slaves (pampering my paranoia here)
After that I can deploy apps again.
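To make the order of the recovery steps above explicit, here is a minimal dry-run sketch in Python. It only builds the list of commands and does not execute anything. The host names, the framework ID, port 5050, the `/master/shutdown` path, the `zkCli.sh` invocation, and the `service` names are all assumptions for illustration and may differ on a given install.

```python
from typing import List


def recovery_plan(masters: List[str], slaves: List[str],
                  leader: str, framework_id: str) -> List[str]:
    """Return the manual recovery steps as shell command strings, in order.

    Nothing is executed here; the caller can review the plan first.
    """
    plan: List[str] = []

    # 1. Stop Marathon on every master so no instance re-registers mid-cleanup.
    for host in masters:
        plan.append(f"ssh {host} service marathon stop")

    # 2. Ask the leading Mesos master to shut the framework down.
    #    Endpoint path and port are assumptions based on the /shutdown call
    #    mentioned in the post.
    plan.append(
        f"curl -X POST http://{leader}:5050/master/shutdown "
        f"-d 'frameworkId={framework_id}'"
    )

    # 3. Purge Marathon's state from ZooKeeper (zkCli.sh location and the
    #    localhost:2181 address are assumptions).
    plan.append(f"ssh {masters[0]} zkCli.sh -server localhost:2181 rmr /marathon")

    # 4. Restart docker on all slaves to clear the zombie containers.
    for host in slaves:
        plan.append(f"ssh {host} service docker restart")

    # 5. Restart mesos-slave on all slaves.
    for host in slaves:
        plan.append(f"ssh {host} service mesos-slave restart")

    return plan
```

Printing the result of `recovery_plan(...)` with the real host lists gives a checklist that can be reviewed before anything is run by hand.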
How can I avoid this problem? Are there any basic settings I'm missing? This is
scary, as the reboot of a single master (out of 3 or 5 servers) freezes
everything that is deployed using Marathon, and the steps to reclaim control
introduce downtime to every single app running there.
Configuration:
Running Ubuntu 14.04.2 LTS
mesos 0.22.1-1.0.ubuntu1404
marathon 0.9.0-1.0.381.ubuntu1404
chronos 2.3.4-1.0.81.ubuntu1404
The cluster uses 3 masters and 15 slaves. The master machines also run the
mesos-slave process (although those machines offer only a portion of their
resources).
The Mesos/Marathon configuration relies heavily on defaults; the options I set
explicitly are listed below. The quorum is 2.
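As a quick sanity check on the quorum value, the usual majority rule can be written out as below. This is only a sketch of the arithmetic; the function names are made up for illustration.

```python
def majority_quorum(masters: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    if masters < 1:
        raise ValueError("need at least one master")
    return masters // 2 + 1


def tolerated_failures(masters: int, quorum: int) -> int:
    """How many masters can be down while a quorum is still reachable."""
    return masters - quorum
```

With 3 masters the majority is 2, which matches the quorum setting above, so losing one master should in principle still leave a working quorum. With 5 masters the majority would be 3.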
The Marathon service runs on all 3 master machines.
root@mesos-master1 ~ # tree /etc/marathon/
/etc/marathon/
`-- conf
    |-- event_subscriber
    |-- framework_name
    |-- hostname
    |-- logging_level
    `-- zk
1 directory, 5 files
root@mesos-master1 ~ # tree /etc/mesos
/etc/mesos
`-- zk
0 directories, 1 file
root@mesos-master1 ~ # tree /etc/mesos-slave/
/etc/mesos-slave/
|-- containerizers
|-- docker_stop_timeout
|-- executor_registration_timeout
|-- executor_shutdown_grace_period
|-- hostname
|-- ip
|-- logging_level
`-- resources
0 directories, 8 files
root@mesos-master1 ~ # tree /etc/mesos-master
/etc/mesos-master
|-- cluster
|-- hostname
|-- ip
|-- logging_level
|-- quorum
`-- work_dir
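The listings above show only file names, not values. Assuming the packaged init scripts turn each file in these directories into a command-line flag of the same name, with the file content as its value (my understanding of the Mesosphere packages, not something confirmed in this post), a small Python sketch like this could print the effective flags. The function name is hypothetical.

```python
from pathlib import Path
from typing import List


def flags_from_dir(conf_dir: str) -> List[str]:
    """Render each regular file in conf_dir as '--<file name>=<content>'."""
    flags: List[str] = []
    for path in sorted(Path(conf_dir).iterdir()):
        # Skip subdirectories; only plain files are treated as flags.
        if path.is_file():
            value = path.read_text().strip()
            flags.append(f"--{path.name}={value}")
    return flags
```

Calling `flags_from_dir("/etc/mesos-master")` or `flags_from_dir("/etc/marathon/conf")` on a master and printing the result would show the actual settings behind the file names listed above.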