Maciej, I had a similar problem, but it was solved by setting the LIBPROCESS_IP environment variable to the host IP address for the Marathon process.
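A minimal sketch of what that might look like, assuming Marathon is started from a shell or wrapper script that inherits this environment. The address 10.0.0.11 and the HOST_IP variable name are hypothetical placeholders; use the routable IP of the machine in question. Where the export needs to live for your init system depends on your packaging and is not shown here.

```shell
# Hypothetical address: replace with the routable IP of this master host.
HOST_IP="10.0.0.11"

# libprocess reads LIBPROCESS_IP at startup to decide which address to bind
# to and advertise, so it must be set before the Marathon process launches.
export LIBPROCESS_IP="$HOST_IP"

# Print the value so it can be checked before (re)starting Marathon.
echo "LIBPROCESS_IP=$LIBPROCESS_IP"
```

The variable has to be present in the environment of the Marathon process itself, so Marathon needs a restart after the change.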
Nikolay

From: Maciej Strzelecki [mailto:[email protected]]
Sent: Thursday, July 16, 2015 7:30 AM
To: [email protected]
Subject: Marathon can no longer deploy any apps after a failover

Problem:

If I restart the current framework leader for Marathon (the host shown in the Active Frameworks tab of the Mesos UI), a new one is elected after a moment, and any new deployments are stuck indefinitely in the 'deploying' state (empty black bar, 0/1 and hanging; with debug-level logging I don't see any errors in the Marathon/Mesos logs).

Also, the old tasks are untouchable at that time: yes, they keep running, but I can't kill, restart, or scale them.

When that happens I can:

- stop Marathon on all masters
- remove the framework via a curl to the Mesos API /shutdown
- purge /marathon from the ZooKeeper CLI
- restart docker services on all slaves (that kills the zombie containers)
- restart mesos-slave services on all slaves (pampering my paranoia here)

Then I can deploy apps again.

How can I avoid this problem? Any basic settings I'm missing?

This is scary, as the reboot of a single master (out of 3 or 5 servers) freezes everything that is deployed using Marathon, and the steps to reclaim control introduce downtime to every single app running there.

Configuration:

Running Ubuntu 14.04.2 LTS
mesos 0.22.1-1.0.ubuntu1404
marathon 0.9.0-1.0.381.ubuntu1404
chronos 2.3.4-1.0.81.ubuntu1404

The cluster uses 3 masters and 15 slaves. The master machines also run the mesos-slave process (albeit those machines give only a portion of their resources as offers).

The configuration for Mesos/Marathon relies heavily on defaults; the options specified can be seen below. The quorum is 2.
Marathon service is run on 3 master machines

root@mesos-master1 ~ # tree /etc/marathon/
/etc/marathon/
`-- conf
    |-- event_subscriber
    |-- framework_name
    |-- hostname
    |-- logging_level
    `-- zk

1 directory, 5 files

root@mesos-master1 ~ # tree /etc/mesos
/etc/mesos
`-- zk

0 directories, 1 file

root@mesos-master1 ~ # tree /etc/mesos-slave/
/etc/mesos-slave/
|-- containerizers
|-- docker_stop_timeout
|-- executor_registration_timeout
|-- executor_shutdown_grace_period
|-- hostname
|-- ip
|-- logging_level
`-- resources

0 directories, 8 files

root@mesos-master1 ~ # tree /etc/mesos-master
/etc/mesos-master
|-- cluster
|-- hostname
|-- ip
|-- logging_level
|-- quorum
`-- work_dir

