Sounds like a Marathon issue. Mind asking on Marathon's mailing list?

On Thu, Jul 16, 2015 at 8:02 AM, Nikolay Borodachev <[email protected]> wrote:
> Maciej,
>
> I had a similar problem, but it was solved by setting the LIBPROCESS_IP
> environment variable to the host IP address for the Marathon process.
>
> Nikolay
>
> *From:* Maciej Strzelecki [mailto:[email protected]]
> *Sent:* Thursday, July 16, 2015 7:30 AM
> *To:* [email protected]
> *Subject:* Marathon can no longer deploy any apps after a failover
>
> Problem:
>
> If I restart the current framework leader for Marathon (the host shown in
> the active frameworks tab in the Mesos UI), a new one is elected after a
> moment, and any new deployments are stuck indefinitely in the 'deploying'
> state (empty black bar, 0/1 and hanging; at debug level I don't see any
> errors in the Marathon/Mesos logs).
>
> Also, the old tasks are untouchable at that time: yes, they keep running,
> but I can't kill, restart, or scale them.
>
> When that happens I can:
>
>   stop Marathon on all masters
>   remove the framework via a curl to the Mesos API /shutdown
>   purge /marathon from the ZooKeeper CLI
>   restart Docker services on all slaves (that kills the zombie containers)
>   restart mesos-slave services on all slaves (pampering my paranoia here)
>
> Then I can deploy apps again.
>
> How can I avoid this problem? Are there any basic settings I'm missing?
> This is scary, as the reboot of a single master (out of 3 or 5 servers)
> freezes everything that is deployed using Marathon, and the steps to
> reclaim control introduce downtime to every single app running there.
>
> Configuration:
>
> Running Ubuntu 14.04.2 LTS
> mesos 0.22.1-1.0.ubuntu1404
> marathon 0.9.0-1.0.381.ubuntu1404
> chronos 2.3.4-1.0.81.ubuntu1404
>
> The cluster uses 3 masters and 15 slaves. The master machines also run
> the mesos-slave process (albeit those machines offer only a portion of
> their resources).
>
> The configuration for Mesos/Marathon relies heavily on defaults; the
> options specified can be seen below. The quorum is 2.
>
> Marathon service is run on 3 master machines
>
> root@mesos-master1 ~ # tree /etc/marathon/
> /etc/marathon/
> `-- conf
>     |-- event_subscriber
>     |-- framework_name
>     |-- hostname
>     |-- logging_level
>     `-- zk
>
> 1 directory, 5 files
> root@mesos-master1 ~ # tree /etc/mesos
> /etc/mesos
> `-- zk
>
> 0 directories, 1 file
> root@mesos-master1 ~ # tree /etc/mesos-slave/
> /etc/mesos-slave/
> |-- containerizers
> |-- docker_stop_timeout
> |-- executor_registration_timeout
> |-- executor_shutdown_grace_period
> |-- hostname
> |-- ip
> |-- logging_level
> `-- resources
>
> 0 directories, 8 files
> root@mesos-master1 ~ # tree /etc/mesos-master
> /etc/mesos-master
> |-- cluster
> |-- hostname
> |-- ip
> |-- logging_level
> |-- quorum
> `-- work_dir
>
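
For reference, the LIBPROCESS_IP suggestion quoted above could be tried roughly as follows. This is only a minimal sketch, not anyone's actual setup from this thread: HOST_IP is a hypothetical variable name, 192.0.2.10 is a documentation placeholder address, and how the variable reaches the Marathon process (init script, wrapper, defaults file) depends on how Marathon was installed.

```shell
#!/bin/sh
# Sketch only: pin libprocess to one address before starting Marathon.
# HOST_IP is a hypothetical variable; 192.0.2.10 is a placeholder,
# not an address taken from this thread.
HOST_IP="${HOST_IP:-192.0.2.10}"

# libprocess reads LIBPROCESS_IP to choose the address it binds to and
# advertises to the Mesos masters.
export LIBPROCESS_IP="$HOST_IP"

# Show the value that a Marathon process started from this shell would inherit.
echo "LIBPROCESS_IP=$LIBPROCESS_IP"

# Marathon would then be started from this same environment. The exact
# command depends on the installation, so it is left out here.
```

The address should be the one the masters can reach for that particular host, so each Marathon machine would set its own value.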

