Thanks for the guidelines! I'll try these paths out and join the marathon mailing list (was oblivious there was one ;))

Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: [email protected]
www.crealytics.com
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology
Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany
Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466

________________________________
From: Vinod Kone <[email protected]>
Sent: Thursday, July 16, 2015 7:09 PM
To: [email protected]
Subject: Re: Marathon can no longer deploy any apps after a failover

Sounds like a marathon issue. Mind asking in marathon's mailing list?

On Thu, Jul 16, 2015 at 8:02 AM, Nikolay Borodachev <[email protected]> wrote:

Maciej,

I had a similar problem, but it got solved by setting the LIBPROCESS_IP environment variable to the host IP address for the Marathon process.

Nikolay

From: Maciej Strzelecki [mailto:[email protected]]
Sent: Thursday, July 16, 2015 7:30 AM
To: [email protected]
Subject: Marathon can no longer deploy any apps after a failover

Problem:

If I restart the current framework leader for marathon (the host from the active frameworks tab in the mesos UI), a new one is elected after a moment, and any new deployments are stuck indefinitely in the 'deploying' state (empty black bar, 0/1 and hanging; with debug level I don't see any errors in the marathon/mesos logs).

Also, the old tasks are untouchable at that time: yes, they keep running, but I can't kill, restart, or scale them.

When that happens I can:

- stop marathon on all masters
- remove the framework via a curl to the mesos API /shutdown
- purge /marathon from the zookeeper CLI
- restart docker services on all slaves (that kills the zombie containers)
- restart mesos-slave services on all slaves (pampering my paranoia here)

Then I can deploy apps again.

How can I avoid this problem? Any basic settings I'm missing?

This is scary, as the reboot of a single master (out of 3 or 5 servers) freezes everything that is deployed using marathon, and the steps to reclaim control introduce downtime to every single app running there.

Configuration:

Running Ubuntu 14.04.2 LTS
mesos 0.22.1-1.0.ubuntu1404
marathon 0.9.0-1.0.381.ubuntu1404
chronos 2.3.4-1.0.81.ubuntu1404

The cluster uses 3 masters and 15 slaves. The master machines also run the mesos-slave process (albeit those machines give only a portion of their resources as offers).

The configuration for mesos/marathon relies mostly on defaults; the options I set are listed below. The quorum is 2.

The marathon service runs on the 3 master machines.

root@mesos-master1 ~ # tree /etc/marathon/
/etc/marathon/
`-- conf
    |-- event_subscriber
    |-- framework_name
    |-- hostname
    |-- logging_level
    `-- zk

1 directory, 5 files

root@mesos-master1 ~ # tree /etc/mesos
/etc/mesos
`-- zk

0 directories, 1 file

root@mesos-master1 ~ # tree /etc/mesos-slave/
/etc/mesos-slave/
|-- containerizers
|-- docker_stop_timeout
|-- executor_registration_timeout
|-- executor_shutdown_grace_period
|-- hostname
|-- ip
|-- logging_level
`-- resources

0 directories, 8 files

root@mesos-master1 ~ # tree /etc/mesos-master
/etc/mesos-master
|-- cluster
|-- hostname
|-- ip
|-- logging_level
|-- quorum
`-- work_dir
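
For readers who want to see the manual recovery steps from the message above written out, here is a rough dry-run sketch that only prints the commands rather than executing them. Everything in it is an assumption for illustration and not taken from the thread: the function name recovery_plan, its three arguments, the service names (assumed to match the Ubuntu packages), zkCli.sh being on the PATH, and the exact endpoint path /master/shutdown on port 5050 (the message only says "/shutdown"; newer Mesos releases may name this endpoint differently).

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands, does not run them.
# recovery_plan and its arguments are hypothetical names for this example.

# Usage: recovery_plan MASTER_HOST:PORT FRAMEWORK_ID ZK_HOST:PORT
recovery_plan() {
    master="$1"
    framework_id="$2"
    zk="$3"

    # 1. On every master: stop marathon.
    echo "service marathon stop"
    # 2. Remove the framework through the mesos master API
    #    (endpoint path is an assumption, see the note above).
    echo "curl -X POST -d frameworkId=${framework_id} http://${master}/master/shutdown"
    # 3. Purge marathon's state from zookeeper.
    echo "zkCli.sh -server ${zk} rmr /marathon"
    # 4. On every slave: restart docker (kills the zombie containers).
    echo "service docker restart"
    # 5. On every slave: restart mesos-slave.
    echo "service mesos-slave restart"
}

# Print the plan only when all three arguments are supplied.
if [ "$#" -eq 3 ]; then
    recovery_plan "$@"
fi
```

Saved as a script and called with, for example, mesos-master1:5050, a framework ID copied from the mesos UI, and mesos-master1:2181, it prints the five commands in order so they can be reviewed before anything is run by hand. Printing instead of executing is deliberate here, since every one of these steps causes downtime.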

