Sometimes you need to check the ZooKeeper log, the slave log, and the master log. This is a pain point with Mesos: weird cases like this are very difficult to debug.
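As a rough starting point for going through those logs together, here is a minimal sketch that scans several log files for lines mentioning leadership or framework events. The keyword list and the `scan_logs` helper are my own assumptions for illustration, not part of Mesos, Marathon or ZooKeeper; pass whatever log paths your installation actually uses.

```python
from pathlib import Path

# Hypothetical keywords; adjust them to what your failover looks like.
KEYWORDS = ("leader", "elected", "framework", "disconnected")


def scan_logs(paths, keywords=KEYWORDS):
    """Return (path, line_number, line) for each line mentioning any keyword."""
    hits = []
    for path in paths:
        p = Path(path)
        if not p.is_file():
            continue  # skip logs that do not exist on this host
        with p.open(errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                lowered = line.lower()
                if any(k in lowered for k in keywords):
                    hits.append((str(p), number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_logs.py <log file> [<log file> ...]
    for path, number, line in scan_logs(sys.argv[1:]):
        print(f"{path}:{number}: {line}")
```

Running it against the master, slave and ZooKeeper logs from the time of the restart should at least show which component noticed the leader change and which did not.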
2015-07-16 20:29 GMT+08:00 Maciej Strzelecki <maciej.strzele...@crealytics.com>:

> Problem:
>
> If I restart the current framework leader for Marathon (the host from the
> active frameworks tab in the Mesos UI), a new one is elected after a moment
> and any new deployments are stuck indefinitely in the 'deploying' state
> (empty black bar, 0/1 and hanging; at debug level I don't see any errors in
> the Marathon/Mesos logs).
>
> Also, the old tasks are untouchable at that time: yes, they keep running,
> but I can't kill, restart or scale them.
>
> When that happens I can:
>
> stop marathon on all masters
> remove the framework via a curl to mesos api /shutdown
> purge /marathon from zookeeper cli
> restart docker services on all slaves (that kills the zombie containers)
> restart mesos-slave services on all slaves (pampering my paranoia here)
>
> then I can deploy apps again.
>
> How can I avoid this problem? Any basic settings I'm missing? This is
> scary, as the reboot of a single master (out of 3 or 5 servers) freezes
> everything that is deployed using Marathon, and the steps to reclaim
> control introduce downtime for every single app running there.
>
> Configuration:
>
> Running Ubuntu 14.04.2 LTS
> mesos 0.22.1-1.0.ubuntu1404
> marathon 0.9.0-1.0.381.ubuntu1404
> chronos 2.3.4-1.0.81.ubuntu1404
>
> The cluster uses 3 masters and 15 slaves. The master machines are also
> running the mesos-slave process (albeit those machines give only a portion
> of their resources as offerings).
>
> The configuration for Mesos/Marathon relies heavily on defaults; you can
> see the options specified below. The quorum is 2.
>
> Marathon service is run on 3 master machines
>
> root@mesos-master1 ~ # tree /etc/marathon/
> /etc/marathon/
> `-- conf
>     |-- event_subscriber
>     |-- framework_name
>     |-- hostname
>     |-- logging_level
>     `-- zk
>
> 1 directory, 5 files
> root@mesos-master1 ~ # tree /etc/mesos
> /etc/mesos
> `-- zk
>
> 0 directories, 1 file
> root@mesos-master1 ~ # tree /etc/mesos-slave/
> /etc/mesos-slave/
> |-- containerizers
> |-- docker_stop_timeout
> |-- executor_registration_timeout
> |-- executor_shutdown_grace_period
> |-- hostname
> |-- ip
> |-- logging_level
> `-- resources
>
> 0 directories, 8 files
> root@mesos-master1 ~ # tree /etc/mesos-master
> /etc/mesos-master
> |-- cluster
> |-- hostname
> |-- ip
> |-- logging_level
> |-- quorum
> `-- work_dir

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com