Hmm, I'm not sure if any of the Marathon developers are on this list. They have their own mailing list here: https://groups.google.com/forum/?hl=en#!forum/marathon-framework
On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral <[email protected]> wrote:
> Hi all,
>
> first of all, thank you for all the hard work on Mesos and related stuff.
> We are running a fairly small Mesos/Marathon cluster (3 masters + 9
> slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/ .
> This means that we sometimes face network issues, frequently
> caused by a DDoS attack running against other servers in the datacenter.
>
> We are then facing huge problems with our Marathon installation. Typical
> behavior would be that Marathon will abandon the tasks. So it will
> report that a lower number of tasks is running (frequently 0) than
> requested with scaling. So it will try to scale up, which will fail as
> workers are occupied with previous jobs, which are correctly reported in
> Mesos.
>
> We have not been able to pinpoint anything helpful in the log files of
> Marathon. We have tried running in 1 master as well as 3 masters modes.
> 3 node mode actually seemed a bit worse.
>
> The only working solution so far is to stop everything, wipe ZK, kill
> all jobs on Mesos, and then start all components again.
>
> So I would like to ask a couple of questions:
>
> - What is the actual use-case for Marathon?
>
>   Is it expected to have a larger number of apps/jobs (right now we have
>   something like 50 apps) or rather to have like 5 of them, which are
>   Mesos frameworks?
>
> - Is there a way to tell Marathon to take ownership of currently
>   running jobs?
>
>   Honestly, not really sure how this could work as I possibly don't
>   have any state information about them.
>
> - What should be the command line to get some helpful information for
>   you guys to debug the problem next time?
>
>   As you can see, the problem is that the failures are quite random. We
>   didn't have any problem during December, but already had like 3
>   total breakdowns last week.
>
> Thanks a lot,
>
> Antonin
>

