Looping in Connor and Dario.

On 21 January 2015 at 17:21, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> Hm.. I'm not sure if any of the Marathon developers are on this list.
>
> They have a mailing list here:
> https://groups.google.com/forum/?hl=en#!forum/marathon-framework
>
> On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral <a.k...@bobek.cz> wrote:
>
>> Hi all,
>>
>> First of all, thank you for all the hard work on Mesos and related
>> projects. We are running a fairly small Mesos/Marathon cluster
>> (3 masters + 9 slaves + 3 ZK nodes). All servers are hosted at
>> http://www.hetzner.de/, which means we sometimes face network issues,
>> frequently caused by DDoS attacks against other servers in the
>> datacenter.
>>
>> When that happens, we have huge problems with our Marathon
>> installation. The typical behavior is that Marathon abandons its
>> tasks: it reports a lower number of tasks running (frequently 0) than
>> requested by the scaling configuration. It then tries to scale up,
>> which fails because the workers are occupied with the previous jobs,
>> which are still correctly reported in Mesos.
>>
>> We have not been able to pinpoint anything helpful in the Marathon
>> log files. We have tried running with 1 master as well as with
>> 3 masters; the 3-master mode actually seemed a bit worse.
>>
>> The only working solution so far is to stop everything, wipe ZK,
>> kill all jobs on Mesos, and then start all components again.
>>
>> So I would like to ask a couple of questions:
>>
>> - What is the actual use case for Marathon?
>>
>>   Is it expected to run a larger number of apps/jobs (right now we
>>   have something like 50 apps), or rather something like 5 of them,
>>   each being a Mesos framework in its own right?
>>
>> - Is there a way to tell Marathon to take ownership of currently
>>   running jobs?
>>
>>   Honestly, I am not really sure how this could work, as I possibly
>>   don't have any state information about them.
>>
>> - What command line should we run to collect helpful information for
>>   you guys to debug the problem next time?
>>
>>   As you can see, the trouble is that the breakdowns are quite
>>   random. We didn't have any problems during December, but already
>>   had about 3 total breakdowns last week.
>>
>> Thanks a lot,
>>
>> Antonin
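For reference, the full reset Antonin describes (tear down the Marathon frameworks registered with Mesos, then wipe Marathon's ZooKeeper state) could be scripted roughly as below. This is a minimal sketch, not a tested procedure: the hostnames are placeholders, /marathon is Marathon's default ZK state path but is configurable, and the Mesos master endpoint for removing a framework was /master/shutdown in releases from this era (later renamed /master/teardown), so check your version before relying on it.

# Sketch of the reset described in the quoted message, assuming the
# kazoo and requests libraries, Marathon state under /marathon in ZK,
# and a Mesos master on port 5050. Stop all Marathon instances before
# running this, then restart Mesos masters and Marathon afterwards.
import requests
from kazoo.client import KazooClient

MESOS_MASTER = "http://mesos-master-1:5050"   # hypothetical hostname
ZK_HOSTS = "zk-1:2181,zk-2:2181,zk-3:2181"    # hypothetical hostnames
MARATHON_ZK_PATH = "/marathon"                # Marathon default, configurable

def teardown_marathon_frameworks():
    """Ask the Mesos master to shut down every registered Marathon framework."""
    state = requests.get(MESOS_MASTER + "/master/state.json").json()
    for framework in state.get("frameworks", []):
        if framework["name"].lower().startswith("marathon"):
            requests.post(
                MESOS_MASTER + "/master/shutdown",  # /master/teardown on newer Mesos
                data={"frameworkId": framework["id"]},
            )

def wipe_marathon_zk_state():
    """Recursively delete Marathon's state subtree from ZooKeeper."""
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    if zk.exists(MARATHON_ZK_PATH):
        zk.delete(MARATHON_ZK_PATH, recursive=True)
    zk.stop()

if __name__ == "__main__":
    teardown_marathon_frameworks()
    wipe_marathon_zk_state()

Dumping /master/state.json before the wipe is also a cheap way to capture the diagnostics asked about in the third question, since it records which frameworks and tasks the master believed were running at the time.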