Thanks Niklas. Hi Antonin,

Marathon should be able to handle thousands of tasks; that is exactly what it's made for. Unfortunately, the latest release (0.7.6) has been very unstable. We fixed a lot of the bugs that caused this instability and tagged an RC for 0.8.0 yesterday:
https://github.com/mesosphere/marathon/releases/tag/v0.8.0-RC1

It would be great if you could try this RC and report whether you still see these issues. I will add the Linux packages and some information about the changes later today.

Cheers,
Dario

> On 22.01.2015, at 04:35, Niklas Nielsen <[email protected]> wrote:
>
> Looping in Connor and Dario.
>
>> On 21 January 2015 at 17:21, Benjamin Mahler <[email protected]>
>> wrote:
>> Hm.. I'm not sure if any of the Marathon developers are on this list.
>>
>> They have a mailing list here:
>> https://groups.google.com/forum/?hl=en#!forum/marathon-framework
>>
>>> On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral <[email protected]> wrote:
>>> Hi all,
>>>
>>> first of all, thank you for all the hard work on Mesos and related stuff.
>>> We are running a fairly small mesos/marathon cluster (3 masters + 9
>>> slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/ .
>>> This means that we are sometimes facing network issues, frequently
>>> caused by some DDoS attack running against other servers in the datacenters.
>>>
>>> We are then facing huge problems with our Marathon installation. Typical
>>> behavior would be that Marathon will abandon the tasks. So it will
>>> report that a lower number of tasks is running (frequently 0) than
>>> requested with scaling. So it will try to scale up, which will fail as
>>> workers are occupied with previous jobs, which are correctly reported in
>>> Mesos.
>>>
>>> We have not been able to pinpoint anything helpful in the log files of
>>> Marathon. We have tried running in 1 master as well as 3 masters modes.
>>> 3 node mode seemed actually a bit worse.
>>>
>>> The only working solution so far is to stop everything. Wipe ZK and kill
>>> all jobs on Mesos and then start all components again.
>>>
>>> So I would like to ask a couple of questions:
>>>
>>> - what is the actual use-case for Marathon?
>>>
>>> Is it expected to have a larger number of apps/jobs (right now we have
>>> something like 50 apps) or rather to have like 5 of them, which are
>>> Mesos frameworks?
>>>
>>> - Is there a way to tell Marathon to take ownership of currently
>>> running jobs?
>>>
>>> Honestly, not really sure how this could work as I possibly don't
>>> have any state information about them.
>>>
>>> - What should be the command line to get some helpful information for
>>> you guys to debug the problem next time?
>>>
>>> As you can see, the problem is that the problems are quite random. We
>>> didn't have any problem during December, but already had like 3
>>> total breakdowns last week.
>>>
>>> Thanks a lot,
>>>
>>> Antonin
>>
>

