Re: Marathon stability and use-case

Benjamin Mahler Wed, 21 Jan 2015 17:23:16 -0800

Hm.. I'm not sure if any of the Marathon developers are on this list.

They have a mailing list here:
https://groups.google.com/forum/?hl=en#!forum/marathon-framework


On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral <[email protected]> wrote:

> Hi all,
>
> first of all, thank you for all the hard work on Mesos and related
> projects. We are running a fairly small Mesos/Marathon cluster (3
> masters + 9 slaves + 3 ZK nodes). All servers are hosted at
> http://www.hetzner.de/ . This means that we sometimes face network
> issues, frequently caused by DDoS attacks against other servers in the
> datacenter.
>
> We then face huge problems with our Marathon installation. Typically,
> Marathon abandons its tasks: it reports that fewer tasks are running
> (frequently 0) than were requested by scaling. It then tries to scale
> up, which fails because the workers are still occupied by the previous
> tasks, which Mesos reports correctly.
>
> We have not been able to pinpoint anything helpful in the Marathon log
> files. We have tried running with 1 master as well as with 3 masters.
> The 3-master mode actually seemed a bit worse.
>
> The only working solution so far is to stop everything, wipe ZK, kill
> all jobs on Mesos, and then start all components again.
>
> So I would like to ask a couple of questions:
>
>   - What is the actual use-case for Marathon?
>
>     Is it expected to run a larger number of apps/jobs (right now we
>     have something like 50 apps), or rather something like 5 of them,
>     which are Mesos frameworks?
>
>   - Is there a way to tell Marathon to take ownership of currently
>     running jobs?
>
>     Honestly, I'm not really sure how this could work, as I possibly
>     don't have any state information about them.
>
>   - What command line should we use to collect helpful information for
>     you to debug the problem next time?
>
>     As you can see, the difficulty is that the problems are quite
>     random. We didn't have any problems during December, but already
>     had something like 3 total breakdowns last week.
>
> Thanks a lot,
>
>     Antonin
>
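
For the symptom described above (Marathon reporting fewer tasks than
Mesos is actually running), one way to gather evidence next time might be
to diff the two views of the cluster. The following is only a minimal
sketch, not a Marathon or Mesos tool. It assumes the JSON from Marathon's
/v2/tasks endpoint has a top-level "tasks" list whose entries carry an
"id", and that the Mesos master's state JSON has a "frameworks" list with
per-framework "tasks" carrying "id" and "state". The helper names and the
framework-name prefix "marathon" are hypothetical and should be adjusted
to the actual deployment.

```python
from typing import List, Set

# Mesos task states that still occupy resources on a slave.
ACTIVE_STATES = {"TASK_STAGING", "TASK_STARTING", "TASK_RUNNING"}


def marathon_task_ids(tasks_json: dict) -> Set[str]:
    """Task IDs Marathon believes it owns (assumed /v2/tasks layout)."""
    return {task["id"] for task in tasks_json.get("tasks", [])}


def mesos_task_ids(state_json: dict, name_prefix: str = "marathon") -> Set[str]:
    """Active task IDs Mesos attributes to frameworks matching name_prefix.

    The prefix match is an assumption; the registered framework name
    depends on how Marathon was started.
    """
    ids: Set[str] = set()
    for framework in state_json.get("frameworks", []):
        if not framework.get("name", "").startswith(name_prefix):
            continue
        for task in framework.get("tasks", []):
            if task.get("state") in ACTIVE_STATES:
                ids.add(task["id"])
    return ids


def abandoned_tasks(tasks_json: dict, state_json: dict,
                    name_prefix: str = "marathon") -> List[str]:
    """Tasks active in Mesos that Marathon no longer knows about."""
    known = marathon_task_ids(tasks_json)
    running = mesos_task_ids(state_json, name_prefix)
    return sorted(running - known)
```

Both JSON documents would be fetched with any HTTP client and parsed
before calling abandoned_tasks(); a non-empty result captured during an
incident, together with the Marathon logs from the same time window,
would show exactly which tasks were dropped.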
