Looping in Connor and Dario.

On 21 January 2015 at 17:21, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> Hm.. I'm not sure if any of the Marathon developers are on this list.
>
> They have a mailing list here:
> https://groups.google.com/forum/?hl=en#!forum/marathon-framework
>
> On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral <a.k...@bobek.cz> wrote:
>
>> Hi all,
>>
>> First of all, thank you for all the hard work on Mesos and related
>> projects. We are running a fairly small Mesos/Marathon cluster
>> (3 masters + 9 slaves + 3 ZK nodes). All servers are hosted at
>> http://www.hetzner.de/, which means we sometimes face network issues,
>> frequently caused by DDoS attacks against other servers in the
>> datacenter.
>>
>> When that happens, we have huge problems with our Marathon
>> installation. The typical behavior is that Marathon abandons its
>> tasks: it reports a lower number of tasks running (frequently 0) than
>> requested by the scaling configuration. It then tries to scale up,
>> which fails because the workers are occupied with the previous jobs,
>> which are still correctly reported in Mesos.
>>
>> We have not been able to pinpoint anything helpful in the Marathon
>> log files. We have tried running with 1 master as well as with
>> 3 masters; the 3-master mode actually seemed a bit worse.
>>
>> The only working solution so far is to stop everything, wipe ZK,
>> kill all jobs on Mesos, and then start all components again.
>>
>> So I would like to ask a couple of questions:
>>
>> - What is the actual use case for Marathon?
>>
>>   Is it expected to run a larger number of apps/jobs (right now we
>>   have something like 50 apps), or rather something like 5 of them,
>>   each being a Mesos framework in its own right?
>>
>> - Is there a way to tell Marathon to take ownership of currently
>>   running jobs?
>>
>>   Honestly, I am not really sure how this could work, as I possibly
>>   don't have any state information about them.
>>
>> - What command line should we run to collect helpful information for
>>   you guys to debug the problem next time?
>>
>>   As you can see, the trouble is that the breakdowns are quite
>>   random. We didn't have any problems during December, but already
>>   had about 3 total breakdowns last week.
>>
>> Thanks a lot,
>>
>> Antonin
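For reference, the full reset Antonin describes (tear down the Marathon frameworks registered with Mesos, then wipe Marathon's ZooKeeper state) could be scripted roughly as below. This is a minimal sketch, not a tested procedure: the hostnames are placeholders, /marathon is Marathon's default ZK state path but is configurable, and the Mesos master endpoint for removing a framework was /master/shutdown in releases from this era (later renamed /master/teardown), so check your version before relying on it.

# Sketch of the reset described in the quoted message, assuming the
# kazoo and requests libraries, Marathon state under /marathon in ZK,
# and a Mesos master on port 5050. Stop all Marathon instances before
# running this, then restart Mesos masters and Marathon afterwards.
import requests
from kazoo.client import KazooClient

MESOS_MASTER = "http://mesos-master-1:5050"   # hypothetical hostname
ZK_HOSTS = "zk-1:2181,zk-2:2181,zk-3:2181"    # hypothetical hostnames
MARATHON_ZK_PATH = "/marathon"                # Marathon default, configurable

def teardown_marathon_frameworks():
    """Ask the Mesos master to shut down every registered Marathon framework."""
    state = requests.get(MESOS_MASTER + "/master/state.json").json()
    for framework in state.get("frameworks", []):
        if framework["name"].lower().startswith("marathon"):
            requests.post(
                MESOS_MASTER + "/master/shutdown",  # /master/teardown on newer Mesos
                data={"frameworkId": framework["id"]},
            )

def wipe_marathon_zk_state():
    """Recursively delete Marathon's state subtree from ZooKeeper."""
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    if zk.exists(MARATHON_ZK_PATH):
        zk.delete(MARATHON_ZK_PATH, recursive=True)
    zk.stop()

if __name__ == "__main__":
    teardown_marathon_frameworks()
    wipe_marathon_zk_state()

Dumping /master/state.json before the wipe is also a cheap way to capture the diagnostics asked about in the third question, since it records which frameworks and tasks the master believed were running at the time.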