Not quite - it looks to me like the Mesos slave disks filled up with failed
jobs (because Marathon kept throwing a broken .zip at them), and with /tmp
on the root fs the servers became unresponsive. Tobi mentions there's a way
to set a rate limit at deploy time, but in this case the guy who can't type
'hello world' correctly would also have been responsible for setting the
rate limits (that's me, by the way!), so in itself that's not protection
against pilot error.
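
For reference, I take it Tobi's taskRateLimit goes straight into the app
definition JSON you POST to Marathon - roughly like this (the other fields
are just illustrative, and the exact set varies by Marathon version):

    {
      "id": "myapp",
      "cmd": "./start.sh",
      "instances": 2,
      "cpus": 0.5,
      "mem": 64,
      "uris": ["http://repo.example.com/myapp.zip"],
      "taskRateLimit": 1.0
    }

which should cap launches at about one task per second for that app, so a
broken artifact at least fills the disk more slowly.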

I'm not sure GC would have cleared /var any better than /tmp (I doubt it
very much - my impression is that it runs on the order of days). I think
it's more that the deploy could have been cancelled while the system was
still functioning (speculation - I'm still in the early stages of learning
the internals here).
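
On the Mesos side, I gather the relevant knobs are on the slave itself -
moving the work directory off the root fs and tightening the GC delay.
Something along these lines (flag names from memory, so check them against
your Mesos version):

    mesos-slave --master=zk://zk1:2181/mesos \
                --work_dir=/var/lib/mesos \
                --gc_delay=1days

The default gc_delay is apparently about a week, which squares with my
'order of days' impression - so even with GC working, a fast enough crash
loop can fill the disk long before anything gets cleaned up.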

On 30 April 2014 22:08, Vinod Kone <vinodk...@gmail.com> wrote:
> Dick, I've also briefly skimmed your original email to the marathon mailing
> list, and it sounded like executor sandboxes were not getting garbage
> collected (a Mesos feature) when the slave work directory was rooted in /tmp
> vs /var? Did I understand that right? If yes, I would love to see some logs.
>
>
> On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <t...@knaup.me> wrote:
>>
>> In Marathon you can specify taskRateLimit (max number of tasks to start
>> per second) as part of your app definition.
>>
>>
>> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <d...@hellooperator.net>
>> wrote:
>>>
>>> Managed to take out a mesos slave today with a typo while launching
>>> a marathon app, and wondered if there are throttles/limits that can be
>>> applied to repeated launches to limit the risk of such mistakes in the
>>> future.
>>>
>>> I started a thread on the marathon list
>>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
>>>
>>> [ TL;DR: marathon throws an app that will never deploy correctly at
>>> slaves until the disk fills with debris and the slave dies ]
>>>
>>> but I suppose this could be something available in mesos itself.
>>>
>>> I can't find a lot of advice about operational aspects of Mesos admin;
>>> could others here provide some good advice about their experience in
>>> preventing failed task deploys from causing trouble on their clusters?
>>>
>>> Thanks!
>>
>>
>