Interesting points. I'd like to understand the two cases you describe:

1. a rogue job can potentially render slaves useless, and,

Concretely, what kinds of things are you considering here? Are you
considering jobs that saturate non-isolated resources? Something else?

2. a rogue slave (or rather a rogue executor) can blackhole jobs via false
positive completions

Concretely, what kinds of things are you considering here? A maliciously
constructed slave? How would these false positives be fabricated? Does
authentication preclude this?


On Fri, May 2, 2014 at 11:00 AM, Sharma Podila <[email protected]> wrote:

> Although I am not as familiar with Marathon specifics, in general,
>
> 1. a rogue job can potentially render slaves useless, and,
> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via false
> positive completions
>
> A strategy that helps with #1 is to limit the number of re-launches of an
> individual job/task upon failure. Even better if this is based on the
> failure rate; simple rate limiting may only delay the problem for a while.
> A strategy that helps with #2 is to "disable" the slave from further
> launches when too many failures are reported from it in a given time
> period. This can leave many slaves disabled and reduce cluster throughput
> (which should alert the operator), but that is better than falsely marking
> all jobs as completed.
>
> An out-of-band monitor that watches job/task lifecycle events can achieve
> this, for example by applying a stream-processing technique to the
> continuous event stream.
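
(To check my understanding of that monitor, here's a rough sketch of how I
picture it. Purely illustrative: the event shape, the thresholds, and the
disable_slave() hook are my own assumptions, not anything that exists in
Mesos or Marathon.)

    import time
    from collections import defaultdict, deque

    FAILURE_WINDOW_SECS = 300    # look at failures within the last 5 minutes
    MAX_FAILURES_PER_SLAVE = 20  # beyond this, stop launching on the slave

    class SlaveFailureMonitor:
        def __init__(self):
            self.failures = defaultdict(deque)  # slave_id -> failure timestamps
            self.disabled = set()

        def on_task_event(self, slave_id, task_state, now=None):
            """Feed task lifecycle events (e.g. TASK_FAILED) from the event stream."""
            if task_state != "TASK_FAILED":
                return
            now = now or time.time()
            window = self.failures[slave_id]
            window.append(now)
            # Drop failures that have aged out of the time window.
            while window and now - window[0] > FAILURE_WINDOW_SECS:
                window.popleft()
            if len(window) > MAX_FAILURES_PER_SLAVE and slave_id not in self.disabled:
                self.disabled.add(slave_id)
                self.disable_slave(slave_id)

        def disable_slave(self, slave_id):
            # Placeholder: in practice this would alert the operator and/or
            # stop launching further tasks on the slave.
            print("disabling slave %s after repeated task failures" % slave_id)
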
>
> Sharma
>
>
>
> On Fri, May 2, 2014 at 10:35 AM, Dick Davies <[email protected]> wrote:
>
>> Not quite - looks to me like the Mesos slave disks filled up with failed
>> jobs (because Marathon kept throwing a broken .zip at them), and with
>> /tmp on the root fs the servers became unresponsive. Tobi mentions
>> there's a way to set that at deploy time, but in this case the guy who
>> can't type 'hello world' correctly would have been responsible for
>> setting the rate limits too (that's me, by the way!), so in itself
>> that's not protection from pilot error.
>>
>> I'm not sure GC would have cleared /var any better (I doubt it very
>> much; my impression is that it runs on the order of days). I think it's
>> more that the deploy could have been cancelled cleanly while the system
>> was still functioning (speculation - I'm still in the early stages of
>> learning the internals of this).
>>
>> On 30 April 2014 22:08, Vinod Kone <[email protected]> wrote:
>> > Dick, I also briefly skimmed your original email to the Marathon
>> > mailing list, and it sounded like executor sandboxes were not getting
>> > garbage collected (a Mesos feature) when the slave work directory was
>> > rooted in /tmp vs /var? Did I understand that right? If yes, I would
>> > love to see some logs.
>> >
>> >
>> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <[email protected]> wrote:
>> >>
>> >> In Marathon you can specify taskRateLimit (max number of tasks to start
>> >> per second) as part of your app definition.
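
(For anyone else following along: my understanding is that this would look
roughly like the sketch below when posting an app definition to Marathon.
Treat it as illustrative only; the host and every field value other than
taskRateLimit are placeholders I made up.)

    import json
    import urllib.request

    # Hypothetical app definition; taskRateLimit is the only field of interest here.
    app = {
        "id": "my-app",
        "cmd": "python -m http.server 8000",
        "instances": 4,
        "cpus": 0.5,
        "mem": 128,
        "taskRateLimit": 1,  # launch at most one task per second
    }

    req = urllib.request.Request(
        "http://marathon.example.com:8080/v2/apps",  # placeholder Marathon endpoint
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
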
>> >>
>> >>
>> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <[email protected]>
>> >> wrote:
>> >>>
>> >>> Managed to take out a Mesos slave today with a typo while launching
>> >>> a Marathon app, and wondered if there are throttles/limits that can be
>> >>> applied to repeated launches to limit the risk of such mistakes in the
>> >>> future.
>> >>>
>> >>> I started a thread on the Marathon list
>> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
>> >>>
>> >>> [ TL;DR: Marathon keeps throwing an app that will never deploy
>> >>> correctly at slaves until the disk fills with debris and the slave
>> >>> dies ]
>> >>>
>> >>> but I suppose this could be something available in mesos itself.
>> >>>
>> >>> I can't find a lot of advice on the operational aspects of running
>> >>> Mesos; could others here share their experience with preventing
>> >>> failed task deploys from causing trouble on their clusters?
>> >>>
>> >>> Thanks!
>> >>
>> >>
>> >
>>
>
>
