We are actually working on solving #2 by adding mutual authentication
between masters and slaves, and by ensuring that each group knows in
advance which masters/slaves are valid. This ensures that no malicious
masters/slaves can join the cluster and do bad stuff. Please contact me
directly if you're interested in discussing/partnering on this work.
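
For a flavor of what such a check involves, here is a minimal, purely
illustrative sketch of a shared-secret challenge/response exchange of the
kind mutual authentication implies. It is not the actual Mesos
implementation; every name, secret, and parameter below is an assumption
made up for the example.

    import hashlib
    import hmac
    import os

    # Hypothetical credential list: the master knows in advance which
    # slave principals are valid and what secret each of them holds.
    VALID_SLAVES = {"slave-1": b"s3cret-1", "slave-2": b"s3cret-2"}

    def issue_challenge():
        """Master side: send a random nonce to the connecting slave."""
        return os.urandom(16)

    def respond(principal, secret, challenge):
        """Slave side: prove knowledge of the secret without sending it."""
        return principal, hmac.new(secret, challenge, hashlib.sha256).digest()

    def verify(principal, proof, challenge):
        """Master side: accept only slaves whose principal and proof match."""
        secret = VALID_SLAVES.get(principal)
        if secret is None:
            return False
        expected = hmac.new(secret, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, proof)

    # Example round trip:
    nonce = issue_challenge()
    who, proof = respond("slave-1", b"s3cret-1", nonce)
    assert verify(who, proof, nonce)

The same kind of check would run in the other direction, so that slaves
also refuse to register with unknown masters.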


On Tue, May 6, 2014 at 3:05 PM, Benjamin Mahler <[email protected]> wrote:

> Interesting points; I'd like to understand your two cases above:
>
> 1. a rogue job can potentially render slaves useless, and,
>
> Concretely what kinds of things are you considering here? Are you
> considering jobs that saturate non-isolated resources? Something else?
>
> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via false
> positive completions
>
> Concretely what kinds of things are you considering here? A maliciously
> constructed slave? How would these false positives be fabricated? Does
> authentication preclude this?
>
>
> On Fri, May 2, 2014 at 11:00 AM, Sharma Podila <[email protected]> wrote:
>
>> Although I am not as familiar with Marathon specifics, in general,
>>
>> 1. a rogue job can potentially render slaves useless, and,
>> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via
>> false positive completions
>>
>> A strategy that helps with #1 is to limit the number of re-launches of an
>> individual job/task upon failure. It is even better if this is done based
>> on failure rate; simple rate limiting may only delay the problem for a
>> while.
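>>
>> As a sketch only (the class and thresholds below are made up for
>> illustration, not an existing Mesos or Marathon API), a relaunch limiter
>> keyed on failure rate might look like this:
>>
>>     import time
>>     from collections import defaultdict, deque
>>
>>     # Hypothetical limiter: allow a relaunch only while a task has failed
>>     # fewer than max_failures times within the trailing window.
>>     class RelaunchLimiter:
>>         def __init__(self, max_failures=5, window_secs=600):
>>             self.max_failures = max_failures
>>             self.window_secs = window_secs
>>             self.failures = defaultdict(deque)  # task id -> failure times
>>
>>         def record_failure(self, task_id, now=None):
>>             self.failures[task_id].append(now or time.time())
>>
>>         def may_relaunch(self, task_id, now=None):
>>             now = now or time.time()
>>             recent = self.failures[task_id]
>>             # Drop failures that have aged out of the window.
>>             while recent and now - recent[0] > self.window_secs:
>>                 recent.popleft()
>>             return len(recent) < self.max_failures
>>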
>> A strategy that helps with #2 is to "disable" the slave from further
>> launches when too many failures are reported from it in a given time
>> period. This can leave many slaves disabled and reduce cluster throughput
>> (which should alert the operator), but that is still better than falsely
>> marking all jobs as completed.
>>
>> An out-of-band monitor that watches job/task lifecycle events can achieve
>> this, for example by applying a stream-processing technique over the
>> continuous event stream.
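>>
>> Purely as a sketch (the names and thresholds are again made up, not an
>> existing API), such a monitor could count per-slave failures over a
>> sliding window and flag a slave as disabled once a threshold is crossed:
>>
>>     import time
>>     from collections import defaultdict, deque
>>
>>     # Hypothetical out-of-band monitor fed from a task lifecycle event
>>     # stream; disables a slave that reports too many failures too fast.
>>     class SlaveFailureMonitor:
>>         def __init__(self, max_failures=20, window_secs=300):
>>             self.max_failures = max_failures
>>             self.window_secs = window_secs
>>             self.failures = defaultdict(deque)  # slave id -> failure times
>>             self.disabled = set()
>>
>>         def on_event(self, slave_id, state, now=None):
>>             """Feed one lifecycle event; return the set of disabled slaves."""
>>             now = now or time.time()
>>             if state in ("TASK_FAILED", "TASK_LOST"):
>>                 window = self.failures[slave_id]
>>                 window.append(now)
>>                 while window and now - window[0] > self.window_secs:
>>                     window.popleft()
>>                 if len(window) >= self.max_failures:
>>                     self.disabled.add(slave_id)  # also alert the operator
>>             return self.disabled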
>>
>> Sharma
>>
>>
>>
>> On Fri, May 2, 2014 at 10:35 AM, Dick Davies <[email protected]> wrote:
>>
>>> Not quite - looks to me like the Mesos slave disks filled with failed
>>> jobs (because Marathon continued to throw a broken .zip into them), and
>>> with /tmp on the root fs the servers became unresponsive. Tobi mentions
>>> there's a way to set that at deploy time, but in this case the guy who
>>> can't type 'hello world' correctly would have been responsible for
>>> setting the rate limits too (that's me, by the way!), so in itself
>>> that's not protection from pilot error.
>>>
>>> I'm not sure if GC was able to clear /var any better (I doubt it very
>>> much; my impression was that it runs on the order of days). I think it's
>>> more that the deploy could have been cancelled better while the system
>>> was still functioning (speculation - I'm still in the early stages of
>>> learning the internals of this).
>>>
>>> On 30 April 2014 22:08, Vinod Kone <[email protected]> wrote:
>>> > Dick, I've also briefly skimmed your original email to the Marathon
>>> > mailing list, and it sounded like executor sandboxes were not getting
>>> > garbage collected (a Mesos feature) when the slave work directory was
>>> > rooted in /tmp vs /var? Did I understand that right? If yes, I would
>>> > love to see some logs.
>>> >
>>> >
>>> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <[email protected]> wrote:
>>> >>
>>> >> In Marathon you can specify taskRateLimit (max number of tasks to
>>> >> start per second) as part of your app definition.
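>>> >>
>>> >> For illustration only, a minimal sketch of setting that field via the
>>> >> Marathon REST API. The endpoint, host, and every field other than
>>> >> taskRateLimit are assumptions made up for the example, not taken from
>>> >> this thread:
>>> >>
>>> >>     import json
>>> >>     import urllib.request
>>> >>
>>> >>     # Hypothetical app definition; taskRateLimit caps task launches
>>> >>     # per second, as described above.
>>> >>     app = {
>>> >>         "id": "hello-world",
>>> >>         "cmd": "python -m http.server",
>>> >>         "instances": 2,
>>> >>         "cpus": 0.1,
>>> >>         "mem": 64,
>>> >>         "taskRateLimit": 1.0,
>>> >>     }
>>> >>
>>> >>     # Assumed Marathon endpoint; adjust host/port for your cluster.
>>> >>     req = urllib.request.Request(
>>> >>         "http://marathon.example.com:8080/v2/apps",
>>> >>         data=json.dumps(app).encode("utf-8"),
>>> >>         headers={"Content-Type": "application/json"},
>>> >>     )
>>> >>     urllib.request.urlopen(req)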
>>> >>
>>> >>
>>> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Managed to take out a Mesos slave today with a typo while launching
>>> >>> a Marathon app, and wondered if there are throttles/limits that can
>>> >>> be applied to repeated launches to limit the risk of such mistakes
>>> >>> in the future.
>>> >>>
>>> >>> I started a thread on the Marathon list
>>> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
>>> >>>
>>> >>> [ TL;DR: Marathon throws an app that will never deploy correctly at
>>> >>> slaves until the disk fills with debris and the slave dies ]
>>> >>>
>>> >>> but I suppose this could be something available in Mesos itself.
>>> >>>
>>> >>> I can't find a lot of advice about operational aspects of Mesos
>>> >>> admin; could others here provide some good advice about their
>>> >>> experience in preventing failed task deploys from causing trouble
>>> >>> on their clusters?
>>> >>>
>>> >>> Thanks!
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
