We are actually working on solving #2 by adding mutual authentication between masters and slaves, and by ensuring that each group knows in advance which masters/slaves are valid. This way, no malicious master or slave can join the cluster and do bad stuff. Please contact me directly if you're interested in discussing/partnering on this work.
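To make the idea a bit more concrete, here's a toy Python sketch of the kind of pre-registered, shared-secret challenge/response handshake this enables. Everything below (names, the credential format, the use of HMAC-MD5) is purely illustrative and is not the actual Mesos design or implementation; in practice the handshake would also run in the other direction so slaves can verify the master too.

import hmac
import hashlib
import os

# Toy sketch only: a CRAM-MD5-style challenge/response over a pre-registered
# shared secret. This is NOT the actual Mesos code; it just illustrates how a
# master could verify a slave (and reject unknown ones) without the secret
# ever going over the wire.

def make_challenge():
    # Master side: generate a one-time challenge for the connecting slave.
    return os.urandom(16).hex()

def respond(principal, secret, challenge):
    # Slave side: prove knowledge of the shared secret.
    digest = hmac.new(secret.encode(), challenge.encode(), hashlib.md5).hexdigest()
    return principal, digest

def verify(credentials, principal, digest, challenge):
    # Master side: recompute the digest from the registered secret and compare.
    secret = credentials.get(principal)
    if secret is None:
        return False  # unknown principal -> reject the registration
    expected = hmac.new(secret.encode(), challenge.encode(), hashlib.md5).hexdigest()
    return hmac.compare_digest(expected, digest)

# Only slaves whose principal/secret pair was registered up front can join.
credentials = {"slave-west-1": "s3cret"}
challenge = make_challenge()
principal, digest = respond("slave-west-1", "s3cret", challenge)
assert verify(credentials, principal, digest, challenge)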
On Tue, May 6, 2014 at 3:05 PM, Benjamin Mahler <[email protected]> wrote:
> Interesting points, I'd like to understand your two cases above:
>
> 1. a rogue job can potentially render slaves useless, and,
>
> Concretely, what kinds of things are you considering here? Are you
> considering jobs that saturate non-isolated resources? Something else?
>
> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via false
> positive completions
>
> Concretely, what kinds of things are you considering here? A maliciously
> constructed slave? How would these false positives be fabricated? Does
> authentication preclude this?
>
>
> On Fri, May 2, 2014 at 11:00 AM, Sharma Podila <[email protected]> wrote:
>> Although I am not as familiar with Marathon specifics, in general,
>>
>> 1. a rogue job can potentially render slaves useless, and,
>> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via
>> false positive completions
>>
>> A strategy that helps with #1 is to limit the number of re-launches of
>> an individual job/task upon failure. Even better if this is done with
>> failure rate. Simple rate limiting may only delay the problem for a while.
>>
>> A strategy that helps with #2 is to "disable" the slave from further
>> launches when too many failures are reported from it in a given time
>> period. This can render many slaves disabled and reduce cluster
>> throughput (which should alert the operator), which is better than
>> falsely putting all jobs into completion state.
>>
>> An out-of-band monitor that watches job/task lifecycle events can achieve
>> this, for example, using a stream processing technique over the
>> continuous event stream.
>>
>> Sharma
>>
>>
>> On Fri, May 2, 2014 at 10:35 AM, Dick Davies <[email protected]> wrote:
>>> Not quite - looks to me like the mesos slave disks filled with failed
>>> jobs (because marathon continued to throw a broken .zip into them) and,
>>> with /tmp on the root fs, the servers became unresponsive. Tobi mentions
>>> there's a way to set that at deploy time, but in this case the guy who
>>> can't type 'hello world' correctly would have been responsible for
>>> setting the rate limits too (that's me by the way!) so in itself that's
>>> not protection from pilot error.
>>>
>>> I'm not sure if GC was able to clear /var any better (I doubt it very
>>> much, my impression was that's on the order of days). Think it's more
>>> that the deploy could be cancelled better while the system was still
>>> functioning (speculation - I'm still in the early stages of learning
>>> the internals of this).
>>>
>>> On 30 April 2014 22:08, Vinod Kone <[email protected]> wrote:
>>> > Dick, I've also briefly skimmed your original email to the marathon
>>> > mailing list and it sounded like executor sandboxes were not getting
>>> > garbage collected (a mesos feature) when the slave work directory was
>>> > rooted in /tmp vs /var? Did I understand that right? If yes, I would
>>> > love to see some logs.
>>> >
>>> >
>>> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <[email protected]> wrote:
>>> >> In Marathon you can specify taskRateLimit (max number of tasks to
>>> >> start per second) as part of your app definition.
>>> >>
>>> >>
>>> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <[email protected]> wrote:
>>> >>> Managed to take out a mesos slave today with a typo while launching
>>> >>> a marathon app, and wondered if there are throttles/limits that can
>>> >>> be applied to repeated launches to limit the risk of such mistakes
>>> >>> in the future.
>>> >>>
>>> >>> I started a thread on the marathon list
>>> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
>>> >>>
>>> >>> [ TL;DR: marathon throws an app that will never deploy correctly at
>>> >>> slaves until the disk fills with debris and the slave dies ]
>>> >>>
>>> >>> but I suppose this could be something available in mesos itself.
>>> >>>
>>> >>> I can't find a lot of advice about operational aspects of Mesos
>>> >>> admin; could others here provide some good advice about their
>>> >>> experience in preventing failed task deploys from causing trouble
>>> >>> on their clusters?
>>> >>>
>>> >>> Thanks!
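For anyone who wants to experiment with the out-of-band monitor Sharma describes above, here is a minimal sketch of the sliding-window "too many failures on one slave" check. All names, thresholds and the event format are made up for illustration; none of this is a Mesos or Marathon API.

import time
from collections import defaultdict, deque

# Illustrative only: track TASK_FAILED events per slave over a sliding time
# window and "disable" a slave (stop sending it work, alert the operator)
# once a threshold is crossed.

WINDOW_SECONDS = 300   # look at the last 5 minutes of lifecycle events
MAX_FAILURES = 10      # failures tolerated per slave within the window

failures = defaultdict(deque)   # slave_id -> timestamps of recent failures
disabled = set()

def on_task_event(slave_id, state, now=None):
    # Feed this from whatever task lifecycle / status-update stream you have.
    now = time.time() if now is None else now
    if state != "TASK_FAILED":
        return
    window = failures[slave_id]
    window.append(now)
    # Drop failures that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_FAILURES and slave_id not in disabled:
        disabled.add(slave_id)
        print("disabling %s: %d failures in %ds -- alert the operator"
              % (slave_id, len(window), WINDOW_SECONDS))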

