Interesting points, I'd like to understand your two cases:

> 1. a rogue job can potentially render slaves useless, and,

Concretely, what kinds of things are you considering here? Are you considering jobs that saturate non-isolated resources? Something else?

> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via false positive completions

Concretely, what kinds of things are you considering here? A maliciously constructed slave? How would these false positives be fabricated? Does authentication preclude this?


On Fri, May 2, 2014 at 11:00 AM, Sharma Podila <[email protected]> wrote:

> Although I am not as familiar with Marathon specifics, in general,
>
> 1. a rogue job can potentially render slaves useless, and,
> 2. a rogue slave (or rather a rogue executor) can blackhole jobs via false
> positive completions
>
> A strategy that helps with #1 is to limit the number of re-launches of an
> individual job/task upon failure. Even better if this is done with failure
> rate. Simple rate limiting may only delay the problem for a while.
> A strategy that helps with #2 is to "disable" the slave from further
> launches when too many failures are reported from it in a given time
> period. This can render many slaves disabled and reduce cluster throughput
> (which should alert the operator), which is better than falsely putting all
> jobs into completion state.
>
> An out-of-band monitor that watches job/task lifecycle events can achieve
> this, for example, using a stream processing technique over the continuous
> event stream.
>
> Sharma
>
>
> On Fri, May 2, 2014 at 10:35 AM, Dick Davies <[email protected]> wrote:
>
>> Not quite - looks to me like the mesos slave disks filled with failed jobs
>> (because marathon continued to throw a broken .zip into them) and, with
>> /tmp on the root fs, the servers became unresponsive. Tobi mentions
>> there's a way to set that at deploy time, but in this case the guy who
>> can't type 'hello world' correctly would have been responsible for
>> setting the rate limits too (that's me by the way!), so in itself that's
>> not protection from pilot error.
>>
>> I'm not sure if GC was able to clear /var any better (I doubt it very
>> much, my impression was that's on the order of days). Think it's more
>> that the deploy could be cancelled better while the system was still
>> functioning (speculation - I'm still in the early stages of learning the
>> internals of this).
>>
>> On 30 April 2014 22:08, Vinod Kone <[email protected]> wrote:
>> > Dick, I've also briefly skimmed your original email to the marathon
>> > mailing list and it sounded like executor sandboxes were not getting
>> > garbage collected (a mesos feature) when the slave work directory was
>> > rooted in /tmp vs /var? Did I understand that right? If yes, I would
>> > love to see some logs.
>> >
>> >
>> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <[email protected]> wrote:
>> >>
>> >> In Marathon you can specify taskRateLimit (max number of tasks to start
>> >> per second) as part of your app definition.
>> >>
>> >>
>> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <[email protected]>
>> >> wrote:
>> >>>
>> >>> Managed to take out a mesos slave today with a typo while launching
>> >>> a marathon app, and wondered if there are throttles/limits that can be
>> >>> applied to repeated launches to limit the risk of such mistakes in the
>> >>> future.
>> >>>
>> >>> I started a thread on the marathon list
>> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
>> >>>
>> >>> [ TL;DR: marathon throws an app that will never deploy correctly at
>> >>> slaves until the disk fills with debris and the slave dies ]
>> >>>
>> >>> but I suppose this could be something available in mesos itself.
>> >>>
>> >>> I can't find a lot of advice about operational aspects of Mesos admin;
>> >>> could others here provide some good advice about their experience in
>> >>> preventing failed task deploys from causing trouble on their clusters?
>> >>>
>> >>> Thanks!
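
For what it's worth, below is a rough Python sketch of the kind of out-of-band monitor Sharma describes: it consumes a stream of task status events and applies both strategies, suspending a job whose failure rate over a sliding window is too high, and disabling a slave that reports too many failures in that window. The event shape, the thresholds, and the suspend_job/disable_slave hooks are all assumptions made for illustration, not an existing Mesos or Marathon API.

import time
from collections import defaultdict, deque

# Illustrative thresholds -- assumptions, not Mesos/Marathon defaults.
MAX_JOB_FAILURES = 5      # failures allowed per job within the window
MAX_SLAVE_FAILURES = 20   # failures allowed per slave within the window
WINDOW_SECONDS = 300      # sliding window over which rates are measured


class FailureMonitor:
    """Out-of-band monitor over a stream of task lifecycle events."""

    def __init__(self):
        self.job_failures = defaultdict(deque)    # job_id   -> failure timestamps
        self.slave_failures = defaultdict(deque)  # slave_id -> failure timestamps
        self.disabled_slaves = set()

    def _prune(self, timestamps, now):
        # Drop failure timestamps that have fallen out of the sliding window.
        while timestamps and now - timestamps[0] > WINDOW_SECONDS:
            timestamps.popleft()

    def on_event(self, event):
        """Process one task status event: {'job_id', 'slave_id', 'state'}."""
        if event['state'] != 'TASK_FAILED':
            return
        now = time.time()
        job, slave = event['job_id'], event['slave_id']

        # Strategy 1: stop re-launching a job whose failure *rate* is too high.
        self.job_failures[job].append(now)
        self._prune(self.job_failures[job], now)
        if len(self.job_failures[job]) > MAX_JOB_FAILURES:
            self.suspend_job(job)

        # Strategy 2: take a slave out of rotation when it reports too many
        # failures in the window, and alert the operator.
        self.slave_failures[slave].append(now)
        self._prune(self.slave_failures[slave], now)
        if len(self.slave_failures[slave]) > MAX_SLAVE_FAILURES:
            self.disabled_slaves.add(slave)
            self.disable_slave(slave)

    def suspend_job(self, job_id):
        # Placeholder: e.g. scale the app down to 0 instances and alert.
        print("suspending job %s: failure rate exceeded" % job_id)

    def disable_slave(self, slave_id):
        # Placeholder: e.g. stop launching on that slave and alert.
        print("disabling slave %s: too many failures reported" % slave_id)

Wiring suspend_job/disable_slave up to real scheduler or operator tooling is deployment-specific; the point is just that both checks fall out of the same continuous event stream.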

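And for completeness, the taskRateLimit knob Tobi mentions is just a field in the app definition. A sketch of posting such an app to Marathon's REST API might look like the following; the URL, app contents, and values are made up for illustration, and the exact field names and schema depend on your Marathon version, so check the docs before relying on this.

import requests  # third-party HTTP client, used here only for illustration

# Hypothetical Marathon endpoint and app definition -- values are made up.
MARATHON = "http://marathon.example.com:8080"

app = {
    "id": "hello-world",
    "cmd": "python -m SimpleHTTPServer 8000",
    "instances": 4,
    "cpus": 0.25,
    "mem": 64,
    "taskRateLimit": 1.0,  # max tasks started per second, per Tobi's note
}

resp = requests.post(MARATHON + "/v2/apps", json=app)
print(resp.status_code, resp.text)

That caps how fast Marathon re-launches a crashing app, which helps with the disk-filling scenario but, as Dick notes, still leaves the limit in the hands of whoever made the typo in the first place.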
