The GC algorithm should take disk utilization into account. In other words, if disk utilization is high, sandboxes will be deleted sooner than the default one week. Of course, if the disk is filling up faster than GC can react, then there might be a problem.
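For reference, a slave started along the lines below keeps sandboxes off the root fs and bounds how long they live before GC. This is only a sketch: flag names are as in the slave's --help output, and the master address and paths are placeholders, so please verify against your version.

    # Keep sandboxes out of /tmp and cap how long they are retained.
    # The effective GC delay shrinks as disk usage goes up.
    mesos-slave \
      --master=zk://zk-host:2181/mesos \
      --work_dir=/var/lib/mesos \
      --gc_delay=1days \
      --disk_watch_interval=1mins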
On Fri, May 2, 2014 at 10:35 AM, Dick Davies <[email protected]> wrote:

> Not quite - looks to me like mesos slave disks filled with failed jobs
> (because marathon continued to throw a broken .zip into them) and with
> /tmp on the root fs the servers became unresponsive. Tobi mentions
> there's a way to set that at deploy time, but in this case the guy who
> can't type 'hello world' correctly would have been responsible for
> setting the rate limits too (that's me by the way!) so in itself that's
> not protection from pilot error.
>
> I'm not sure if GC was able to clear /var any better (I doubt it very
> much, my impression was that's on the order of days). Think it's more
> the deploy could be cancelled better while the system was still
> functioning (speculation - i'm still in early stages of learning the
> internals of this).
>
> On 30 April 2014 22:08, Vinod Kone <[email protected]> wrote:
> > Dick, I've also briefly skimmed at your original email to marathon
> > mailing list and it sounded like executor sandboxes were not getting
> > garbage collected (a mesos feature) when the slave work directory was
> > rooted in /tmp vs /var? Did I understand that right? If yes, I would
> > love to see some logs.
> >
> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <[email protected]> wrote:
> >>
> >> In Marathon you can specify taskRateLimit (max number of tasks to
> >> start per second) as part of your app definition.
> >>
> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <[email protected]>
> >> wrote:
> >>>
> >>> Managed to take out a mesos slave today with a typo while launching
> >>> a marathon app, and wondered if there are throttles/limits that can
> >>> be applied to repeated launches to limit the risk of such mistakes
> >>> in the future.
> >>>
> >>> I started a thread on the marathon list
> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
> >>>
> >>> [ TL:DR: marathon throws an app that will never deploy correctly at
> >>> slaves until the disk fills with debris and the slave dies ]
> >>>
> >>> but I suppose this could be something available in mesos itself.
> >>>
> >>> I can't find a lot of advice about operational aspects of Mesos
> >>> admin; could others here provide some good advice about their
> >>> experience in preventing failed task deploys from causing trouble
> >>> on their clusters?
> >>>
> >>> Thanks!
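And for anyone skimming the thread later: the taskRateLimit Tobi mentions goes directly into the app definition JSON that you POST to Marathon. A rough sketch, assuming the /v2/apps endpoint; the host, app id, and every field other than taskRateLimit are placeholders, and the exact field set may differ across Marathon versions.

    # Launch a throwaway app capped at 1 task start per second.
    # "marathon-host" and the app definition itself are placeholders.
    curl -X POST http://marathon-host:8080/v2/apps \
      -H 'Content-Type: application/json' \
      -d '{
            "id": "hello-world",
            "cmd": "python -m SimpleHTTPServer 8080",
            "instances": 2,
            "cpus": 0.25,
            "mem": 64,
            "taskRateLimit": 1.0
          }'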

