You may already know this, but this does sound similar to http://www.mail-archive.com/[email protected]/msg00885.html
There was a possible (and partial) solution in using soft limits for memory, for which a ticket was opened (rough sketch of the idea at the bottom of this message).

On Tue, Aug 12, 2014 at 1:17 PM, Thomas Petr <[email protected]> wrote:

> That solution would likely cause us more pain -- we'd still need to figure
> out an appropriate amount of resources to request for artifact downloads /
> extractions, our scheduler would need to be sophisticated enough to only
> accept offers from the same slave that the setup task ran on, and we'd need
> to manage some new shared artifact storage location outside of the
> containers. Is splitting workflows into multiple tasks like this a common
> pattern?
>
> I personally agree that tasks manually overriding cgroups limits is a
> little sketchy (and am curious how MESOS-1279 would affect this
> discussion), but I doubt that we'll be the last people to attempt something
> like this. In other words, we acknowledge we're going rogue by temporarily
> overriding the limits... are there other implications of exposing the
> container ID that you're worried about?
>
> Do you have any thoughts about my other idea (overriding the fetcher
> executable for a task)?
>
> Thanks,
> Tom
>
> On Tue, Aug 12, 2014 at 2:05 PM, Vinod Kone <[email protected]> wrote:
>
>> Thanks, Thomas, for the clarification.
>>
>> One solution you could consider would be separating the setup
>> (fetch/extract) phase and the running phase into separate Mesos tasks.
>> That way you can give the setup task the resources needed for
>> fetching/extracting, and as soon as it is done you can send a
>> TASK_FINISHED so that the resources used by that task are reclaimed by
>> Mesos. That would give you the dynamism you need. Would that work in
>> your scenario?
>>
>> Having the executor change cgroup limits behind the scenes, opaquely to
>> Mesos, seems like a recipe for problems in the future to me, since it
>> could lead to temporary over-commit of resources and affect isolation
>> across containers.
>>
>> On Tue, Aug 12, 2014 at 10:45 AM, Thomas Petr <[email protected]> wrote:
>>
>>> Hey Vinod,
>>>
>>> We're not using mesos-fetcher to download the executor -- we ensure our
>>> executor exists on the slaves beforehand (during machine provisioning,
>>> to be exact). The issue Whitney is talking about is OOMing while
>>> fetching artifacts necessary for task execution (like the JAR for a
>>> web service).
>>>
>>> Our own executor
>>> <https://github.com/HubSpot/Singularity/tree/master/SingularityExecutor>
>>> has some nice enhancements around S3 downloads and artifact caching
>>> that we don't necessarily want to lose if we switched back to using
>>> mesos-fetcher.
>>>
>>> Surfacing the container ID seems like a trivial change, but another
>>> alternative could be to allow frameworks to specify an alternative
>>> fetcher executable (perhaps in CommandInfo?).
>>>
>>> Thanks,
>>> Tom
>>>
>>> On Tue, Aug 12, 2014 at 1:09 PM, Vinod Kone <[email protected]> wrote:
>>>
>>>> Hi Whitney,
>>>>
>>>> While we could conceivably set the container ID in the environment of
>>>> the executor, I would like to understand the problem you are facing.
>>>>
>>>> The fetching and extracting of the executor is done by mesos-fetcher,
>>>> a process forked by the slave and run under the slave's cgroup.
>>>> AFAICT, this shouldn't cause an OOM in the executor. Does your
>>>> executor do more fetches/extracts once it is launched (e.g., for the
>>>> user's tasks)?
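For reference, the soft-limit idea boils down to keeping the hard cap in
memory.limit_in_bytes but writing a lower target into
memory.soft_limit_in_bytes, so that under memory pressure the kernel reclaims
from the container rather than OOM-killing it during a fetch/extract spike.
A minimal sketch of what that looks like at the cgroup (v1) level -- the
cgroup root path and the helper below are illustrative, not what the ticket
actually implements, and the real hierarchy depends on the slave's
--cgroups_root and isolation flags:

    # Rough sketch of the cgroup (v1) memory soft-limit idea -- not Mesos code.
    # Assumes the memory controller is mounted at /sys/fs/cgroup/memory and
    # that the container's cgroup lives under a "mesos/<container-id>" root.
    import os

    CGROUP_MEMORY_ROOT = "/sys/fs/cgroup/memory/mesos"  # illustrative path

    def set_memory_limits(container_id, hard_bytes, soft_bytes):
        """Keep the hard cap in place, but let the kernel treat soft_bytes as
        the reclaim target under memory pressure instead of OOM-killing the
        container as soon as usage spikes past the smaller figure."""
        cgroup = os.path.join(CGROUP_MEMORY_ROOT, container_id)
        with open(os.path.join(cgroup, "memory.limit_in_bytes"), "w") as f:
            f.write(str(hard_bytes))
        with open(os.path.join(cgroup, "memory.soft_limit_in_bytes"), "w") as f:
            f.write(str(soft_bytes))

    # e.g. a 2 GB hard ceiling with a 512 MB soft target, leaving headroom
    # for the artifact fetch/extract burst (container ID is hypothetical):
    # set_memory_limits("<container-id>", 2 * 1024**3, 512 * 1024**2)

It is only a partial answer because the burst can still hit the hard cap, but
it avoids the temporary over-commit concerns raised above about an executor
rewriting its own limits behind Mesos's back.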

