Out of curiosity, is isolation something that will be guaranteed by "standard" kernel mechanisms (like cgroups), or is it functionality that requires using the driver/library directly?
Internally I have some POCs where a standard set of services always runs in one cgroup, and another cgroup is set up for Mesos tasks (that was easier than writing a custom Mesos resource methodology). In such a scenario I'm curious how well Mesos and non-Mesos tasks on the same system would interact with the resource isolation.

On Mon, Jan 18, 2016 at 8:23 AM Vikrama Ditya <[email protected]> wrote:

> Clarification on isolation. Ben and we (Nvidia) are working to introduce
> GPUs as a first-class resource in Mesos.
>
> By default there is no isolation. But there will be an isolation module
> for Nvidia GPU devices, which can be linked at build time and will provide
> isolation for GPU tasks among GPU devices. Initially there will be
> device-level isolation, assuming all tasks use the same device libraries
> (hence no filesystem isolation). Our initial proposal does not expose
> details of the GPU, but subsequently more details of GPU resources
> (topology, memory, cores, bandwidth, etc.) will be exposed to do better
> job scheduling.
>
> As Ben indicated, we will very soon send out a design proposal to the
> community for comments.
>
> Regards
> --
> Vikram
>
> *From:* Benjamin Mahler [mailto:[email protected]]
> *Sent:* Saturday, January 16, 2016 4:31 PM
> *To:* [email protected]
> *Subject:* Re: Share GPU resources via attributes or as custom resources
> (INTERNAL)
>
> There is a design proposal coming that will include guidance around using
> GPUs and better GPU support in Mesos, so stay tuned.
>
> Mesos supports adding arbitrary resources, e.g.
>
> --resources=cpus(*):4;gpus(*):4
>
> Mesos will then manage a scalar "gpus" resource with a value of 4. This
> means "gpus" scalars will be offered to the framework, and the framework
> may launch tasks / executors that are allocated a "gpus" scalar. Of
> course, you'll need support from Marathon for custom resources when you
> define your job; I'm not sure if that exists currently.
>
> Now, by default no isolation is going to take place. That may be OK for
> you if you have tight control over the fact that tasks/executors only try
> to consume the number of GPUs that have been allocated to them. If not,
> you may run an isolator module for GPUs (e.g. using the device whitelist
> controller cgroup). At the current time you would have to write one, as
> I'm not sure whether one has been written / published.
>
> You'll need to make sure your containers have access to the necessary GPU
> libraries. If you are running without filesystem isolation, then tasks can
> just reach out of the sandbox to use the necessary libraries.
>
> Hope that helps,
> Ben
>
> On Thu, Jan 14, 2016 at 9:02 AM, <[email protected]> wrote:
>
> I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule
> the jobs to be run on the machine. Each job will use at most 1 GPU, and
> sharing 1 GPU between small jobs would be OK.
> I know Mesos does not directly support GPUs, but it seems I might use
> custom resources or attributes to do what I want. But how exactly should
> this be done?
>
> If I use --attributes="hasGpu:true", would a job be sent to the machine
> when another job is already running on the machine (and only using 1 GPU)?
> I would say all jobs requesting a machine with a hasGpu attribute would be
> sent to the machine (as long as it has free CPU and memory resources).
> Then, if a job is sent to the machine when the 4 GPUs are already busy,
> the job will fail to start, right? Could Marathon then be used to re-send
> the job after some time, until it is accepted by the machine?
>
> If I specify --resources="gpu(*):4", it is my understanding that once a
> job is sent to the machine, all 4 GPUs will become busy in the eyes of
> Mesos (even if this is not really true). If that is right, would this
> work-around work: specify 4 different resources: gpu:A, gpu:B, gpu:C and
> gpu:D; and use constraints in Marathon like this "constraints": [["gpu",
> "LIKE", "[A-D]"]]?
>
> Cheers
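
For anyone reading this thread later, here is a rough sketch of how a custom scalar resource such as gpus(*):4 is accounted for. It is a toy model in Python, not Mesos code: the ScalarPool class and its methods are hypothetical names made up for illustration. It assumes each task is allocated only the amount it asks for, so launching one task does not make all four units busy. It does accounting only; as noted above, nothing here isolates the actual devices.

```python
from dataclasses import dataclass, field


@dataclass
class ScalarPool:
    """Toy model of one scalar resource on one agent (hypothetical, not a Mesos API)."""
    name: str
    total: float
    allocated: dict = field(default_factory=dict)  # task id -> allocated amount

    def available(self) -> float:
        """Amount that could still be offered to a framework."""
        return self.total - sum(self.allocated.values())

    def launch(self, task_id: str, amount: float) -> bool:
        """Allocate `amount` to a new task if the unallocated remainder covers it."""
        if task_id in self.allocated or amount <= 0 or amount > self.available():
            return False
        self.allocated[task_id] = amount
        return True

    def finish(self, task_id: str) -> None:
        """Return a finished task's allocation to the pool."""
        self.allocated.pop(task_id, None)


# Assumed setup: the agent advertises a "gpus" scalar with a value of 4.
pool = ScalarPool("gpus", 4)
pool.launch("job-1", 1)      # one whole unit
pool.launch("job-2", 0.5)    # two small jobs sharing one unit's worth
pool.launch("job-3", 0.5)
print(pool.available())      # amount left over for further offers
```

In this model the three jobs consume 2 of the 4 units and the remainder stays available. A job asking for more than the remainder is simply refused until something finishes.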

