There is a design proposal coming that will include guidance on using
GPUs and better GPU support in Mesos, so stay tuned.

Mesos supports adding arbitrary resources, e.g.

--resources="cpus(*):4;gpu(*):4"

Mesos will then manage a scalar "gpu" resource with a value of 4. This
means "gpu" scalars will be offered to frameworks, and a framework may
launch tasks / executors that are allocated some amount of "gpu". Only the
amount a task is allocated is deducted, so a task that takes 1 leaves 3
to be offered. Of course, you'll need support from Marathon for custom
resources when you define your job; I'm not sure whether that exists
currently.
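
To make the accounting concrete, here is a rough sketch of how a scalar
resource behaves. It is a toy model written for this mail, not Mesos code:
parse_scalar_resources and Agent are made-up names, and the parsing only
handles the simple "name(role):value" form shown above.

```python
def parse_scalar_resources(spec):
    """Parse a string like 'cpus(*):4;gpu(*):4' into {'cpus': 4.0, 'gpu': 4.0}."""
    resources = {}
    for item in spec.split(";"):
        name_and_role, _, value = item.partition(":")
        name = name_and_role.split("(")[0].strip()  # drop the "(role)" part
        resources[name] = float(value)
    return resources


class Agent:
    """Hypothetical stand-in for the unallocated scalars on one machine."""

    def __init__(self, spec):
        self.free = parse_scalar_resources(spec)

    def launch(self, request):
        # Refuse the task if any requested scalar exceeds what is still free.
        for name, amount in request.items():
            if self.free.get(name, 0.0) < amount:
                return False
        # Otherwise deduct only the requested amounts.
        for name, amount in request.items():
            self.free[name] -= amount
        return True


agent = Agent("cpus(*):4;gpu(*):4")
agent.launch({"cpus": 1, "gpu": 1})
print(agent.free)  # {'cpus': 3.0, 'gpu': 3.0}
```

A fifth task asking for a whole "gpu" after four have been launched would
simply not fit, rather than starting and then failing.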

Now, by default no isolation is going to take place. That may be OK for you
if you can guarantee that tasks / executors only try to use the number of
GPUs that have been allocated to them. If not, you could run an isolator
module for GPUs (e.g. one built on the devices cgroup controller, which
whitelists device access). At the current time you would have to write
one, as I'm not sure whether one has been written / published.
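
As a sketch of what such an isolator would have to compute, the snippet
below builds device entries in the "type major:minor access" syntax used by
the cgroup v1 devices controller. The NVIDIA numbers are assumptions on my
part (major 195, with /dev/nvidiaN at minor N and /dev/nvidiactl at minor
255), and device_rules is a made-up helper. It only builds strings; a real
isolator would write them to the container cgroup's devices.deny and
devices.allow files.

```python
NVIDIA_MAJOR = 195      # assumption: major number of NVIDIA character devices
NVIDIACTL_MINOR = 255   # assumption: minor number of /dev/nvidiactl


def device_rules(all_gpus, allocated):
    """Return (deny, allow) lists of devices-cgroup entries for one container.

    all_gpus:  minor numbers of every GPU on the machine, e.g. range(4)
    allocated: minor numbers of the GPUs given to this container
    """
    all_gpus, allocated = set(all_gpus), set(allocated)
    unknown = allocated - all_gpus
    if unknown:
        raise ValueError("not present on this machine: %s" % sorted(unknown))

    # Deny every GPU the container was not allocated.
    deny = ["c %d:%d rwm" % (NVIDIA_MAJOR, m) for m in sorted(all_gpus - allocated)]
    # Allow the allocated GPUs plus the shared control device.
    allow = ["c %d:%d rwm" % (NVIDIA_MAJOR, m) for m in sorted(allocated)]
    allow.append("c %d:%d rwm" % (NVIDIA_MAJOR, NVIDIACTL_MINOR))
    return deny, allow


deny, allow = device_rules(range(4), [2])
print(deny)   # ['c 195:0 rwm', 'c 195:1 rwm', 'c 195:3 rwm']
print(allow)  # ['c 195:2 rwm', 'c 195:255 rwm']
```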

You'll need to make sure your containers have access to the necessary GPU
libraries. If you are running without filesystem isolation then tasks can
just reach out of the sandbox to use the necessary libraries.
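
One quick way to check this from inside a task is to ask the dynamic linker
whether it can find the libraries. This is only a sketch: the library names
"cuda" and "nvidia-ml" are my guess at what an NVIDIA setup would need, and
missing_libraries is a made-up helper.

```python
import ctypes.util


def missing_libraries(names):
    """Return the library names the linker cannot locate in this environment."""
    return [name for name in names if ctypes.util.find_library(name) is None]


# Assumed library names; replace with whatever your jobs actually link against.
print(missing_libraries(["cuda", "nvidia-ml"]))
```

An empty list means every library was found; anything listed is not visible
to the task.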

Hope that helps,
Ben

On Thu, Jan 14, 2016 at 9:02 AM, <[email protected]> wrote:

> I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule
> the jobs to be run in the machine. Each job will use maximum 1 GPU and
> sharing 1 GPU between small jobs would be ok.
> I know Mesos does not directly support GPUs, but it seems I might use
> custom resources or attributes to do what I want. But how exactly should
> this be done?
>
> If I use --attributes="hasGpu:true", would a job be sent to the machine
> when another job is already running in the machine (and only using 1 GPU)?
> I would say all jobs requesting a machine with a hasGpu attribute would be
> sent to the machine (as long as it has free CPU and memory resources).
> Then, if a job is sent to the machine when the 4 GPUs are already busy, the
> job will fail to start, right? Could then Marathon be used to re-send the
> job after some time, until it is accepted by the machine?
>
> If I specify --resources="gpu(*):4", it is my understanding that once a
> job is sent to the machine, all 4 GPUs will become busy to the eyes of
> Mesos (even if this is not really true). If that is right, would this
> work-around work: specify 4 different resources: gpu:A, gpu:B, gpu:C and
> gpu:D; and use constraints in Marathon like this  "constraints": [["gpu",
> "LIKE", " [A-D]"]]?
>
> Cheers
>
