To date, every compute node in our HPC grid, which runs Son of Grid
Engine 8.1.9, has had a single NVIDIA GPU card.  We are contemplating
the purchase of a compute node that has multiple GPU cards in it, and
want to ensure that running jobs have access only to the GPU resources
they ask for, and don't take over all of the GPU cards in the system.

We define gpu as a resource:
qconf -sc:
#name  shortcut  type  relop  requestable  consumable  default  urgency
gpu    gpu       INT   <=     YES          YES         0

We enable persistence mode and set the compute mode to exclusive
process on each node:
nvidia-smi -pm 1
nvidia-smi -c 3

We set the number of GPUs in the host definition:
qconf -me (hostname)

For our existing nodes this is:
complex_values   gpu=1

This setup has been working fine for us.

With the new system, we would set:
complex_values   gpu=4
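
For reference, a quick way to confirm what a host is configured with is
to pull the gpu count out of the host definition.  This is only a rough
sketch: gpu_count is a made-up helper, node01 is a placeholder hostname,
and the sample text assumes complex_values appears on a single line, as
it does in our definitions.

```shell
# Hypothetical helper (not part of Grid Engine): reads text in the
# layout shown above for the host definition on stdin and prints the
# gpu count found on the complex_values line.
gpu_count() {
    sed -n 's/^complex_values.*[[:space:],]gpu=\([0-9][0-9]*\).*/\1/p'
}

# Sample input standing in for a real host definition (assumed layout,
# placeholder hostname).
printf 'hostname          node01\ncomplex_values    gpu=4\n' | gpu_count
```

On a live system the input would come from the host definition itself
rather than from printf, and a job would request one GPU with
qsub -l gpu=1.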

If a job is submitted asking for one GPU, will it be limited to a
single GPU card on the system, or can it detect the other cards and
take up all four?  If it can, how do we prevent that?

Is there something like "cgroups" for GPUs?
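
To make the question concrete, the kind of restriction we have in mind
is sketched below.  It assumes the application honors the
CUDA_VISIBLE_DEVICES environment variable, which a job is free to
override, so it is a convention rather than the enforcement we are
hoping for.  pick_gpus and the device indices are made up for
illustration; nothing here comes from Grid Engine itself.

```shell
# Hypothetical helper: given the number of GPUs a job asked for,
# followed by the indices of the devices that are currently free,
# print a comma-separated list suitable for CUDA_VISIBLE_DEVICES.
# Returns 1 (and prints nothing) if too few devices are free.
pick_gpus() {
    want=$1
    shift
    [ "$#" -ge "$want" ] || return 1
    picked=""
    i=0
    for dev in "$@"; do
        [ "$i" -lt "$want" ] || break
        picked="${picked:+$picked,}$dev"
        i=$((i + 1))
    done
    printf '%s\n' "$picked"
}

# Example with made-up values: the job asked for 1 GPU and devices
# 2 and 3 are free, so only device 2 is exposed to the job.
CUDA_VISIBLE_DEVICES=$(pick_gpus 1 2 3)
export CUDA_VISIBLE_DEVICES
echo "$CUDA_VISIBLE_DEVICES"
```

What we do not know is how the list of free devices would be tracked
per job on the Grid Engine side, which is really the heart of the
question.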


