On Wed, 14 Aug 2019 at 7:21am, Dj Merrill wrote

To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
single Nvidia GPU cards per compute node.  We are contemplating the
purchase of a single compute node that has multiple GPU cards in it, and
want to ensure that running jobs only have access to the GPU resources
they ask for, and don't take over all of the GPU cards in the system.

We use prolog and epilog scripts based on <https://github.com/kyamagu/sge-gpuprolog> to assign GPUs to jobs. It's (obviously) up to the users' scripts to honor the assignments, but it's been working for us so far.
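To make the assignment idea concrete, here is a hypothetical sketch (not the actual sge-gpuprolog code) of the core logic such a prolog script needs: given the node's GPU count and the device IDs already claimed by other jobs, pick the free IDs so the job can be handed a CUDA_VISIBLE_DEVICES value.

```shell
#!/bin/sh
# Illustrative sketch only: free_gpus, the argument layout, and the
# example counts are assumptions, not sge-gpuprolog's real interface.

free_gpus() {
    total=$1      # number of GPUs physically in the node
    used=$2       # space-separated device IDs locked by other jobs
    i=0
    out=""
    while [ "$i" -lt "$total" ]; do
        case " $used " in
            *" $i "*) ;;                    # already taken, skip it
            *) out="${out}${out:+ }$i" ;;   # free, add to the list
        esac
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# 4 GPUs on the node, devices 0 and 2 in use by other jobs:
free_gpus 4 "0 2"
```

The real scripts track in-use devices with lock files per job and export the result as CUDA_VISIBLE_DEVICES; the honor-system caveat above is because nothing stops a user's code from ignoring that variable.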

We define gpu as a resource:
qconf -sc:
#name   shortcut   type   relop   requestable   consumable   default   urgency
gpu     gpu        INT    <=      YES           YES          0         0
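With the consumable defined, it gets attached per exec host and requested per job, roughly like this (the hostname "gpunode01" and the count of 4 are placeholders, not from the thread):

```shell
# Set the consumable on the exec host to the number of GPUs in the node:
qconf -mattr exechost complex_values gpu=4 gpunode01

# Jobs request only the GPUs they need; SGE decrements the consumable
# and stops scheduling GPU jobs to the host once it hits zero:
qsub -l gpu=2 my_job.sh
```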

We *used* to run this way until we ran into what seems like a bug in SoGE 8.1.9. See <http://gridengine.org/pipermail/users/2018-April/010116.html> and the ensuing thread for details, but the short version is that SGE would insist on trying to run a job on one particular node even when there were free GPUs elsewhere. It happened so often that we had to change our approach: we now define a queue on each GPU node with the same number of slots as GPUs. It's far from a perfect system, but it's working for now.
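For reference, the workaround amounts to something like the following (queue and host names are placeholders; the slots value must match the node's GPU count):

```shell
# Create a per-node queue; in the editor that opens, bind it to the
# node and cap slots at the GPU count so at most that many GPU jobs
# can land there at once:
qconf -aq gpu-gpunode01.q
#   hostlist   gpunode01
#   slots      4
```

The trade-off is that slots now stand in for GPUs, so the queue can't distinguish a job wanting two GPUs from two jobs wanting one each without additional resource requests.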

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin