In our cluster, we've got several different types of GPUs.

Some jobs simply need any GPU, while others require a specific type.

Previously, we had "gpu" declared as a BOOLEAN attribute on each GPU node
and had the GPU type (i.e., TITANX, P100, etc.) declared as an INT attribute
set to the number of GPUs of that type on the node.

For example:

        qconf -aattr exechost complex_values gpu=TRUE,TITANX=1 node1
        qconf -aattr exechost complex_values gpu=TRUE,TITANX=1 node2
        qconf -aattr exechost complex_values gpu=TRUE,P100=2 node3
        qconf -aattr exechost complex_values gpu=TRUE,P40=1 node4

A user could submit:
        qsub -l gpu myjob
and it could run on any of the nodes, or a user could run:
        qsub -l TITANX=1 myjob
and it could run on node1 or node2.

However, this led to over-subscription: the 'gpu' BOOLEAN isn't a
consumable resource, so any number of "-l gpu" jobs could be scheduled
onto the same node.

I'm considering changing "gpu" to an INT (set to the number of GPUs per node),
making it a consumable resource, and updating our JSV (in Perl) so that
if the job is submitted as

        qsub -l gpu foobar

it will be altered to the equivalent of

        qsub -l gpu=1 foobar

to keep things easy for users.
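
Roughly, the rewrite I have in mind looks like the sketch below. It is
plain shell rather than our Perl JSV, it only shows the string handling on
a comma-separated "-l" request list, and the function name
normalize_gpu_request is made up for illustration. I'm also assuming a
bare "gpu" might reach the JSV either as "gpu" or as "gpu=TRUE", so it
handles both; the calls that read and write the job's resource list in a
real JSV are not shown.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): rewrite a bare "gpu"
# request to "gpu=1" and leave every other request untouched.
# The body runs in a subshell so the IFS and globbing changes stay local.
normalize_gpu_request() (
    set -f      # do not glob-expand values such as h=node*
    IFS=,       # split the request list on commas
    out=""
    for item in $1; do
        case "$item" in
            # Assumption: a bare request may show up in any of these forms.
            gpu|gpu=TRUE|gpu=true) item="gpu=1" ;;
        esac
        out="${out:+$out,}$item"
    done
    printf '%s\n' "$out"
)

normalize_gpu_request "gpu"               # prints gpu=1
normalize_gpu_request "gpu,h_rt=3600"     # prints gpu=1,h_rt=3600
normalize_gpu_request "gpu=2,TITANX=1"    # prints gpu=2,TITANX=1
```

An explicit count such as gpu=2 passes through unchanged, so only the
old boolean-style requests get rewritten.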

Any suggestions about this plan?

Thanks,

Mark