In our cluster, we've got several different types of GPUs.
Some jobs simply need any GPU, while others require a specific type.
Previously, we had "gpu" declared as a BOOLEAN attribute on each GPU node,
and each GPU type (e.g., TITANX, P100) declared as an INT attribute
set to the number of GPUs of that type on the node.
For example:
qconf -aattr exechost complex_values gpu=TRUE,TITANX=1 node1
qconf -aattr exechost complex_values gpu=TRUE,TITANX=1 node2
qconf -aattr exechost complex_values gpu=TRUE,P100=2 node3
qconf -aattr exechost complex_values gpu=TRUE,P40=1 node4
A user could submit:
qsub -l gpu myjob
and it could run on any of the nodes, or a user could run:
qsub -l TITANX=1 myjob
and it could run on node1 or node2.
However, this led to over-subscription: the 'gpu' BOOLEAN isn't a
consumable resource, so the scheduler doesn't count GPUs already in use
when placing '-l gpu' jobs.
I'm considering changing "gpu" to an INT (set to the number of GPUs per
node), making it a consumable resource, and updating our JSV (in Perl) so
that if the job is submitted as
qsub -l gpu foobar
it will be altered to the equivalent of
qsub -l gpu=1 foobar
to keep things easy for users.
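Here is a rough sketch of the rewrite logic I have in mind, written as a plain shell function rather than Perl to keep it short. The name normalize_gpu_request is made up for illustration. I'm also assuming the resource request arrives as a comma-separated list in which the bare request may appear as either "gpu" or "gpu=TRUE"; I haven't verified which form the JSV actually sees. Hooking this into the JSV itself is not shown.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): take a comma-separated
# resource list as given to "-l" and rewrite a bare "gpu" request
# (or the assumed boolean form "gpu=TRUE") to the consumable form "gpu=1".
# All other requests, including an explicit "gpu=N", pass through unchanged.
normalize_gpu_request() {
    rest=$1
    out=
    while [ -n "$rest" ]; do
        # Peel off the first comma-separated item.
        item=${rest%%,*}
        if [ "$item" = "$rest" ]; then
            rest=                  # that was the last item
        else
            rest=${rest#*,}        # drop the item and its comma
        fi
        case $item in
            gpu|gpu=TRUE|gpu=true) item=gpu=1 ;;
        esac
        # Re-join with commas, without a leading comma on the first item.
        out=${out:+$out,}$item
    done
    printf '%s\n' "$out"
}
```

With this, normalize_gpu_request "gpu,h_vmem=4G" would give "gpu=1,h_vmem=4G", while a request such as "gpu=2" or "TITANX=1" is left alone.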
Any suggestions about this plan?
Thanks,
Mark