I've been using Slurm on a traditional CPU compute cluster, but am now looking at a somewhat different issue. We recently purchased a single machine with 10 high-end graphics cards to be used for CUDA calculations, which will be shared among a couple of different user groups.

Does it make sense to use Slurm for scheduling in this case? We'll want to do things like limit the number of GPUs any one user can use and manage resource contention the same way one would for a cluster. Presumably this would mean running slurmctld and slurmd on the same host?

Bonus question: these research groups (they do roughly the same kind of work) also have a pool of GPU workstations they're going to share. It would be super cool if we could somehow rope the workstations into the resource pool whenever no one is working at the console. Because some of this work involves steps with interactive components, the understanding would be that all of a workstation's resources go to the console user whenever there is one.
