A similar question has been asked before (not by me), without an answer:
https://groups.google.com/forum/?hl=en#!topic/slurm-devel/4xkvs0dgYu8

Specifically - suppose I have a GPU cluster with 2 GPUs per node, where some GPUs might or might not function correctly (due to heat/fw issues/malfunction/...), so some nodes might present only 1 GPU, and others no GPU at all.

Using a gres.conf file with device nodes allows slurm to bind devices to jobs. The question is - does slurm also use the dev files to track the availability of the cards?
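For context, the kind of gres.conf I have in mind binds each GPU to its device node, something like this (the /dev/nvidia* paths are just the usual NVIDIA device nodes, adjust for your hardware):

```
# gres.conf on a healthy 2-GPU node
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```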

I do not wish to drain any nodes with failing cards - just let slurm know about this dynamically so jobs requesting gpus are properly scheduled, while other jobs can use the "bad" nodes.

My healthcheck agent on the nodes can add/remove device files for any GPU based on its thresholds.
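To illustrate what I mean, the agent's action is conceptually something like the sketch below (the rename-to-`.disabled` scheme, the temperature threshold, and the function name are all hypothetical, the real agent applies its own checks and thresholds):

```python
import os


def enforce_gpu_health(dev_path, temperature_c, max_temp_c=90):
    """Hypothetical healthcheck action: hide a GPU's device file when the
    card exceeds a temperature threshold, restore it when it recovers.

    dev_path, the threshold, and the rename scheme are illustrative only.
    Returns True if the device file is present (card in service).
    """
    hidden = dev_path + ".disabled"
    if temperature_c > max_temp_c and os.path.exists(dev_path):
        os.rename(dev_path, hidden)   # take the card out of service
    elif temperature_c <= max_temp_c and os.path.exists(hidden):
        os.rename(hidden, dev_path)   # card recovered, restore it
    return os.path.exists(dev_path)
```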

Based on the above I would expect the following 4 configuration considerations:

1. gres.conf statically holds the "optimal" gpu deployment (assume all is well)
2. slurm.conf GresTypes=gpu
3. slurm.conf NodeName Gres=gpu:2 <-- This will presumably drain any node with less than 2 gpus?
4. FastSchedule=0 <-- Together with NO gres in the NodeName line, to ensure nodes do not drain needlessly.
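To make the two alternatives concrete, this is roughly what I picture in slurm.conf (node names are placeholders, and the drain semantics are my assumption, please correct me if FastSchedule behaves differently):

```
GresTypes=gpu

# Option A: declare the expected complement; if slurmd finds fewer
# devices than configured, the node presumably drains
NodeName=gpunode[01-16] Gres=gpu:2 State=UNKNOWN

# Option B: no Gres on the NodeName line, and FastSchedule=0 so the
# controller uses whatever slurmd actually reports
# FastSchedule=0
# NodeName=gpunode[01-16] State=UNKNOWN
```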

Is that correct?
Are there better solutions to dynamically track availability of resources?

Currently with LSF we are using a custom elim script to let lsf know about the availability of the resources.
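Conceptually, that elim boils down to something like the sketch below (the /dev/nvidia* pattern, the "ngpus" resource name, and the reporting interval are illustrative; the output format "<n_resources> <name> <value>" is the elim protocol as far as I recall it):

```python
#!/usr/bin/env python3
"""Sketch of an LSF elim-style reporter; names and paths are illustrative."""
import glob
import os
import sys
import time


def gpu_count(dev_dir="/dev"):
    # Count NVIDIA device nodes; assumes the healthcheck agent has
    # removed the file for any bad card.
    return len(glob.glob(os.path.join(dev_dir, "nvidia[0-9]*")))


def report_line(dev_dir="/dev"):
    # elim output, as far as I recall: "<n_resources> <name1> <value1> ..."
    return "1 ngpus %d" % gpu_count(dev_dir)


if __name__ == "__main__":
    while True:
        sys.stdout.write(report_line() + "\n")
        sys.stdout.flush()
        time.sleep(30)
```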
