A similar question has been asked before (not by me), without an answer:
https://groups.google.com/forum/?hl=en#!topic/slurm-devel/4xkvs0dgYu8
Specifically: suppose I have a gpu cluster with 2 gpus per node, where
some gpus might not function correctly (due to heat, firmware issues,
malfunction, ...), so some nodes might present only 1 gpu, and others no
gpu at all.
Using a gres.conf file with device nodes allows slurm to bind devices to
jobs.
The question is: does slurm also use these device files to track the
availability of the cards?
I do not wish to drain any nodes with failing cards - just let slurm
know about this dynamically so jobs requesting gpus are properly
scheduled, while other jobs can use the "bad" nodes.
My healthcheck agent on the nodes can add/remove device files for any
gpu based on its thresholds.
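As a rough sketch of what such an agent could report, assuming the
(hypothetical) convention that a working gpu N appears as a /dev/nvidiaN
device file - this helper just counts the files present, it is not part
of slurm:

```shell
# count_gpus: count gpu device files under a given directory
# (defaults to /dev). The nvidia[0-9]* naming is an assumption;
# adjust the pattern to whatever your healthcheck agent creates.
count_gpus() {
    devdir="${1:-/dev}"
    n=0
    for f in "$devdir"/nvidia[0-9]*; do
        # if the glob matched nothing, the literal pattern fails -e
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}
```

The agent could compare this count against the expected 2 and create or
remove device files accordingly.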
Based on the above, I would expect the following 4 configuration
considerations:
1. gres.conf statically holds the "optimal" gpu deployment (assume all
is well)
2. slurm.conf GresTypes=gpu
3. slurm.conf NodeName Gres=gpu:2 <-- This will presumably drain any
node with fewer than 2 gpus?
4. FastSchedule=0 <-- Together with NO Gres= in the NodeName line, to
ensure nodes do not drain needlessly.
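To make items 1-3 concrete, a minimal config sketch (device paths and
node names below are assumptions for illustration, not taken from any
real cluster):

```
# gres.conf - static "optimal" layout, assuming 2 gpus per node
# exposed as /dev/nvidia0 and /dev/nvidia1 (hypothetical paths)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

# slurm.conf fragment (node names hypothetical)
GresTypes=gpu
NodeName=node[01-16] Gres=gpu:2
FastSchedule=0
```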
Is that correct?
Are there better solutions to dynamically track availability of resources?
Currently with LSF we are using a custom elim script to let lsf know
about the availability of the resources.