Hi all,

Does anyone have ideas or suggestions on how to automatically cancel jobs which don't utilize the GPUs allocated to them?

The Slurm version in use is 19.05.

I'm thinking about collecting GPU utilization per process on all nodes with NVML/nvidia-smi, maintaining a running mean of the collected utilization per GPU, and cancelling a job if that mean stays below a to-be-defined threshold after a to-be-defined number of minutes.
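
A very rough sketch of what I have in mind is below. It samples per-GPU (not yet per-process) utilization via nvidia-smi, keeps a running mean per GPU, and calls scancel once the mean stays under a threshold for a grace period. The threshold and window values are placeholders, and gpu_index_to_jobid() is a hypothetical helper I'd still have to implement (e.g. by parsing "scontrol show job -d" output):

#!/usr/bin/env python3
# Sketch only: poll per-GPU utilization with nvidia-smi, keep a running
# mean per GPU, and cancel the owning job once the mean stays below a
# threshold for the whole observation window.
import subprocess
import time
from collections import defaultdict

POLL_SECONDS = 60          # sampling interval
GRACE_MINUTES = 30         # to-be-defined observation window
UTIL_THRESHOLD = 5.0       # to-be-defined mean utilization (%) cutoff

def sample_gpu_utilization():
    """Return {gpu_index: utilization_percent} from nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    util = {}
    for line in out.strip().splitlines():
        idx, gpu_util = line.split(",")
        util[int(idx)] = float(gpu_util)
    return util

def gpu_index_to_jobid(gpu_index):
    """Hypothetical helper: map a GPU index to the Slurm job that has it
    allocated, e.g. by parsing 'scontrol show job -d'. Not implemented."""
    return None

samples = defaultdict(list)   # gpu index -> utilization samples
window_start = time.time()

while True:
    for idx, util in sample_gpu_utilization().items():
        samples[idx].append(util)

    if time.time() - window_start >= GRACE_MINUTES * 60:
        for idx, vals in samples.items():
            mean_util = sum(vals) / len(vals)
            if mean_util < UTIL_THRESHOLD:
                jobid = gpu_index_to_jobid(idx)
                if jobid is not None:
                    subprocess.run(["scancel", str(jobid)])
        samples.clear()
        window_start = time.time()

    time.sleep(POLL_SECONDS)

This would run as a per-node daemon (or cron job); the open question for me is still the cleanest way to map GPUs back to job IDs and to do it per process rather than per device.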

Thank you for any input,

Cheers,
Stephan
