Hi all,

Does anyone have ideas or suggestions on how to automatically cancel jobs which don't utilize the GPUs allocated to them?

The Slurm version in use is 19.05.

I'm thinking about collecting GPU utilization per process on all nodes with NVML/nvidia-smi, maintaining a running mean of the collected utilization per GPU, and cancelling a job if that mean stays below a to-be-defined threshold after a to-be-defined number of minutes.
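
A very rough sketch of what I have in mind is below. It samples per-GPU (not yet per-process) utilization via nvidia-smi, keeps a running mean per GPU, and calls scancel once the mean stays under a threshold for a grace period. The threshold and window values are placeholders, and gpu_index_to_jobid() is a hypothetical helper I'd still have to implement (e.g. by parsing "scontrol show job -d" output):

#!/usr/bin/env python3
# Sketch only: poll per-GPU utilization with nvidia-smi, keep a running
# mean per GPU, and cancel the owning job once the mean stays below a
# threshold for the whole observation window.
import subprocess
import time
from collections import defaultdict

POLL_SECONDS = 60          # sampling interval
GRACE_MINUTES = 30         # to-be-defined observation window
UTIL_THRESHOLD = 5.0       # to-be-defined mean utilization (%) cutoff

def sample_gpu_utilization():
    """Return {gpu_index: utilization_percent} from nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    util = {}
    for line in out.strip().splitlines():
        idx, gpu_util = line.split(",")
        util[int(idx)] = float(gpu_util)
    return util

def gpu_index_to_jobid(gpu_index):
    """Hypothetical helper: map a GPU index to the Slurm job that has it
    allocated, e.g. by parsing 'scontrol show job -d'. Not implemented."""
    return None

samples = defaultdict(list)   # gpu index -> utilization samples
window_start = time.time()

while True:
    for idx, util in sample_gpu_utilization().items():
        samples[idx].append(util)

    if time.time() - window_start >= GRACE_MINUTES * 60:
        for idx, vals in samples.items():
            mean_util = sum(vals) / len(vals)
            if mean_util < UTIL_THRESHOLD:
                jobid = gpu_index_to_jobid(idx)
                if jobid is not None:
                    subprocess.run(["scancel", str(jobid)])
        samples.clear()
        window_start = time.time()

    time.sleep(POLL_SECONDS)

This would run as a per-node daemon (or cron job); the open question for me is still the cleanest way to map GPUs back to job IDs and to do it per process rather than per device.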

Thank you for any input,

Cheers,
Stephan
