You can get a count of GRES using "scontrol show job $SLURM_JOBID",
but that only reports the number of GPUs, not which specific GPUs were
allocated on each node. The information about the specific GPUs
allocated to a job is in the job credential that slurmd uses to set
the CUDA_VISIBLE_DEVICES environment variable, so it could probably be
made available to the prolog relatively easily.
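For the count at least, a prolog can parse the scontrol output directly. A minimal sketch (the Gres= field format is assumed from typical scontrol output and may differ between Slurm versions; the sample line here is illustrative, a real prolog would call scontrol show job "$SLURM_JOB_ID"):

```shell
# Sample of what `scontrol show job $SLURM_JOB_ID` might emit (assumed format)
sample='JobId=1234 JobName=test Gres=gpu:2 NodeList=node01'

# Extract the GPU count from the Gres= field; this gives how many GPUs
# were allocated, not which device indices they are.
gpu_count=$(printf '%s\n' "$sample" | sed -n 's/.*Gres=gpu:\([0-9]*\).*/\1/p')
echo "$gpu_count"
```

This only recovers the per-job count; identifying the specific devices would still require the credential information described above.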

Quoting Carles Fenoy <[email protected]>:

> Hi all,
>
> Is there any way to get the allocated GRES(GPU) to a job on each node? We
> have detected a problem with some devices that need to be rebooted from
> time to time, and I would prefer to restart the device in the prolog of the
> job. The problem is that I don't know how to get which device has been
> allocated to a job and cannot restart all the devices in a node without
> affecting already allocated jobs.
>
> Regards,
>
> --
> Carles Fenoy
> Barcelona Supercomputing Center
>
