Hi all,
We are considering using cgroups in a new GPU cluster, and I would like to
know what the current status of the devices part of the cgroups plugin is.
We have also observed that, in a job requesting gres, tasks that don't
explicitly request generic resources are not assigned any.
Example:
A job requests 1 GPU and 2 tasks with:
sbatch --gres=gpu:1 --ntasks=2 --cpus-per-task=2 --wrap="env; srun env |
grep CUDA"
The first env shows:
CUDA_VISIBLE_DEVICES=0
while "srun env" shows:
CUDA_VISIBLE_DEVICES=NoDevFiles
CUDA_VISIBLE_DEVICES=NoDevFiles
Is this the expected behavior?
Maybe, if a job requests gres and its steps don't, slurmstepd should not
overwrite the job environment in:
gres_gpu.c(211):
} else {
/* The gres.conf file must identify specific device files
* in order to set the CUDA_VISIBLE_DEVICES env var */
env_array_overwrite(job_env_ptr,"CUDA_VISIBLE_DEVICES",
"NoDevFiles");
}
--
Carles Fenoy