Hi all,

We are considering using cgroups in a new GPU cluster, and I would like to
know the current status of the devices part of the cgroups plugin.

We have also observed that, for a job requesting GRES, the steps that don't
request generic resources explicitly are not assigned any devices.
Example:

A job requests one GPU and two tasks with

sbatch --gres=gpu:1 --ntasks=2 --cpus-per-task=2 --wrap="env; srun env |
grep CUDA"

The first env (in the batch script) shows:
CUDA_VISIBLE_DEVICES=0

while "srun env" shows:
CUDA_VISIBLE_DEVICES=NoDevFiles
CUDA_VISIBLE_DEVICES=NoDevFiles

Is this the expected behavior?

Maybe, if a job requests GRES and its steps don't, slurmstepd should not
overwrite the job environment in:

gres_gpu.c(211):

        } else {
                /* The gres.conf file must identify specific device files
                 * in order to set the CUDA_VISIBLE_DEVICES env var */
                env_array_overwrite(job_env_ptr,"CUDA_VISIBLE_DEVICES",
                                    "NoDevFiles");
        }


-- 
Carles Fenoy
