The current logic requires job steps to explicitly request the generic
resources (GRES, e.g. GPUs) to be allocated. This decision was based
upon users commonly running many job steps within a job allocation and
using different resources for each job step. If a job step inherits
all of the job's GRES by default, that would require job steps to
explicitly request no GRES if desired
(e.g. "srun --gres=gpu:0 ..."). This may not be the best design for
all users, but it is what exists today.
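A sketch of that pattern (illustrative only; it requires a Slurm installation with GPUs defined in gres.conf, and the application names are placeholders):

```shell
# Inside a job allocation that holds one GPU, each step states its own need.
# A step that should see the GPU requests it explicitly:
srun --gres=gpu:1 ./gpu_app

# Under the current design, a step that needs no GPU must also say so
# explicitly, otherwise it is not assigned any GRES:
srun --gres=gpu:0 hostname
```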
Moe
Quoting Carles Fenoy <[email protected]>:
Hi all,
We are considering using cgroups in a new GPU cluster, and I would like
to know the current status of the devices part of the cgroup plugin.
We have also observed that, within a job requesting GRES, tasks of steps
that do not explicitly request generic resources are not assigned any.
Example:
A job requests one GPU and two tasks with:
sbatch --gres=gpu:1 --ntasks=2 --cpus-per-task=2 --wrap="env; srun env | grep CUDA"
The first env shows:
CUDA_VISIBLE_DEVICES=0
although "srun env" shows:
CUDA_VISIBLE_DEVICES=NoDevFiles
CUDA_VISIBLE_DEVICES=NoDevFiles
Is this the expected behavior?
Maybe if a job requests GRES and its steps don't, slurmstepd should not
overwrite the job environment in:
gres_gpu.c(211):
	} else {
		/* The gres.conf file must identify specific device files
		 * in order to set the CUDA_VISIBLE_DEVICES env var */
		env_array_overwrite(job_env_ptr, "CUDA_VISIBLE_DEVICES",
				    "NoDevFiles");
	}
--
Carles Fenoy