Both modes of operation are quite common (one step or many steps in a
job allocation). I believe that making the behavior configurable per
job, through an environment variable or command-line option, would be
ideal, but no such option exists today.
Moe
Quoting "Mark A. Grondona" <[email protected]>:
On Mon, 1 Aug 2011 14:14:55 -0700, "[email protected]"
<[email protected]> wrote:
The current logic requires job steps to explicitly request the generic
resources (GRES, e.g. GPUs) to be allocated. This decision was based
upon users commonly running many job steps within a job allocation and
using different resources for each job step. If a job step inherits
all of the job's GRES by default, that would require job steps to
explicitly request no GRES if desired
(e.g. "srun --gres=gpu:0 ..."). This may not be the best design for
all users, but it is what exists today.
The only problem with this approach is that it makes the common case
more difficult (most of the time users run a single job step per
allocation), in order to satisfy the uncommon case.
Could this behavior be made configurable?
mark
Moe
Quoting Carles Fenoy <[email protected]>:
> Hi all,
>
> We are considering using cgroups in a new GPU cluster, and I want to know
> what the current status of the devices part of the cgroups plugin is.
>
> We have also observed that, within a job requesting GRES, tasks of job
> steps that don't request generic resources explicitly are not assigned any.
> Example:
>
> A job requests 1 GPU for 2 tasks with
>
> sbatch --gres=gpu:1 --ntasks=2 --cpus-per-task=2 --wrap="env; srun env | grep CUDA"
>
> The first env shows:
> CUDA_VISIBLE_DEVICES=0
>
> although "srun env" shows:
> CUDA_VISIBLE_DEVICES=NoDevFiles
> CUDA_VISIBLE_DEVICES=NoDevFiles
>
> Is this the expected behavior?
>
> Maybe if a job requests GRES and its steps don't, slurmstepd should not
> overwrite the job environment in:
>
> gres_gpu.c(211):
>
>     } else {
>         /* The gres.conf file must identify specific device files
>          * in order to set the CUDA_VISIBLE_DEVICES env var */
>         env_array_overwrite(job_env_ptr, "CUDA_VISIBLE_DEVICES",
>                             "NoDevFiles");
>     }
>
>
> --
> Carles Fenoy
>