On 7/26/20 12:21 pm, Paul Raines wrote:

> Thank you so much.  This also explains my GPU CUDA_VISIBLE_DEVICES missing
> problem in my previous post.

I'd missed that, but yes, that would do it.

> As a new SLURM admin, I am a bit surprised at this default behavior.
> Seems like a way for users to game the system by never running srun.

This is because, by default, salloc only requests a job allocation; it expects you to then use srun to run an application on a compute node. But yes, it is non-obvious (as evidenced by the number of "sinteractive" and other scripts out there that folks have written without realising the SallocDefaultCommand config option exists - I wrote one back in 2013!).
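To illustrate the default behaviour (the hostnames and job ID below are just made up):

    login01$ salloc -N1 -t 01:00:00
    salloc: Granted job allocation 12345
    login01$ hostname          # the shell salloc gives you is still on the login node
    login01
    login01$ srun hostname     # only srun actually runs on the allocated compute node
    node042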

> The only limit, I suppose, that is really being enforced at that point
> is walltime?

Well, the user isn't on the compute node, so there's really nothing else to enforce.

> I guess I need to research srun and SallocDefaultCommand more, but is
> there some way to set a separate walltime limit on a job for the time a
> salloc has to run srun?  It is not clear if one can make a
> SallocDefaultCommand that does "srun ..." that really covers all
> possibilities.

An srun inside of a salloc (just like an sbatch) should not be able to exceed the time limit for the job allocation.
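For example (the times and application here are only illustrative):

    $ salloc -t 00:30:00
    salloc: Granted job allocation 12346
    $ srun ./my_app    # the step runs under the allocation's 30 minute walltime,
                       # and is killed along with the job when that limit is reached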

If it helps, this is the SallocDefaultCommand we use for our GPU nodes:

srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 -G 0 --gpus-per-task=0 --gpus-per-node=0 --gpus-per-socket=0 --pty --preserve-env --mpi=none -m block $SHELL

We have to spell out all those permutations so that this srun does not consume any of the GPU GRES; otherwise, if the salloc asked for GPUs, this first srun would grab them, and when the user then tries to "srun" their application across the nodes it would block because none would be available on this first node.
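In slurm.conf that all goes in as one quoted string, along the lines of the following (double-check the quoting and escaping for your own site):

    SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 -G 0 --gpus-per-task=0 --gpus-per-node=0 --gpus-per-socket=0 --pty --preserve-env --mpi=none -m block $SHELL"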

Of course, because of this the user can't see the GPUs until they run srun, which can confuse some people, but it's unavoidable for this use case.

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
