salloc should be used to start an interactive session. If you would
like to end up on a node in the allocation by default, look at
the SallocDefaultCommand option in slurm.conf and this FAQ entry:
http://slurm.schedmd.com/faq.html#salloc_default_command.
Doing it your way with srun will consume resources, leaving you in the
position you are currently in.
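For reference, a rough sketch of what such a setting might look like. The srun options below are my recollection of the example in the Slurm documentation of that era, not something taken from this thread, so treat the exact option list as an assumption and check the FAQ for your version. The variable name salloc_default_line is purely illustrative.

```shell
# Sketch only: builds the line an administrator might add to slurm.conf.
# Assumption: the option list mirrors the example in the Slurm docs of the time.
# Single quotes keep $SHELL literal here, so it appears verbatim in the config line.
salloc_default_line='SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"'

# Print the line for review before adding it to slurm.conf by hand.
printf '%s\n' "$salloc_default_line"
```

With a setting of this kind, a bare salloc would drop the user into a shell started by srun on the allocated node rather than on the submitting host.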
On 05/07/15 09:35, Igor Kozin wrote:
Chris, Mehdi, thank you. I must say that, based on what I'd read, salloc appeared to me
to be the command for starting interactive jobs, while srun is to "Run parallel jobs".
If I understand it correctly now, srun must be used to start interactive sessions:
srun --ntasks=1 --mem-per-cpu=1000 --pty /bin/bash
(and salloc should probably be removed from the list of tools available to our
users).
Now, if I set in slurm.conf
DefMemPerCPU=800
MaxMemPerCPU=1600
and run srun --ntasks=1 --pty /bin/bash
I get
memory.limit_in_bytes 838860800
I can still exceed the per-CPU maximum on the command line, but at the cost of
being allocated more cores:
srun --ntasks=1 --mem-per-cpu=2000 --pty /bin/bash
memory.limit_in_bytes 2097152000
cpuset.cpus 0-2
It's only when I hit the limit of (number of cores x 1600 MB) that I get an error:
srun --ntasks=1 --mem-per-cpu=20000 --pty /bin/bash
srun: Force Terminated job 52
srun: error: CPU count per node can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available
So far so good.
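As a sanity check on the numbers above: the memory.limit_in_bytes values are just the per-CPU memory figures multiplied by 1024 x 1024, i.e. Slurm's megabytes are treated as MiB. A minimal sketch of that arithmetic (the function name mib_to_bytes is made up for illustration):

```shell
# Convert a Slurm memory value in megabytes (treated as MiB) to bytes.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 800     # DefMemPerCPU=800       -> 838860800
mib_to_bytes 2000    # --mem-per-cpu=2000     -> 2097152000
```

Both results match the cgroup limits reported for the two srun invocations.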
-----Original Message-----
From: Mehdi Denou
Sent: 07 May 2015 13:07
To: slurm-dev
Subject: [slurm-dev] Re: cgroups support in slurm (sbatch vs salloc)
Here is another example which is (from my point of view) less confusing:
[root@host1 ~]# salloc -N 1
salloc: Granted job allocation 8
[root@host1 ~]# srun hostname
host9
[root@host1 ~]# hostname
host1
[root@host1 ~]# exit
exit
salloc: Relinquishing job allocation 8
salloc: Job allocation 8 has been revoked.
[root@host1 ~]#
On 07/05/2015 13:28, Chris Samuel wrote:
On Thu, 7 May 2015 04:01:25 AM Igor Kozin wrote:
My real question is why running
salloc --mem-per-cpu=1000 --ntasks=1 bash
does not create cgroups and therefore gets you an unlimited interactive
session?
My understanding is that salloc will give you a session on the same node you
run it on, and you then need to use srun to launch a process on the assigned
compute node (and thus into the relevant control group).
To demonstrate, here is an example from one of our systems (Slurm 14.03.11),
first just running hostname in salloc so you can see the shell is on the same
node:
[samuel@merri ~]$ salloc hostname
salloc: Pending job allocation 2096414
salloc: job 2096414 queued and waiting for resources
salloc: job 2096414 has been allocated resources
salloc: Granted job allocation 2096414
merri
salloc: Relinquishing job allocation 2096414
[samuel@merri ~]$
Now running hostname with srun inside salloc to show it appears on the compute
node instead:
[samuel@merri ~]$ salloc srun hostname
salloc: Pending job allocation 2096415
salloc: job 2096415 queued and waiting for resources
salloc: job 2096415 has been allocated resources
salloc: Granted job allocation 2096415
Scratch directory /scratch/merri/jobs/2096415 has been allocated
merri009
salloc: Relinquishing job allocation 2096415
Now to demonstrate that the one on the login node has (as expected) no cgroup
whilst the one run with srun does run inside a cgroup:
[samuel@merri ~]$ salloc cat /proc/self/cpuset
salloc: Pending job allocation 2096416
salloc: job 2096416 queued and waiting for resources
salloc: job 2096416 has been allocated resources
salloc: Granted job allocation 2096416
/
salloc: Relinquishing job allocation 2096416
salloc: Job allocation 2096416 has been revoked.
[samuel@merri ~]$
[samuel@merri ~]$ salloc srun cat /proc/self/cpuset
salloc: Pending job allocation 2096417
salloc: job 2096417 queued and waiting for resources
salloc: job 2096417 has been allocated resources
salloc: Granted job allocation 2096417
Scratch directory /scratch/merri/jobs/2096417 has been allocated
/slurm/uid_500/job_2096417/step_0
salloc: Relinquishing job allocation 2096417
salloc: Job allocation 2096417 has been revoked.
[samuel@merri ~]$
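If you want to make the same check from inside a session, something along these lines might do. It is only a sketch: the helper name cpuset_status is hypothetical, and it simply treats "/" (the root cpuset) as unconfined and any other path as confined.

```shell
# Hypothetical helper: classify a cpuset path as printed by
# `cat /proc/self/cpuset`. "/" is the root cpuset, i.e. no job cgroup.
cpuset_status() {
  if [ "$1" = "/" ]; then
    echo "unconfined"
  else
    echo "confined: $1"
  fi
}

# The two values seen in the examples above.
cpuset_status "/"
cpuset_status "/slurm/uid_500/job_2096417/step_0"
```

In a live shell you would call it as cpuset_status "$(cat /proc/self/cpuset)".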
Hope that helps!
All the best,
Chris