Thank you all for your help and suggestions; they are much
appreciated! I will try the task/cgroup plugin and the --exclusive
option, and experiment with Slurm to find out what is most
appropriate for my server.
On 06/23/2015 02:59 PM, Morris Jette wrote:
See the task/cgroup plugin for constraining jobs to specific CPUs and
memory. Also see the --exclusive option in srun.
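
For readers following along, here is a minimal sketch of what enabling CPU confinement through cgroups might look like. It assumes a Slurm build with cgroup support and a Linux node with cgroups available; the exact set of options needed can differ between Slurm versions, so treat it as a starting point and check the man pages for your release.

```text
# slurm.conf (additions; assumed to sit alongside the existing settings)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf (separate file in the same directory as slurm.conf)
ConstrainCores=yes
```

With ConstrainCores=yes, a job that requested 8 CPUs but starts 10 worker processes is confined to its 8 allocated CPUs, so the extra processes compete with each other rather than with other users' jobs.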
On June 23, 2015 4:44:27 AM PDT, "Rémi Piatek" <[email protected]> wrote:
I had considered this simple explanation, but it seemed unlikely to me,
as it would imply that we have to rely completely on users to correctly
specify the number of CPUs they need. People use my server for
CPU-intensive jobs, so it is important for me to make sure that
resources are shared fairly. I was hoping Slurm would allow me to do
this and prevent people from free-riding (as it stands, it would be
easy to request a small number of CPUs and use a much larger number,
thus slowing down the other users).
I read that when jobs exceed the memory requested and allocated by
slurm, they are automatically interrupted. Is there nothing similar for
the use of CPUs?
Thanks for the help! Much appreciated.
On 06/23/2015 12:05 PM, Loris Bennett wrote:
Rémi Piatek <[email protected]> writes:
Hello,

I am getting started with SLURM and I am having a hard time
understanding how it allocates CPUs to users depending on the
resources they request. The problem I am facing can be summarized
as follows. Consider a bash script test.sh that requests 8 CPUs
but actually starts a job that uses 10 CPUs:

    #!/bin/sh
    #SBATCH --ntasks=8
    stress -c 10

On a server with 32 CPUs, if I start this script 5 times with
sbatch test.sh, 4 of them start running right away and the last
one appears as pending, as shown by the squeue command:

    JOBID PARTITION    NAME USER ST TIME NODES NODELIST(REASON)
        5      main test.sh jack PD 0:00     1 (Resources)
        1      main test.sh jack  R 0:08     1 server
        2      main test.sh jack  R 0:08     1 server
        3      main test.sh jack  R 0:05     1 server
        4      main test.sh jack  R 0:05     1 server

The problem is that these 4 jobs are actually using 40 CPUs and
overload the server. I would on the contrary expect SLURM either
not to start jobs that actually use more resources than the user
requested, or to put them on hold until there are enough
resources to start them. How can I make sure that the users of
my server do not start jobs that use too many CPUs?

Some useful details about my slurm.conf file:

    # SCHEDULING
    #DefMemPerCPU=0
    FastSchedule=1
    #MaxMemPerCPU=0
    SchedulerType=sched/backfill
    SchedulerPort=7321
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU

    # COMPUTE NODES
    NodeName=server CPUs=32 RealMemory=10000 State=UNKNOWN

    # PARTITIONS
    PartitionName=main Nodes=server Default=YES Shared=YES MaxTime=INFINITE State=UP

I am probably making a trivial mistake in the configuration
file, or just misunderstanding a basic concept of SLURM. Any
help or advice would be much appreciated. Many thanks in
advance!
Slurm just keeps track of how many cores have been assigned to
running jobs - it doesn't check how many processes are
actually started within a given job. So, it is up to the user
to make sure she starts the correct number of processes.
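
As a rough illustration of that point, a batch script can size its workload from the allocation instead of hard-coding it. The sketch below assumes the job is submitted with --ntasks, so that Slurm sets SLURM_NTASKS in the job environment, and that stress is installed on the node. The helper name allocated_cpus is hypothetical and not part of Slurm.

```shell
#!/bin/sh
#SBATCH --ntasks=8

# Print the number of tasks Slurm allocated to this job.
# Falls back to 1 when SLURM_NTASKS is unset (e.g. when run outside Slurm).
allocated_cpus() {
    echo "${SLURM_NTASKS:-1}"
}

# Only start the workload inside a Slurm job, and only if stress is available.
# The worker count follows the allocation rather than a hard-coded value.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v stress >/dev/null 2>&1; then
    stress -c "$(allocated_cpus)"
fi
```

Submitted with sbatch, this would start 8 stress workers for the 8 requested tasks. It still relies on the user's cooperation; it does not enforce anything.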
Cheers, Loris