The shared option is working as designed; see "man slurm.conf". Your batch script asks for sharing (#SBATCH --share) and the partition permits it (Shared=YES), so select/cons_res is allowed to over-subscribe the CPUs of those jobs: by default up to four willing-to-share jobs may be allocated to each resource (Shared=YES is equivalent to Shared=YES:4). That is why a 13th job starts on node01 even though none of the first 12 has finished.
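If you want each CPU never to be over-subscribed, one fix is at the partition level: with SelectType=select/cons_res and Shared=NO, every CPU is allocated to exactly one job, and a 13th two-CPU job on a 24-CPU node has to wait. A minimal sketch, reusing the partition line from your slurm.conf quoted below:

---------------- 8< ---- snip ---- 8< ---------------------------
# Keep nodes shared between jobs, but never over-subscribe an
# individual CPU (any --share request by a job is then ignored).
PartitionName=serial AllowGroups=ALL Nodes=node[01-03] Default=YES
Shared=NO DefMemPerCPU=4096 MaxMemPerCPU=16384 MaxTime=480 State=UP
---------------- 8< ---- snap ---- 8< ---------------------------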
Quoting Olaf Gellert <[email protected]>:

> Hi there,
>
> I am running a serial partition on three nodes. So the nodes
> are shared and each one should run some jobs (as many as memory
> or cpus/cores permit). The nodes have 2 sockets, each CPU has
> 6 cores and SMT (2 threads), and slurm "knows" that:
>
> [2013-07-03T09:22:42+02:00] node:node01 cpus:24 c:6 s:2 t:2 mem:90112
> a_mem:0 state:0
> [2013-07-03T09:22:42+02:00] node:node02 cpus:24 c:6 s:2 t:2 mem:90112
> a_mem:0 state:0
> [2013-07-03T09:22:42+02:00] node:node03 cpus:24 c:6 s:2 t:2 mem:90112
> a_mem:0 state:0
>
> Memory constraints seem to work. When I start jobs using 2 CPUs
> each, slurm puts the first 12 jobs on node01, the next 12 jobs on
> node02, the next 12 jobs on node03, and then starts the next jobs
> additionally on node01 (though none of the first 12 jobs has
> finished yet).
>
> The jobs have this in their header:
>
> ---------------- 8< ---- snip ---- 8< ---------------------------
> #SBATCH -p serial
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --ntasks=2
> #SBATCH --cpus-per-task=1
> #SBATCH --mem=60mb
> #SBATCH --share
> #SBATCH --time=01:01:00
>
> srun --mpi=openmpi ./testprog
> ---------------- 8< ---- snap ---- 8< ---------------------------
>
> For each job, processes like these are started:
>
> 11:59 00:00:00 srun
> 11:59 00:00:00 srun
> 11:59 00:00:11 /what/ever/./testprog
> 11:59 00:00:11 /what/ever/./testprog
>
> For each job, slurm seems to know that it needs 2 CPUs:
>
> [2013-07-03T12:06:40+02:00] job_id:907 nhosts:1 ncpus:2 node_req:0
> nodes=miklip08
> [2013-07-03T12:06:40+02:00] Node[0]:
> [2013-07-03T12:06:40+02:00] Mem(MB):60:0 Sockets:2 Cores:6 CPUs:2:0
> [2013-07-03T12:06:40+02:00] Socket[0] Core[0] is allocated
> [2013-07-03T12:06:40+02:00] --------------------
> [2013-07-03T12:06:40+02:00] cpu_array_value[0]:2 reps:1
> [2013-07-03T12:06:40+02:00] ====================
>
> So after 12 jobs on a node, the next one should have to wait
> (until 2 CPUs are available)... Anything that I am missing?
>
> Regards, Olaf
>
> P.S.: The relevant section from slurm.conf:
>
> ---------------- 8< ---- snip ---- 8< ---------------------------
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=1
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> #SelectTypeParameters=CR_Core_Memory
> SelectTypeParameters=CR_CPU_Memory
>
> # COMPUTE NODES
> NodeName=node[01-03] RealMemory=90112 Sockets=2 CoresPerSocket=6
> ThreadsPerCore=2 State=UNKNOWN
> PartitionName=serial AllowGroups=ALL Nodes=node[01-03] Default=YES
> Shared=YES DefMemPerCPU=4096 MaxMemPerCPU=16384 MaxTime=480 State=UP
> ---------------- 8< ---- snap ---- 8< ---------------------------
>
> --
> Dipl. Inform. Olaf Gellert          email [email protected]
> Deutsches Klimarechenzentrum GmbH   phone +49 (0)40 460094 214
> Bundesstrasse 45a                   fax   +49 (0)40 460094 270
> D-20146 Hamburg, Germany            www   http://www.dkrz.de
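Alternatively, fix it per job: with Shared=YES, cons_res only over-subscribes resources for jobs that explicitly request it, so it is enough to drop the --share line from the batch header. A sketch, otherwise identical to your quoted script:

---------------- 8< ---- snip ---- 8< ---------------------------
#SBATCH -p serial
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=60mb
#SBATCH --time=01:01:00
# no --share here: this job's CPUs cannot be over-subscribed

srun --mpi=openmpi ./testprog
---------------- 8< ---- snap ---- 8< ---------------------------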

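Either way, you can verify the effective sharing level with scontrol; the output below is illustrative, not copied from a real run (on your current configuration I would expect the implicit job count of 4 to show up; newer Slurm releases report this field as OverSubscribe):

---------------- 8< ---- snip ---- 8< ---------------------------
$ scontrol show partition serial | grep -oi 'shared=[^ ]*'
Shared=YES:4
---------------- 8< ---- snap ---- 8< ---------------------------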