[slurm-dev] RE: Possible core allocation issues

Martin Perry Wed, 11 Feb 2015 09:22:08 -0800

Kai,

The Linux scheduler is not constrained by Slurm core allocations, so by default 
it can run Slurm tasks, and their child processes, on any core of the 
node. To constrain a Slurm job to its allocated cores, set 
TaskPlugin=task/cgroup in slurm.conf and ConstrainCores=yes in cgroup.conf. See 
the slurm.conf and cgroup.conf man pages for more information.
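
As a fragment, the two settings named above would look roughly like this. It is 
a minimal sketch only: any other cgroup.conf options a given site or Slurm 
version may need are not shown, and the man pages remain the reference.

```
# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
```

A change like this would normally need the Slurm daemons to be restarted or 
reconfigured before it applies to newly started jobs.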


Also, as you note, Slurm uses its own CPU numbering system, so the CPU_IDs 
reported by Slurm will not necessarily match the CPU numbers reported by Linux 
commands like top.
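
To see which cores Linux actually permits a process to use, rather than the 
last-used CPU that top shows in its P column, the affinity list can be read 
from /proc. The following is a minimal sketch: the helper name cpus_allowed is 
hypothetical, and parcas2 is simply the process name taken from the top output 
in the original message.

```shell
#!/bin/bash
# Print the Linux CPU affinity list for one PID, read from /proc.
# (cpus_allowed is a hypothetical helper name, not a Slurm or Linux command.)
cpus_allowed() {
    local pid="$1"
    awk '/^Cpus_allowed_list:/ {print $2}' "/proc/${pid}/status"
}

# Show the affinity of every parcas2 process on this node.
for pid in $(pgrep -x parcas2); do
    printf '%s\t%s\n' "$pid" "$(cpus_allowed "$pid")"
done
```

If no core constraint is in effect, each process would normally report the 
full range of cores on the node, which is consistent with tasks from different 
jobs ending up on the same core.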

Regards,
Martin Perry
Bull Phoenix

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Wednesday, February 11, 2015 4:34 AM
To: slurm-dev
Subject: [slurm-dev] Possible core allocation issues


Hi,

We are experiencing some possible core allocation issues in our system. One of 
the issues is that the system seems to allocate cores to be shared by several 
processes as shown by the output of the top command on one of the nodes:


top - 16:59:08 up 16 days, 23:37,  1 user,  load average: 11.99, 11.90, 11.58
Tasks: 333 total,  13 running, 320 sleeping,   0 stopped,   0 zombie
Cpu(s): 66.7%us,  0.1%sy,  0.0%ni, 33.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  99052292k total,  9088412k used, 89963880k free,   177592k buffers
Swap: 40959992k total,        0k used, 40959992k free,  6160760k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  P COMMAND
 9477 user1     20   0 10.7g 166m 6724 R 100.0  0.2  46:01.09  8 parcas2
 9478 user1     20   0 10.7g 163m 6464 R 100.0  0.2  46:01.08 10 parcas2
 9481 user1     20   0 10.7g 161m 6548 R 100.0  0.2  46:01.00  3 parcas2
 9479 user1     20   0 10.7g 164m 6692 R  99.6  0.2  45:58.78  1 parcas2
 9471 user1     20   0 10.7g 168m 7100 R  50.2  0.2  23:14.44  0 parcas2
 9476 user1     20   0 10.7g 164m 7084 R  50.2  0.2  23:14.83  6 parcas2
11620 user2     20   0 2047m  89m 4452 R  50.2  0.1   3:15.78  0 parcas_Fe-Fe_cu
11621 user2     20   0 2047m  88m 3824 R  50.2  0.1   3:15.79  2 parcas_Fe-Fe_cu
11622 user2     20   0 2047m  88m 3824 R  50.2  0.1   3:15.82  4 parcas_Fe-Fe_cu
 9472 user1     20   0 10.7g 166m 7168 R  49.8  0.2  23:14.72  2 parcas2
 9473 user1     20   0 10.7g 161m 7164 R  49.8  0.2  23:14.57  4 parcas2
11623 user2     20   0 2047m  88m 3824 R  49.8  0.1   3:15.82  6 parcas_Fe-Fe_cu


On this node the cores 0, 2, 4 and 6 are shared by the processes of the two 
users. There are altogether 12 tasks being run by the users on this node and 
since there are 12 cores per node (2 sockets, 6 cores each), there should in 
principle be no reason to share cores like this. Here are the details of the 
node in question:


NodeName=xxxx Arch=x86_64 CoresPerSocket=6
    CPUAlloc=12 CPUErr=0 CPUTot=12 CPULoad=11.92 Features=(null)
    Gres=(null)
    NodeAddr=al32 NodeHostName=al32 Version=14.03
    OS=Linux RealMemory=97000 AllocMem=72000 Sockets=2 Boards=1
    State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1
    BootTime=2014-12-29T17:21:18 SlurmdStartTime=2014-12-29T17:26:19
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0


According to the CPU load all the cores are in use, but cores 5, 7, 9 and 11 
are not listed as being in use by the top command. Also, if we look at the 
information available on the job that user2 has submitted, we see the 
following:


JobId=2257209 Name=testing2
    UserId=user2(xxx) GroupId=group2(xxx)
    Priority=93 Nice=0 Account=local QOS=normal
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
    DerivedExitCode=0:0
    RunTime=00:07:23 TimeLimit=6-23:59:00 TimeMin=N/A
    SubmitTime=2015-01-15T16:49:54 EligibleTime=2015-01-15T16:49:54
    StartTime=2015-01-15T16:52:34 EndTime=2015-01-22T16:51:34
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    Partition=8G_long_par AllocNode:Sid=alcyone:3537
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=al32
    BatchHost=al32
    NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
      Nodes=al32 CPU_IDs=2-3,6-7 Mem=8000
    MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
    Features=(null) Gres=(null) Reservation=(null)
    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
    Command=xxx
    WorkDir=xxx
    StdErr=xxx
    StdIn=xxx
    StdOut=xxx


Here we see that the job is allocated four cores. The IDs are different from 
the ones listed in top, which may be just because SLURM uses its own ID 
numbers. I know that the P column in the output of the top command refers to 
the "last used CPU". So is the apparent sharing of the processors just an 
artefact of how the top command reports CPU usage, or is SLURM in fact, at 
least at times, letting the jobs share resources? The latter is something that 
we would like to avoid.

The other, possibly related, issue is that a user reported only being allocated 
2 cores for a job even though the submit script requests 4 (thus doubling the 
time it takes to finish the run). The script he used was:


#!/bin/bash
#SBATCH -J Fe_noreppot_nogb
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 6-23:59:00
#SBATCH -p 8G_long_par


And the parallel run was started by:


module load mvapich2/1.9-intel
mpirun -np 4 ./parcas
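

One way to check what a job was actually given is to print the allocation from 
inside the batch script, before the mpirun line. This is a minimal sketch: 
report_alloc is a hypothetical helper name, and the values fall back to "unset" 
when the snippet is run outside a Slurm job.

```shell
#!/bin/bash
# Print what Slurm reports for the current job.
# (report_alloc is a hypothetical helper name.)
report_alloc() {
    echo "job_id=${SLURM_JOB_ID:-unset}"
    echo "ntasks=${SLURM_NTASKS:-unset}"
    echo "cpus_on_node=${SLURM_CPUS_ON_NODE:-unset}"
    # Detailed view, including the CPU_IDs line, if scontrol is available.
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v scontrol >/dev/null 2>&1; then
        scontrol -d show job "$SLURM_JOB_ID"
    fi
}

report_alloc
```

Comparing these values with the number of ranks that mpirun starts may help 
show whether the shortfall lies in the allocation or in how the MPI launcher 
places its processes.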


The partition he submitted his job to (and which was also used by the job by 
user2 above) has the following setup:


PartitionName=8G_long_par Nodes=xxx Default=NO MinNodes=1 MaxNodes=2
DefaultTime=10 MaxTime=30-0 Shared=NO Priority=5 State=UP


And for the whole system we've defined (among other things):


SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory


Basically there's nothing in the setup that in my opinion should cause the 
allocation of only half of the requested cores (using #SBATCH -c in the submit 
script instead of #SBATCH -n did not help) or the sharing of resources between 
several submitted jobs, but maybe I'm missing something.

Any suggestions on how to proceed in solving these issues would be welcome.

Best regards, Kai Ruusuvuori


--
Kai Ruusuvuori, PhD Student

Faculty of Science
Department of Physics
Division of Atmospheric Sciences

P.O. Box 64
Gustaf Hällströmin katu 2
00014 University of Helsinki
Finland
