[slurm-dev] Possible core allocation issues

kai . ruusuvuori Wed, 11 Feb 2015 03:33:40 -0800

Hi,

We are experiencing some possible core allocation issues in our system. One of the issues is that the system seems to allocate cores to be shared by several processes, as shown by the output of the top command on one of the nodes:



top - 16:59:08 up 16 days, 23:37,  1 user,  load average: 11.99, 11.90, 11.58
Tasks: 333 total,  13 running, 320 sleeping,   0 stopped,   0 zombie
Cpu(s): 66.7%us,  0.1%sy,  0.0%ni, 33.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  99052292k total,  9088412k used, 89963880k free,   177592k buffers
Swap: 40959992k total,        0k used, 40959992k free,  6160760k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   P COMMAND
 9477 user1    20   0 10.7g 166m 6724 R 100.0  0.2  46:01.09  8 parcas2
 9478 user1    20   0 10.7g 163m 6464 R 100.0  0.2  46:01.08 10 parcas2
 9481 user1    20   0 10.7g 161m 6548 R 100.0  0.2  46:01.00  3 parcas2
 9479 user1    20   0 10.7g 164m 6692 R 99.6  0.2  45:58.78  1 parcas2
 9471 user1    20   0 10.7g 168m 7100 R 50.2  0.2  23:14.44  0 parcas2
 9476 user1    20   0 10.7g 164m 7084 R 50.2  0.2  23:14.83  6 parcas2
11620 user2    20   0 2047m  89m 4452 R 50.2  0.1   3:15.78  0 parcas_Fe-Fe_cu
11621 user2    20   0 2047m  88m 3824 R 50.2  0.1   3:15.79  2 parcas_Fe-Fe_cu
11622 user2    20   0 2047m  88m 3824 R 50.2  0.1   3:15.82  4 parcas_Fe-Fe_cu
 9472 user1    20   0 10.7g 166m 7168 R 49.8  0.2  23:14.72  2 parcas2
 9473 user1    20   0 10.7g 161m 7164 R 49.8  0.2  23:14.57  4 parcas2
11623 user2    20   0 2047m  88m 3824 R 49.8  0.1   3:15.82  6 parcas_Fe-Fe_cu

On this node the cores 0, 2, 4 and 6 are shared by the processes of the two users. There are altogether 12 tasks being run by the users on this node, and since there are 12 cores per node (2 sockets, 6 cores each), there should in principle be no reason to share cores like this. Here are the details of the node in question:
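The same tally can be made without reading the top output by eye. The following is only a sketch: `shared_cores` is a hypothetical helper name, and it assumes a procps-style `ps` whose `psr` field reports the processor a process last ran on (the same information as the P column in top).

```shell
#!/bin/bash
# Hypothetical helper: read "CORE COMMAND" pairs on stdin and print every
# core that hosts more than one of the listed processes, with the count.
shared_cores() {
    awk '{ count[$1]++ }
         END { for (c in count) if (count[c] > 1) print c, count[c] }' | sort -n
}

# Possible usage on a compute node (filter on the program name of interest):
#   ps -e -o psr= -o comm= | grep parcas | shared_cores
```

Like the P column, this is a snapshot of where processes last ran, so it shows apparent sharing at one instant rather than a fixed assignment.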



NodeName=xxxx Arch=x86_64 CoresPerSocket=6
   CPUAlloc=12 CPUErr=0 CPUTot=12 CPULoad=11.92 Features=(null)
   Gres=(null)
   NodeAddr=al32 NodeHostName=al32 Version=14.03
   OS=Linux RealMemory=97000 AllocMem=72000 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-12-29T17:21:18 SlurmdStartTime=2014-12-29T17:26:19
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

According to the CPU load all the cores are in use, but the cores 5, 7, 9 and 11 are not listed as being in use by the top command. Also, if we look at the information available on the job that user2 has submitted, we see the following:



JobId=2257209 Name=testing2
   UserId=user2(xxx) GroupId=group2(xxx)
   Priority=93 Nice=0 Account=local QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:07:23 TimeLimit=6-23:59:00 TimeMin=N/A
   SubmitTime=2015-01-15T16:49:54 EligibleTime=2015-01-15T16:49:54
   StartTime=2015-01-15T16:52:34 EndTime=2015-01-22T16:51:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=8G_long_par AllocNode:Sid=alcyone:3537
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=al32
   BatchHost=al32
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
     Nodes=al32 CPU_IDs=2-3,6-7 Mem=8000
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=xxx
   WorkDir=xxx
   StdErr=xxx
   StdIn=xxx
   StdOut=xxx

Here we see that the job is allocated four cores. The IDs are different from the ones listed in top, which may be just because SLURM uses its own ID numbers. I know that the P column in the output of the top command refers to the "last used CPU", so is the apparent sharing of the processors just an artefact of how the top command reports CPU usage, or is SLURM in fact - at least at times - letting the jobs share resources? The latter is something that we would like to avoid.
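One way to tell the two cases apart might be to look at the CPU affinity of the running processes rather than at the last used CPU. The sketch below assumes a Linux node with /proc mounted; `allowed_cpus` is a hypothetical helper name. It prints the `Cpus_allowed_list` field of /proc/<pid>/status, i.e. the set of CPUs the kernel permits the process to run on.

```shell
#!/bin/bash
# Hypothetical helper: print the list of CPUs a process is allowed to run on,
# as reported by the kernel in /proc/<pid>/status (Linux only).
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status"
}

# Example: show the affinity of the current shell.
allowed_cpus $$
```

If the processes of both jobs report the full range of the node (0-11 here), nothing confines them to the cores listed in CPU_IDs, and the kernel scheduler is free to place two of them on the same core. `taskset -cp <pid>` from util-linux reports the same information.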

The other, possibly related, issue is that a user reported being allocated only 2 cores for a job even though the submit script requests 4 (thus doubling the time it takes to finish the run). The script he used was:



#!/bin/bash
#SBATCH -J Fe_noreppot_nogb
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 6-23:59:00
#SBATCH -p 8G_long_par


And the parallel run was started by:


module load mvapich2/1.9-intel
mpirun -np 4 ./parcas
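To see what the job was actually given, a few lines could be added to the submit script before the mpirun call. This is only a diagnostic sketch: `report_alloc` is a hypothetical helper name, while `SLURM_NTASKS` and `SLURM_JOB_CPUS_PER_NODE` are environment variables that SLURM sets inside a batch job.

```shell
#!/bin/bash
# Hypothetical helper: print the task and CPU counts SLURM exported to the job.
# Outside a SLURM job the variables are not set and "unset" is printed instead.
report_alloc() {
    echo "ntasks=${SLURM_NTASKS:-unset} cpus_per_node=${SLURM_JOB_CPUS_PER_NODE:-unset}"
}

report_alloc
```

Comparing this output with the CPU_IDs reported by `scontrol show job -d <jobid>` would show whether the allocation itself is short or whether the four MPI processes are simply ending up on two cores.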

The partition he submitted his job to (and which was also used by the job by user2 above) has the following setup:

PartitionName=8G_long_par Nodes=xxx Default=NO MinNodes=1 MaxNodes=2 DefaultTime=10 MaxTime=30-0 Shared=NO Priority=5 State=UP



And for the whole system we've defined (among other things):


SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

Basically, there's nothing in the setup that in my opinion should cause the allocation of only half of the requested cores (using #SBATCH -c in the submit script instead of #SBATCH -n did not help) or the sharing of resources between several submitted jobs, but maybe I'm missing something.


Any suggestions on how to proceed in solving these issues would be welcome.

Best regards, Kai Ruusuvuori


--
Kai Ruusuvuori, PhD Student

Faculty of Science
Department of Physics
Division of Atmospheric Sciences

P.O. Box 64
Gustaf Hällströmin katu 2
00014 University of Helsinki
Finland
