Hi,

We are seeing what look like core allocation problems on our system. The first is that cores seem to be shared by several processes, as shown by the output of the top command on one of the nodes:


top - 16:59:08 up 16 days, 23:37,  1 user,  load average: 11.99, 11.90, 11.58
Tasks: 333 total,  13 running, 320 sleeping,   0 stopped,   0 zombie
Cpu(s): 66.7%us,  0.1%sy,  0.0%ni, 33.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  99052292k total,  9088412k used, 89963880k free,   177592k buffers
Swap: 40959992k total,        0k used, 40959992k free,  6160760k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   P COMMAND
 9477 user1    20   0 10.7g 166m 6724 R 100.0  0.2  46:01.09  8 parcas2
 9478 user1    20   0 10.7g 163m 6464 R 100.0  0.2  46:01.08 10 parcas2
 9481 user1    20   0 10.7g 161m 6548 R 100.0  0.2  46:01.00  3 parcas2
 9479 user1    20   0 10.7g 164m 6692 R 99.6  0.2  45:58.78  1 parcas2
 9471 user1    20   0 10.7g 168m 7100 R 50.2  0.2  23:14.44  0 parcas2
 9476 user1    20   0 10.7g 164m 7084 R 50.2  0.2  23:14.83  6 parcas2
11620 user2    20   0 2047m  89m 4452 R 50.2  0.1   3:15.78  0 parcas_Fe-Fe_cu
11621 user2    20   0 2047m  88m 3824 R 50.2  0.1   3:15.79  2 parcas_Fe-Fe_cu
11622 user2    20   0 2047m  88m 3824 R 50.2  0.1   3:15.82  4 parcas_Fe-Fe_cu
 9472 user1    20   0 10.7g 166m 7168 R 49.8  0.2  23:14.72  2 parcas2
 9473 user1    20   0 10.7g 161m 7164 R 49.8  0.2  23:14.57  4 parcas2
11623 user2    20   0 2047m  88m 3824 R 49.8  0.1   3:15.82  6 parcas_Fe-Fe_cu
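
A tally of the P column makes the overlap in the listing above easier to see. The following is only a sketch: the function name shared_cpus is made up for illustration, and it assumes that P is field 12 as in this listing (the position changes with the fields top is configured to show).

```shell
#!/bin/bash
# Read top process lines on stdin and print every CPU ID (P column,
# assumed to be field 12) that appears for more than one process,
# followed by the PIDs seen on that CPU.
shared_cpus() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 13 {
             pids[$12] = pids[$12] " " $1
             n[$12]++
         }
         END {
             for (c in n) if (n[c] > 1) print c ":" pids[c]
         }' | sort -n
}

# Three lines copied from the listing above; prints "0: 9471 11620".
shared_cpus <<'EOF'
 9471 user1    20   0 10.7g 168m 7100 R 50.2  0.2  23:14.44  0 parcas2
11620 user2    20   0 2047m  89m 4452 R 50.2  0.1   3:15.78  0 parcas_Fe-Fe_cu
 9477 user1    20   0 10.7g 166m 6724 R 100.0  0.2  46:01.09  8 parcas2
EOF
```

On a live node the same function could be fed from top in batch mode (top -b -n 1), provided the P field is enabled there as well.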


On this node, cores 0, 2, 4 and 6 are each shared by processes of the two users. The users are running 12 tasks in total on this node, and since each node has 12 cores (2 sockets, 6 cores each), there should in principle be no reason to share cores like this. Here are the details of the node in question:


NodeName=xxxx Arch=x86_64 CoresPerSocket=6
   CPUAlloc=12 CPUErr=0 CPUTot=12 CPULoad=11.92 Features=(null)
   Gres=(null)
   NodeAddr=al32 NodeHostName=al32 Version=14.03
   OS=Linux RealMemory=97000 AllocMem=72000 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-12-29T17:21:18 SlurmdStartTime=2014-12-29T17:26:19
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0


According to the CPU load, all the cores are in use, but cores 5, 7, 9 and 11 are not listed as being in use by the top command. Also, if we look at the information available on the job that user2 has submitted, we see the following:


JobId=2257209 Name=testing2
   UserId=user2(xxx) GroupId=group2(xxx)
   Priority=93 Nice=0 Account=local QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:07:23 TimeLimit=6-23:59:00 TimeMin=N/A
   SubmitTime=2015-01-15T16:49:54 EligibleTime=2015-01-15T16:49:54
   StartTime=2015-01-15T16:52:34 EndTime=2015-01-22T16:51:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=8G_long_par AllocNode:Sid=alcyone:3537
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=al32
   BatchHost=al32
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
     Nodes=al32 CPU_IDs=2-3,6-7 Mem=8000
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=xxx
   WorkDir=xxx
   StdErr=xxx
   StdIn=xxx
   StdOut=xxx


Here we see that the job is allocated four cores. The IDs differ from the ones listed in top, which may simply be because SLURM uses its own ID numbering. I know that the P column in the output of the top command refers to the "last used CPU". So is the apparent sharing of cores just an artefact of how top reports CPU usage, or is SLURM in fact, at least at times, letting the jobs share resources? The latter is something that we would like to avoid.
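
Since P only reports the last CPU a process ran on, the set of CPUs each process is allowed to run on is probably the more telling figure. Below is a minimal sketch of how that could be read on the node, assuming a Linux kernel that exposes Cpus_allowed_list in /proc/<pid>/status. The helper names expand_cpu_list and allowed_cpus are made up for this example.

```shell
#!/bin/bash
# Expand a kernel-style CPU list such as "0-2,5" into single CPU IDs
# separated by spaces.
expand_cpu_list() {
    local list=$1 part start end i out=()
    local IFS=,
    for part in $list; do
        start=${part%-*}   # text before the dash (or the whole item)
        end=${part#*-}     # text after the dash (or the whole item)
        for ((i = start; i <= end; i++)); do
            out+=("$i")
        done
    done
    IFS=' '
    echo "${out[*]}"
}

# Print the CPU list a given PID is allowed to run on.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$1/status"
}

# Demo on the current shell, only where /proc is available.
if [ -r "/proc/$$/status" ]; then
    echo "PID $$ may run on CPUs: $(expand_cpu_list "$(allowed_cpus $$)")"
fi
```

If every parcas process reports the full range 0-11, the processes are not bound to their allocated cores and the kernel is free to move them around, which would be consistent with the top listing. As far as I understand, whether SLURM binds tasks at all depends on the TaskPlugin setting in slurm.conf.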

The other, possibly related, issue is that a user reported being allocated only 2 cores for a job even though the submit script requests 4 (thus doubling the time it takes to finish the run). The script he used was:


#!/bin/bash
#SBATCH -J Fe_noreppot_nogb
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 6-23:59:00
#SBATCH -p 8G_long_par


And the parallel run was started by:


module load mvapich2/1.9-intel
mpirun -np 4 ./parcas
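
To see what the job was actually given, the submit script could print the allocation before calling mpirun. This is a sketch, assuming the usual Slurm environment variables SLURM_NTASKS, SLURM_JOB_CPUS_PER_NODE and SLURM_CPUS_ON_NODE are set in the batch environment. The helper name report_allocation is hypothetical.

```shell
#!/bin/bash
# Print what Slurm reports for this job and return non-zero if the node
# offers fewer CPUs than expected.
# Usage: report_allocation EXPECTED_CPUS
report_allocation() {
    local expected=$1
    local cpus=${SLURM_CPUS_ON_NODE:-unset}
    echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"
    echo "SLURM_JOB_CPUS_PER_NODE=${SLURM_JOB_CPUS_PER_NODE:-unset}"
    echo "SLURM_CPUS_ON_NODE=$cpus"
    if [ "$cpus" = unset ] || [ "$cpus" -lt "$expected" ]; then
        echo "WARNING: expected $expected CPUs, Slurm reports $cpus" >&2
        return 1
    fi
}

# Outside a Slurm job the variables are unset, so this only warns.
report_allocation 4 || true
```

Comparing this output with the CPU_IDs field shown by scontrol for the same job should tell whether the shortfall is in the allocation itself or in how the MPI processes are placed afterwards.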


The partition he submitted his job to (and which was also used by the job by user2 above) has the following setup:


PartitionName=8G_long_par Nodes=xxx Default=NO MinNodes=1 MaxNodes=2 DefaultTime=10 MaxTime=30-0 Shared=NO Priority=5 State=UP


And for the whole system we've defined (among other things):


SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory


As far as I can tell, nothing in this setup should cause either the allocation of only half of the requested cores (using #SBATCH -c in the submit script instead of #SBATCH -n did not help) or the sharing of resources between several submitted jobs, but maybe I'm missing something.

Any suggestions on how to proceed in solving these issues would be welcome.

Best regards, Kai Ruusuvuori


--
Kai Ruusuvuori, PhD Student

Faculty of Science
Department of Physics
Division of Atmospheric Sciences

P.O. Box 64
Gustaf Hällströmin katu 2
00014 University of Helsinki
Finland
