Hi,
We are experiencing some possible core allocation issues in our
system. One of the issues is that the system seems to allocate cores
to be shared by several processes as shown by the output of the top
command on one of the nodes:
top - 16:59:08 up 16 days, 23:37, 1 user, load average: 11.99, 11.90, 11.58
Tasks: 333 total, 13 running, 320 sleeping, 0 stopped, 0 zombie
Cpu(s): 66.7%us, 0.1%sy, 0.0%ni, 33.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 99052292k total, 9088412k used, 89963880k free, 177592k buffers
Swap: 40959992k total, 0k used, 40959992k free, 6160760k cached
  PID USER   PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  P COMMAND
 9477 user1  20   0 10.7g 166m 6724 R 100.0  0.2 46:01.09  8 parcas2
 9478 user1  20   0 10.7g 163m 6464 R 100.0  0.2 46:01.08 10 parcas2
 9481 user1  20   0 10.7g 161m 6548 R 100.0  0.2 46:01.00  3 parcas2
 9479 user1  20   0 10.7g 164m 6692 R  99.6  0.2 45:58.78  1 parcas2
 9471 user1  20   0 10.7g 168m 7100 R  50.2  0.2 23:14.44  0 parcas2
 9476 user1  20   0 10.7g 164m 7084 R  50.2  0.2 23:14.83  6 parcas2
11620 user2  20   0 2047m  89m 4452 R  50.2  0.1  3:15.78  0 parcas_Fe-Fe_cu
11621 user2  20   0 2047m  88m 3824 R  50.2  0.1  3:15.79  2 parcas_Fe-Fe_cu
11622 user2  20   0 2047m  88m 3824 R  50.2  0.1  3:15.82  4 parcas_Fe-Fe_cu
 9472 user1  20   0 10.7g 166m 7168 R  49.8  0.2 23:14.72  2 parcas2
 9473 user1  20   0 10.7g 161m 7164 R  49.8  0.2 23:14.57  4 parcas2
11623 user2  20   0 2047m  88m 3824 R  49.8  0.1  3:15.82  6 parcas_Fe-Fe_cu
On this node the cores 0, 2, 4 and 6 are shared by the processes of
the two users. There are altogether 12 tasks being run by the users on
this node and since there are 12 cores per node (2 sockets, 6 cores
each), there should in principle be no reason to share cores like
this. Here are the details of the node in question:
NodeName=xxxx Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUErr=0 CPUTot=12 CPULoad=11.92 Features=(null)
Gres=(null)
NodeAddr=al32 NodeHostName=al32 Version=14.03
OS=Linux RealMemory=97000 AllocMem=72000 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1
BootTime=2014-12-29T17:21:18 SlurmdStartTime=2014-12-29T17:26:19
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
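To make the sharing in the top listing easier to check, the process rows can be grouped by the P column. The following is only a rough sketch: it assumes the column layout shown above (field 9 is %CPU, field 12 is P), and the function name cores_summary and the file name top_rows.txt are made up for illustration.

```bash
#!/bin/bash
# Summarise top process rows per core.
# Output columns: core id, number of processes, summed %CPU.
# Assumes field 9 is %CPU and field 12 is P, as in the listing above.
cores_summary() {
    awk '$1 ~ /^[0-9]+$/ { n[$12]++; cpu[$12] += $9 }   # skip non-PID lines
         END { for (c in n) printf "%d %d %.1f\n", c, n[c], cpu[c] }' |
        sort -n
}

# Usage (top_rows.txt is a hypothetical file holding the process rows):
#   cores_summary < top_rows.txt
```

For the twelve rows above this gives two processes on each of cores 0, 2, 4 and 6, one process on each of cores 1, 3, 8 and 10, and roughly 100 %CPU per core. That is 800 in total, which agrees with the 66.7%us in the summary line (8 of 12 cores busy).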
According to the CPU load all the cores are in use, but cores 5, 7, 9
and 11 are not listed as being in use by the top command. Also, if we
look at the information available on the job that user2 has
submitted, we see the following:
JobId=2257209 Name=testing2
UserId=user2(xxx) GroupId=group2(xxx)
Priority=93 Nice=0 Account=local QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:07:23 TimeLimit=6-23:59:00 TimeMin=N/A
SubmitTime=2015-01-15T16:49:54 EligibleTime=2015-01-15T16:49:54
StartTime=2015-01-15T16:52:34 EndTime=2015-01-22T16:51:34
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=8G_long_par AllocNode:Sid=alcyone:3537
ReqNodeList=(null) ExcNodeList=(null)
NodeList=al32
BatchHost=al32
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
Nodes=al32 CPU_IDs=2-3,6-7 Mem=8000
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=xxx
WorkDir=xxx
StdErr=xxx
StdIn=xxx
StdOut=xxx
Here we see that the job is allocated four cores. The IDs differ from
the ones listed in top, which may simply be because SLURM uses its own
ID numbering. I know that the P column in the output of the top
command refers to the "last used CPU", so is the apparent sharing of
the processors just an artefact of how the top command reports CPU
usage, or is SLURM in fact, at least at times, letting the jobs share
resources? The latter is something that we would like to avoid.
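One way to tell the two cases apart might be to look at the CPU affinity of the running processes instead of the P column. The sketch below is a minimal one under stated assumptions: it relies on the Cpus_allowed_list field that the Linux kernel exposes in /proc/<pid>/status, and the function names expand_cpu_list and show_affinity are hypothetical helpers, not part of SLURM.

```bash
#!/bin/bash
# Expand a CPU list such as "2-3,6-7" into "2 3 6 7".
expand_cpu_list() {
    local parts part lo hi cpu out=()
    IFS=, read -ra parts <<< "$1"        # split the list on commas
    for part in "${parts[@]}"; do
        lo=${part%-*}                    # text before the dash (or the item)
        hi=${part#*-}                    # text after the dash (or the item)
        for ((cpu = lo; cpu <= hi; cpu++)); do
            out+=("$cpu")
        done
    done
    echo "${out[*]}"
}

# Print the CPUs the kernel allows each given PID to run on.
show_affinity() {
    local pid list
    for pid in "$@"; do
        list=$(awk '/^Cpus_allowed_list:/ { print $2 }' \
                   "/proc/$pid/status" 2>/dev/null)
        echo "$pid ${list:-unknown}"
    done
}

# Usage on the node, with the PIDs of user2's job from the top listing:
#   show_affinity 11620 11621 11622 11623
#   expand_cpu_list 2-3,6-7
```

If every process reports the full range 0-11, nothing is pinning the processes to the allocated cores and the kernel scheduler is free to place them anywhere. If each job's processes report only that job's own cores, the placement shown in the P column is not the result of an open affinity mask.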
The other, possibly related, issue is that a user reported being
allocated only 2 cores for a job even though the submit script
requests 4 (thus doubling the time it takes to finish the run). The
script he used was:
#!/bin/bash
#SBATCH -J Fe_noreppot_nogb
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 6-23:59:00
#SBATCH -p 8G_long_par
And the parallel run was started by:
module load mvapich2/1.9-intel
mpirun -np 4 ./parcas
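To see what the job actually receives, a few diagnostic lines could be added to the submit script just before the mpirun call. This is a sketch under assumptions: SLURM_NTASKS and SLURM_JOB_CPUS_PER_NODE are environment variables that SLURM sets inside a batch job, Cpus_allowed_list is the Linux kernel's affinity field, and report_allocation is a hypothetical helper name.

```bash
#!/bin/bash
# Report what SLURM granted and which CPUs the kernel allows this job
# script (and the processes it starts) to run on.
report_allocation() {
    # Outside a SLURM job these variables are unset, so print "unset".
    echo "ntasks=${SLURM_NTASKS:-unset}"
    echo "cpus_per_node=${SLURM_JOB_CPUS_PER_NODE:-unset}"
    # grep inherits the affinity mask of the calling shell.
    grep '^Cpus_allowed_list:' /proc/self/status 2>/dev/null ||
        echo "Cpus_allowed_list: unknown"
}

report_allocation
```

If the output shows ntasks=4 and cpus_per_node=4, SLURM has granted what the script asked for, and the missing cores would have to be explained by where the MPI processes end up running rather than by the allocation itself.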
The partition he submitted his job to (and which was also used by the
job by user2 above) has the following setup:
PartitionName=8G_long_par Nodes=xxx Default=NO MinNodes=1 MaxNodes=2
DefaultTime=10 MaxTime=30-0 Shared=NO Priority=5 State=UP
And for the whole system we've defined (among other things):
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
As far as I can see, nothing in this setup should cause only half of
the requested cores to be allocated (using #SBATCH -c in the submit
script instead of #SBATCH -n did not help) or resources to be shared
between several submitted jobs, but maybe I'm missing something.
Any suggestions on how to proceed in solving these issues would be welcome.
Best regards, Kai Ruusuvuori
--
Kai Ruusuvuori, PhD Student
Faculty of Science
Department of Physics
Division of Atmospheric Sciences
P.O. Box 64
Gustaf Hällströmin katu 2
00014 University of Helsinki
Finland