There is a fix in GitHub about a week old.

On July 23, 2014 6:42:02 AM PDT, "Kevin M. Hildebrand" <[email protected]> wrote:
>Ok, I see what's happening, but don't know why yet. If the job is
>assigned non-contiguous nodes, for some reason the CPUs for the
>intervening nodes are being counted. i.e., if I'm assigned
>compute-b25-[0-2], NumCPUs is correct, at 60 (three nodes worth of
>CPUs)
>However, if I'm assigned compute-b25-[0,4], NumCPUs is incorrect, and
>is shown as 100 (five nodes worth of CPUs instead of two).
>
>Kevin
>
>
>From: Kevin M. Hildebrand [mailto:[email protected]]
>Sent: Wednesday, July 23, 2014 8:51 AM
>To: slurm-dev
>Subject: [slurm-dev] Bizarre number of CPUs calculated after updating
>to 14.03.06
>
>Hi, we were running 14.03.03 and updated to 14.03.06 yesterday, and
>since then I've been seeing bizarre figures for NumCPUs in submitted
>jobs.
>For example, I submit a simple job as follows:
>
>> sbatch -n 30 slurmtest.script
>JobId=431695 Name=slurmtest.script
>
>> scontrol show job 431695
>JobId=431695 Name=slurmtest.script
>   UserId=kevin(7260) GroupId=glue-staff(8675)
>   Priority=40133 Nice=0 Account=bubba QOS=wide-short
>   JobState=RUNNING Reason=None Dependency=(null)
>   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
>   RunTime=00:00:10 TimeLimit=00:30:00 TimeMin=N/A
>   SubmitTime=2014-07-23T08:34:48 EligibleTime=2014-07-23T08:34:48
>   StartTime=2014-07-23T08:34:49 EndTime=2014-07-23T09:04:49
>   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>   Partition=standard AllocNode:Sid=deepthought2:21227
>   ReqNodeList=(null) ExcNodeList=(null)
>   NodeList=compute-b25-[24,37]
>   BatchHost=compute-b25-24
>   NumNodes=2 NumCPUs=280 CPUs/Task=1 ReqB:S:C:T=0:0:*:*   <---- NOTICE NumCPUs here
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
>   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>   Features=(null) Gres=(null) Reservation=(null)
>   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>   Command=/export/home/dt2-admin/kevin/slurmtest.script
>   WorkDir=/home/dt2-admin/kevin
>   StdErr=/home/dt2-admin/kevin/slurm-431695.out
>   StdIn=/dev/null
>   StdOut=/home/dt2-admin/kevin/slurm-431695.out
>
>This cluster is made up of nodes of 20 cores each, so I'd expect
>NumCPUs to be 40 since the job is exclusive.
>Here's the node records for the nodes that were assigned:
>
>> scontrol show node "compute-b25-[24,37]"
>NodeName=compute-b25-24 Arch=x86_64 CoresPerSocket=10
>   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=2.97 Features=(null)
>   Gres=(null)
>   NodeAddr=compute-b25-24 NodeHostName=compute-b25-24 Version=14.03
>   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
>   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
>   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:44
>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>NodeName=compute-b25-37 Arch=x86_64 CoresPerSocket=10
>   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=0.01 Features=(null)
>   Gres=(null)
>   NodeAddr=compute-b25-37 NodeHostName=compute-b25-37 Version=14.03
>   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
>   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
>   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:47
>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>I'm seeing this behavior on two different clusters, both of which were
>updated to 14.03.06. Was something changed recently that could explain
>this?
>
>Thanks,
>Kevin
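
For anyone checking their own jobs, the numbers quoted above are consistent with the CPUs being counted over the whole index range between the first and last allocated node, rather than over the allocated nodes only. The sketch below is just that arithmetic in Python. It is not Slurm code; the function names are made up for illustration, and the 20 CPUs per node comes from CPUTot=20 in the node records.

```python
# Arithmetic model of the reported NumCPUs values. This is NOT Slurm code;
# expected_cpus() and span_cpus() are hypothetical names used only here.

CPUS_PER_NODE = 20  # CPUTot=20 in the quoted node records


def expected_cpus(node_indices, cpus_per_node=CPUS_PER_NODE):
    """CPUs for an exclusive job: count only the nodes actually allocated."""
    return len(set(node_indices)) * cpus_per_node


def span_cpus(node_indices, cpus_per_node=CPUS_PER_NODE):
    """CPUs if every node between the lowest and highest index is counted."""
    span = max(node_indices) - min(node_indices) + 1
    return span * cpus_per_node


if __name__ == "__main__":
    cases = {
        "compute-b25-[0-2]": [0, 1, 2],
        "compute-b25-[0,4]": [0, 4],
        "compute-b25-[24,37]": [24, 37],
    }
    for nodelist, indices in cases.items():
        print(nodelist, "expected:", expected_cpus(indices),
              "span-based:", span_cpus(indices))
```

The span-based figure reproduces all three reported values: 60 for [0-2], 100 for [0,4], and 280 for [24,37] (14 nodes in the range 24..37, times 20), where 40 was expected.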
