Looks like commit #abb65255968d8cdafd90e2337131d53b9578cd82. I've just grabbed it and verified that it indeed fixes the problem.
Thanks for the quick reply!

Kevin

From: Morris Jette [mailto:[email protected]]
Sent: Wednesday, July 23, 2014 10:05 AM
To: slurm-dev
Subject: [slurm-dev] RE: Bizarre number of CPUs calculated after updating to 14.03.06

There is a fix in github about a week old.

On July 23, 2014 6:42:02 AM PDT, "Kevin M. Hildebrand" <[email protected]> wrote:

Ok, I see what's happening, but don't know why yet. If the job is assigned non-contiguous nodes, for some reason the CPUs for the intervening nodes are being counted.
i.e., if I'm assigned compute-b25-[0-2], NumCPUs is correct, at 60 (three nodes worth of CPUs)
However, if I'm assigned compute-b25-[0,4], NumCPUs is incorrect, and is shown as 100 (five nodes worth of CPUs instead of two).

Kevin

From: Kevin M. Hildebrand [mailto:[email protected]]
Sent: Wednesday, July 23, 2014 8:51 AM
To: slurm-dev
Subject: [slurm-dev] Bizarre number of CPUs calculated after updating to 14.03.06

Hi, we were running 14.03.03 and updated to 14.03.06 yesterday, and since then I've been seeing bizarre figures for NumCPUs in submitted jobs.
For example, I submit a simple job as follows:

> sbatch -n 30 slurmtest.script
JobId=431695 Name=slurmtest.script

> scontrol show job 431695
JobId=431695 Name=slurmtest.script
   UserId=kevin(7260) GroupId=glue-staff(8675)
   Priority=40133 Nice=0 Account=bubba QOS=wide-short
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2014-07-23T08:34:48 EligibleTime=2014-07-23T08:34:48
   StartTime=2014-07-23T08:34:49 EndTime=2014-07-23T09:04:49
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=standard AllocNode:Sid=deepthought2:21227
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-b25-[24,37]
   BatchHost=compute-b25-24
   NumNodes=2 NumCPUs=280 CPUs/Task=1 ReqB:S:C:T=0:0:*:*     <---- NOTICE NumCPUs here
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/export/home/dt2-admin/kevin/slurmtest.script
   WorkDir=/home/dt2-admin/kevin
   StdErr=/home/dt2-admin/kevin/slurm-431695.out
   StdIn=/dev/null
   StdOut=/home/dt2-admin/kevin/slurm-431695.out

This cluster is made up of nodes of 20 cores each, so I'd expect NumCPUs to be 40 since the job is exclusive.
Here's the node records for the nodes that were assigned:

> scontrol show node "compute-b25-[24,37]"
NodeName=compute-b25-24 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=2.97 Features=(null)
   Gres=(null)
   NodeAddr=compute-b25-24 NodeHostName=compute-b25-24 Version=14.03
   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:44
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=compute-b25-37 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=compute-b25-37 NodeHostName=compute-b25-37 Version=14.03
   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:47
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I'm seeing this behavior on two different clusters, both of which were updated to 14.03.06. Was something changed recently that could explain this?

Thanks,
Kevin

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
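
For anyone checking whether their own jobs are affected, the arithmetic in this thread can be reproduced with a small sketch. This is not Slurm code and says nothing about where the bug actually lives; it only compares the expected count (allocated nodes times cores per node) with the count obtained if every node between the lowest and highest index is included, which matches the figures reported above (100 for compute-b25-[0,4], 280 for compute-b25-[24,37]). The function names and the simple bracket parser are hypothetical helpers written for illustration, and the parser handles only a single bracketed group of numbers and ranges, assuming 20 cores per node as on this cluster.

```python
import re

CORES_PER_NODE = 20  # from the thread: every node on this cluster has 20 cores


def expand_indices(nodelist):
    """Expand the bracketed part of a simple node list, e.g. 'compute-b25-[0-2]'.

    Hypothetical helper: supports one bracket group with comma-separated
    numbers and 'lo-hi' ranges only, not full Slurm hostlist syntax.
    """
    match = re.search(r"\[([^\]]+)\]", nodelist)
    if match is None:
        raise ValueError("no bracketed index list in %r" % nodelist)
    indices = []
    for part in match.group(1).split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            indices.append(int(part))
    return indices


def expected_cpus(nodelist, cores=CORES_PER_NODE):
    """CPUs for an exclusive job: number of allocated nodes times cores per node."""
    return len(expand_indices(nodelist)) * cores


def span_cpus(nodelist, cores=CORES_PER_NODE):
    """CPUs if the intervening nodes are (wrongly) counted as well."""
    idx = expand_indices(nodelist)
    return (max(idx) - min(idx) + 1) * cores


if __name__ == "__main__":
    for nodes in ("compute-b25-[0-2]", "compute-b25-[0,4]", "compute-b25-[24,37]"):
        print(nodes, "expected:", expected_cpus(nodes), "span-based:", span_cpus(nodes))
```

If the NumCPUs value reported by scontrol show job matches the span-based figure rather than the expected one for a non-contiguous NodeList, the job is most likely showing the symptom described in this thread.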
