Looks like commit #abb65255968d8cdafd90e2337131d53b9578cd82

I've just grabbed it and verified that it indeed fixes the problem.

Thanks for the quick reply!

Kevin

From: Morris Jette [mailto:[email protected]]
Sent: Wednesday, July 23, 2014 10:05 AM
To: slurm-dev
Subject: [slurm-dev] RE: Bizarre number of CPUs calculated after updating to 14.03.06

There is a fix on GitHub that is about a week old.
On July 23, 2014 6:42:02 AM PDT, "Kevin M. Hildebrand" <[email protected]> wrote:
Ok, I see what's happening, but don't know why yet.  If the job is assigned 
non-contiguous nodes, the CPUs of the intervening nodes are also being 
counted.  For example, if I'm assigned compute-b25-[0-2], NumCPUs is correct, 
at 60 (three nodes' worth of CPUs).
However, if I'm assigned compute-b25-[0,4], NumCPUs is incorrect, and is shown 
as 100 (five nodes' worth of CPUs instead of two).
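
The figures in this thread are consistent with that explanation, including the NumCPUs=280 in the original report below for compute-b25-[24,37]. A minimal Python sketch of the arithmetic, assuming 20 CPUs per node as stated in the original report; the function names are hypothetical and this is only an illustration, not Slurm's actual code:

```python
CPUS_PER_NODE = 20  # assumption taken from the original report: 20 cores per node

def expected_cpus(node_indices):
    """CPUs for exactly the nodes that were allocated."""
    return len(set(node_indices)) * CPUS_PER_NODE

def span_cpus(node_indices):
    """CPUs if every node between the lowest and highest index is counted too."""
    return (max(node_indices) - min(node_indices) + 1) * CPUS_PER_NODE

# Node index lists taken from the examples in this thread.
for nodes in ([0, 1, 2], [0, 4], [24, 37]):
    print(nodes, "expected:", expected_cpus(nodes), "span:", span_cpus(nodes))
```

For contiguous nodes the two counts agree (60), while for [0,4] and [24,37] the span-based count gives 100 and 280, matching the incorrect NumCPUs values reported.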


Kevin




From: Kevin M. Hildebrand [mailto:[email protected]]
Sent: Wednesday, July 23, 2014 8:51 AM
To: slurm-dev
Subject: [slurm-dev] Bizarre number of CPUs calculated after updating to 14.03.06


Hi, we were running 14.03.03 and updated to 14.03.06 yesterday, and since then 
I've been seeing bizarre figures for NumCPUs in submitted jobs.
For example, I submit a simple job as follows:


> sbatch -n 30 slurmtest.script
JobId=431695 Name=slurmtest.script


> scontrol show job 431695
JobId=431695 Name=slurmtest.script
   UserId=kevin(7260) GroupId=glue-staff(8675)
   Priority=40133 Nice=0 Account=bubba QOS=wide-short
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2014-07-23T08:34:48 EligibleTime=2014-07-23T08:34:48
   StartTime=2014-07-23T08:34:49 EndTime=2014-07-23T09:04:49
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=standard AllocNode:Sid=deepthought2:21227
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-b25-[24,37]
   BatchHost=compute-b25-24
   NumNodes=2 NumCPUs=280 CPUs/Task=1 ReqB:S:C:T=0:0:*:*   <---- NOTICE NumCPUs here
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/export/home/dt2-admin/kevin/slurmtest.script
   WorkDir=/home/dt2-admin/kevin
   StdErr=/home/dt2-admin/kevin/slurm-431695.out
   StdIn=/dev/null
   StdOut=/home/dt2-admin/kevin/slurm-431695.out


This cluster is made up of nodes of 20 cores each, so I'd expect NumCPUs to be 
40 since the job is exclusive.
Here are the node records for the nodes that were assigned:


> scontrol show node "compute-b25-[24,37]"
NodeName=compute-b25-24 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=2.97 Features=(null)
   Gres=(null)
   NodeAddr=compute-b25-24 NodeHostName=compute-b25-24 Version=14.03
   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:44
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s



NodeName=compute-b25-37 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=0.01 Features=(null)
   Gres=(null)
   NodeAddr=compute-b25-37 NodeHostName=compute-b25-37 Version=14.03
   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:47
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
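
The expected total of 40 can be cross-checked by summing the CPUTot fields of the node records above. A minimal Python sketch, assuming the `scontrol show node` output has been captured as text; the helper name `total_cpus` is hypothetical and the embedded sample is a trimmed copy of the records shown above:

```python
import re

# Trimmed copy of the two node records shown above.
NODE_RECORDS = """\
NodeName=compute-b25-24 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=2.97 Features=(null)
NodeName=compute-b25-37 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=0.01 Features=(null)
"""

def total_cpus(scontrol_text):
    """Sum every CPUTot=<n> field found in the captured scontrol output."""
    return sum(int(n) for n in re.findall(r"\bCPUTot=(\d+)", scontrol_text))

print(total_cpus(NODE_RECORDS))  # 40, versus the NumCPUs=280 shown for the job
```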




I'm seeing this behavior on two different clusters, both of which were updated 
to 14.03.06.  Was something changed recently that could explain this?


Thanks,
Kevin


