There is a fix for this on GitHub; it is about a week old.

On July 23, 2014 6:42:02 AM PDT, "Kevin M. Hildebrand" <[email protected]> wrote:
>OK, I see what's happening, but I don't know why yet.  If the job is
>assigned non-contiguous nodes, the CPUs of the intervening nodes are
>being counted as well.  For example, if I'm assigned
>compute-b25-[0-2], NumCPUs is correct at 60 (three nodes' worth of
>CPUs).
>However, if I'm assigned compute-b25-[0,4], NumCPUs is incorrectly
>shown as 100 (five nodes' worth of CPUs instead of two).
>
>Kevin
>
>
>From: Kevin M. Hildebrand [mailto:[email protected]]
>Sent: Wednesday, July 23, 2014 8:51 AM
>To: slurm-dev
>Subject: [slurm-dev] Bizarre number of CPUs calculated after updating
>to 14.03.06
>
>Hi, we were running 14.03.03 and updated to 14.03.06 yesterday, and
>since then I've been seeing bizarre figures for NumCPUs in submitted
>jobs.
>For example, I submit a simple job as follows:
>
>> sbatch -n 30 slurmtest.script
>JobId=431695 Name=slurmtest.script
>
>> scontrol show job 431695
>JobId=431695 Name=slurmtest.script
>   UserId=kevin(7260) GroupId=glue-staff(8675)
>   Priority=40133 Nice=0 Account=bubba QOS=wide-short
>   JobState=RUNNING Reason=None Dependency=(null)
>   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
>   RunTime=00:00:10 TimeLimit=00:30:00 TimeMin=N/A
>   SubmitTime=2014-07-23T08:34:48 EligibleTime=2014-07-23T08:34:48
>   StartTime=2014-07-23T08:34:49 EndTime=2014-07-23T09:04:49
>   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>   Partition=standard AllocNode:Sid=deepthought2:21227
>   ReqNodeList=(null) ExcNodeList=(null)
>   NodeList=compute-b25-[24,37]
>   BatchHost=compute-b25-24
>   NumNodes=2 NumCPUs=280 CPUs/Task=1 ReqB:S:C:T=0:0:*:*   <---- NOTICE NumCPUs here
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
>   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>   Features=(null) Gres=(null) Reservation=(null)
>   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>   Command=/export/home/dt2-admin/kevin/slurmtest.script
>   WorkDir=/home/dt2-admin/kevin
>   StdErr=/home/dt2-admin/kevin/slurm-431695.out
>   StdIn=/dev/null
>   StdOut=/home/dt2-admin/kevin/slurm-431695.out
>
>This cluster is made up of nodes of 20 cores each, so I'd expect
>NumCPUs to be 40 since the job is exclusive.
>Here are the node records for the nodes that were assigned:
>
>> scontrol show node "compute-b25-[24,37]"
>NodeName=compute-b25-24 Arch=x86_64 CoresPerSocket=10
>   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=2.97 Features=(null)
>   Gres=(null)
>   NodeAddr=compute-b25-24 NodeHostName=compute-b25-24 Version=14.03
>   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
>   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
>   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:44
>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>NodeName=compute-b25-37 Arch=x86_64 CoresPerSocket=10
>   CPUAlloc=20 CPUErr=0 CPUTot=20 CPULoad=0.01 Features=(null)
>   Gres=(null)
>   NodeAddr=compute-b25-37 NodeHostName=compute-b25-37 Version=14.03
>   OS=Linux RealMemory=128000 AllocMem=0 Sockets=2 Boards=1
>   State=ALLOCATED ThreadsPerCore=1 TmpDisk=750000 Weight=1
>   BootTime=2014-07-01T09:03:34 SlurmdStartTime=2014-07-23T08:24:47
>   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>I'm seeing this behavior on two different clusters, both of which were
>updated to 14.03.06.  Was something changed recently that could explain
>this?
>
>Thanks,
>Kevin

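The numbers quoted above are all consistent with counting every node between the lowest and highest assigned index, rather than only the assigned nodes. Below is a minimal sketch of that arithmetic, assuming 20 CPUs per node as stated in the thread. The function names are hypothetical and this is not Slurm's actual code, only an illustration of the observed behavior.

```python
CPUS_PER_NODE = 20  # from the thread: every node has 20 cores


def expected_cpus(node_indices, cpus_per_node=CPUS_PER_NODE):
    """CPUs for exactly the assigned nodes (what NumCPUs should report)."""
    return len(set(node_indices)) * cpus_per_node


def span_cpus(node_indices, cpus_per_node=CPUS_PER_NODE):
    """CPUs if every node from the lowest to the highest index is counted."""
    return (max(node_indices) - min(node_indices) + 1) * cpus_per_node


# Node index sets taken from the thread:
# compute-b25-[0-2], compute-b25-[0,4], compute-b25-[24,37]
for nodes in ([0, 1, 2], [0, 4], [24, 37]):
    print(nodes, "expected:", expected_cpus(nodes), "span:", span_cpus(nodes))
```

For contiguous nodes the two counts agree (60 and 60). For compute-b25-[0,4] the span count gives 100 instead of 40, and for compute-b25-[24,37] it gives 280 instead of 40, matching the NumCPUs values reported above.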
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
