Not sure if this will answer it or not, but a good way to check that the Slurm 
daemons are seeing the CPUs would be to run 'slurmd -C' on the compute nodes to 
print the hardware configuration each one detects:


[root@ssc001 ~]# su -c "slurmd -C" slurm
NodeName=ssc001 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64403 TmpDisk=384659
UpTime=7-12:12:53


Perhaps that will at least help narrow down the search window.
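

If the qexec nodes do report CPUs=4 there, a rough next step (just a sketch, and 
assuming the slurm.conf you posted is the same copy on the controller and all 
three compute nodes) would be to check what the controller currently believes 
about a node and then make the daemons re-read the file:

[root@vm-qmaster ~]# scontrol show node vm-qexec1
[root@vm-qmaster ~]# scontrol reconfigure

With FastSchedule=1 the CPU count comes from slurm.conf rather than from what 
slurmd reports, so CPUS=1 and S:C:T=1:1:1 in sinfo would suggest slurmctld 
loaded a node definition other than the CPUs=4 line you quoted (a stale or 
different copy of the file, or daemons that have not been restarted or 
reconfigured since it changed). That's only a guess, though.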


~~
Ade

________________________________
From: Jeff Avila <j...@exa.com>
Sent: 18 May 2017 21:17:29
To: slurm-dev
Subject: [slurm-dev] discrepancy between node config and # of cpus found


Good Morning All,

We have a small (one control node, three compute nodes) SLURM test cluster. The 
three compute nodes are KVM virtual machines, each with four processors.
We're having trouble starting jobs that use more than one task per node; in 
fact, running a job with more than three tasks seems to be impossible.

What I think is the salient part of slurm.conf (the full file is appended at 
the end of this email) is shown here:


NodeName=vm-qexec[1-3] CPUs=4 RealMemory=1000 State=UNKNOWN
PartitionName=normal Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
PartitionName=batch  Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP

"cat /proc/cpuinfo | grep processor | wc -l" yields “4” for each of the qexec 
nodes, so I’m a little confused at the output I get from sinfo, although it 
does seem to explain the behavior I’m seeing:

[root@vm-qmaster ~]# sinfo -Nle
Thu May 18 15:15:01 2017
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
vm-qexec1      1    batch*        idle    1    1:1:1      1        0      1   (null) none
vm-qexec1      1    normal        idle    1    1:1:1      1        0      1   (null) none
vm-qexec2      1    batch*        idle    1    1:1:1      1        0      1   (null) none
vm-qexec2      1    normal        idle    1    1:1:1      1        0      1   (null) none
vm-qexec3      1    batch*        idle    1    1:1:1      1        0      1   (null) none
vm-qexec3      1    normal        idle    1    1:1:1      1        0      1   (null) none
[root@vm-qmaster ~]#
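
Is there anything else worth checking to see which node definition is actually 
being used, e.g. something like the following (output not pasted here)?

[root@vm-qexec1 ~]# lscpu
[root@vm-qmaster ~]# scontrol show node vm-qexec1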

How, exactly, do I get SLURM to recognise the rest of the CPUs I have on each 
node?

Thanks!

-Jeff



<full slurm.conf below>





slurm.conf
[root@vm-qmaster slurm]# more slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=vm-qmaster.exa.com
ControlAddr=149.65.154.1
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=vm-qexec[1-3] CPUs=4 RealMemory=1000 State=UNKNOWN
PartitionName=normal Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
PartitionName=batch  Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP