Yes, I did give that a try, though it didn’t seem to make any difference to the error messages I got.
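For anyone following along, here is roughly what the change suggested later in this thread would look like in our slurm.conf. This is a sketch I have not yet verified on our cluster, so treat the exact parameter choices as assumptions:

```
# Replace exclusive whole-node allocation with consumable resources,
# so several tasks/jobs can share a node's four CPUs:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# If memory is tracked per allocation, RealMemory should not exceed what
# "slurmd -C" reports (995 MB here, vs. the 1000 currently configured):
NodeName=vm-qexec[1-3] CPUs=4 RealMemory=995 State=UNKNOWN
```

The edited file would need to be copied to all nodes; I believe a SelectType change requires restarting slurmctld and slurmd rather than just running "scontrol reconfigure".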
slurmd -C on our three compute nodes shows 4 CPUs each, as expected, since each VM has four CPUs. Here’s the result from one:

[root@vm-qexec1 ~]# slurmd -C
ClusterName=cluster
NodeName=vm-qexec1 CPUs=4 Boards=1 SocketsPerBoard=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=995 TmpDisk=8651

Thanks for the help!

> On May 19, 2017, at 5:11 AM, Adam Huffman <adam.huff...@gmail.com> wrote:
>
> Hello
>
> While this doesn't necessarily address the strange sinfo output, to be able
> to run more than one job per node you need to change this:
>
> SelectType=select/linear
>
> to
>
> SelectType=select/cons_res
>
> and then choose which parameters you want, e.g.
>
> SelectTypeParameters=CR_Core_Memory
>
> See https://slurm.schedmd.com/cons_res.html
>
> Cheers,
> Adam
>
> On Fri, May 19, 2017 at 9:23 AM, Ade Fewings <ade.fewi...@hpcwales.co.uk> wrote:
>
> Not sure if it's going to answer or not, but a good way to check that the
> Slurm daemons are seeing the CPUs would be to run 'slurmd -C' on the
> compute nodes to print the hardware configuration they see:
>
> [root@ssc001 ~]# su -c "slurmd -C" slurm
> NodeName=ssc001 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64403 TmpDisk=384659
> UpTime=7-12:12:53
>
> Perhaps that will at least help narrow down the search window.
>
> ~~
> Ade
>
> From: Jeff Avila <j...@exa.com>
> Sent: 18 May 2017 21:17:29
> To: slurm-dev
> Subject: [slurm-dev] discrepancy between node config and # of cpus found
>
> Good Morning All,
>
> We have a small (one control node, three compute node) SLURM test cluster.
> The three compute nodes are KVM virtual machines, each with four
> processors. We’re having trouble starting jobs using more than one task
> per node; in fact, running a job with more than three tasks seems to be
> impossible.
>
> What I think should be the salient part of slurm.conf (the entirety of the
> file is appended to the end of this email) is shown here:
>
> NodeName=vm-qexec[1-3] CPUs=4 RealMemory=1000 State=UNKNOWN
> PartitionName=normal Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
> PartitionName=batch Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
>
> "cat /proc/cpuinfo | grep processor | wc -l" yields “4” on each of the
> qexec nodes, so I’m a little confused by the output I get from sinfo,
> although it does seem to explain the behavior I’m seeing:
>
> [root@vm-qmaster ~]# sinfo -Nle
> Thu May 18 15:15:01 2017
> NODELIST   NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> vm-qexec1  1     batch*    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec1  1     normal    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec2  1     batch*    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec2  1     normal    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec3  1     batch*    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec3  1     normal    idle  1    1:1:1 1      0        1      (null)   none
> [root@vm-qmaster ~]#
>
> How, exactly, do I get SLURM to recognise the rest of the CPUs I have on
> each node?
>
> Thanks!
>
> -Jeff
>
> <full slurm.conf below>
>
> slurm.conf
> [root@vm-qmaster slurm]# more slurm.conf
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=vm-qmaster.exa.com
> ControlAddr=149.65.154.1
> #BackupController=
> #BackupAddr=
> #
> AuthType=auth/munge
> #CheckpointType=checkpoint/none
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> #Epilog=
> #EpilogSlurmctld=
> #FirstJobId=1
> #MaxJobId=999999
> #GresTypes=
> #GroupUpdateForce=0
> #GroupUpdateTime=600
> #JobCheckpointDir=/var/slurm/checkpoint
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> #JobFileAppend=0
> #JobRequeue=1
> #JobSubmitPlugins=1
> #KillOnBadExit=0
> #LaunchType=launch/slurm
> #Licenses=foo*4,bar
> #MailProg=/bin/mail
> #MaxJobCount=5000
> #MaxStepCount=40000
> #MaxTasksPerNode=128
> MpiDefault=none
> #MpiParams=ports=#-#
> #PluginDir=
> #PlugStackConfig=
> #PrivateData=jobs
> ProctrackType=proctrack/pgid
> #Prolog=
> #PrologFlags=
> #PrologSlurmctld=
> #PropagatePrioProcess=0
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> #RebootProgram=
> ReturnToService=1
> #SallocDefaultCommand=
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> #SrunEpilog=
> #SrunProlog=
> StateSaveLocation=/var/spool
> SwitchType=switch/none
> #TaskEpilog=
> TaskPlugin=task/none
> #TaskPluginParam=
> #TaskProlog=
> #TopologyPlugin=topology/tree
> #TmpFS=/tmp
> #TrackWCKey=no
> #TreeWidth=
> #UnkillableStepProgram=
> #UsePAM=0
> #
> #
> # TIMERS
> #BatchStartTimeout=10
> #CompleteWait=0
> #EpilogMsgTime=2000
> #GetEnvTimeout=2
> #HealthCheckInterval=0
> #HealthCheckProgram=
> InactiveLimit=0
> KillWait=30
> #MessageTimeout=10
> #ResvOverRun=0
> MinJobAge=300
> #OverTimeLimit=0
> SlurmctldTimeout=120
> SlurmdTimeout=300
> #UnkillableStepTimeout=60
> #VSizeFactor=0
> Waittime=0
> #
> #
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=1
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/linear
> #SelectTypeParameters=
> #
> #
> # JOB PRIORITY
> #PriorityFlags=
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityCalcPeriod=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> #JobCompLoc=
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/none
> #JobCompUser=
> #JobContainerType=job_container/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> #SlurmctldLogFile=
> SlurmdDebug=3
> #SlurmdLogFile=
> #SlurmSchedLogFile=
> #SlurmSchedLogLevel=
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> NodeName=vm-qexec[1-3] CPUs=4 RealMemory=1000 State=UNKNOWN
> PartitionName=normal Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
> PartitionName=batch Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP