Yes, I did give that a try, though it didn’t seem to make any difference to the error messages I got.
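For anyone following along, here is roughly what the change suggested later in this thread would look like in our slurm.conf. This is a sketch I have not yet verified on our cluster, so treat the exact parameter choices as assumptions:

```
# Replace exclusive whole-node allocation with consumable resources,
# so several tasks/jobs can share a node's four CPUs:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# If memory is tracked per allocation, RealMemory should not exceed what
# "slurmd -C" reports (995 MB here, vs. the 1000 currently configured):
NodeName=vm-qexec[1-3] CPUs=4 RealMemory=995 State=UNKNOWN
```

The edited file would need to be copied to all nodes; I believe a SelectType change requires restarting slurmctld and slurmd rather than just running "scontrol reconfigure".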
slurmd -C on our three compute nodes shows 4 CPUs each, as expected, since each VM has four CPUs. Here’s the result from one:

[root@vm-qexec1 ~]# slurmd -C
ClusterName=cluster
NodeName=vm-qexec1 CPUs=4 Boards=1 SocketsPerBoard=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=995 TmpDisk=8651

Thanks for the help!

> On May 19, 2017, at 5:11 AM, Adam Huffman <adam.huff...@gmail.com> wrote:
>
> Hello
>
> While this doesn't necessarily address the strange sinfo output, to be able
> to run more than one job per node you need to change this:
>
> SelectType=select/linear
>
> to
>
> SelectType=select/cons_res
>
> and then choose which parameters you want, e.g.
>
> SelectTypeParameters=CR_Core_Memory
>
> See https://slurm.schedmd.com/cons_res.html
>
> Cheers,
> Adam
>
> On Fri, May 19, 2017 at 9:23 AM, Ade Fewings <ade.fewi...@hpcwales.co.uk> wrote:
>
> Not sure if it's going to answer or not, but a good way to check that the
> Slurm daemons are seeing the CPUs would be to run 'slurmd -C' on the
> compute nodes to print the hardware configuration they see:
>
> [root@ssc001 ~]# su -c "slurmd -C" slurm
> NodeName=ssc001 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64403 TmpDisk=384659
> UpTime=7-12:12:53
>
> Perhaps that will at least help narrow down the search window.
>
> ~~
> Ade
>
> From: Jeff Avila <j...@exa.com>
> Sent: 18 May 2017 21:17:29
> To: slurm-dev
> Subject: [slurm-dev] discrepancy between node config and # of cpus found
>
> Good Morning All,
>
> We have a small (one control node, three compute node) SLURM test cluster.
> The three compute nodes are KVM virtual machines, each with four
> processors. We’re having trouble starting jobs using more than one task
> per node; in fact, running a job with more than three tasks seems to be
> impossible.
>
> What I think should be the salient part of slurm.conf (the entirety of the
> file is appended to the end of this email) is shown here:
>
> NodeName=vm-qexec[1-3] CPUs=4 RealMemory=1000 State=UNKNOWN
> PartitionName=normal Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
> PartitionName=batch Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
>
> "cat /proc/cpuinfo | grep processor | wc -l" yields “4” on each of the
> qexec nodes, so I’m a little confused by the output I get from sinfo,
> although it does seem to explain the behavior I’m seeing:
>
> [root@vm-qmaster ~]# sinfo -Nle
> Thu May 18 15:15:01 2017
> NODELIST   NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> vm-qexec1  1     batch*    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec1  1     normal    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec2  1     batch*    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec2  1     normal    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec3  1     batch*    idle  1    1:1:1 1      0        1      (null)   none
> vm-qexec3  1     normal    idle  1    1:1:1 1      0        1      (null)   none
> [root@vm-qmaster ~]#
>
> How, exactly, do I get SLURM to recognise the rest of the CPUs I have on
> each node?
>
> Thanks!
>
> -Jeff
>
> <full slurm.conf below>
>
> slurm.conf
> [root@vm-qmaster slurm]# more slurm.conf
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=vm-qmaster.exa.com
> ControlAddr=149.65.154.1
> #BackupController=
> #BackupAddr=
> #
> AuthType=auth/munge
> #CheckpointType=checkpoint/none
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> #Epilog=
> #EpilogSlurmctld=
> #FirstJobId=1
> #MaxJobId=999999
> #GresTypes=
> #GroupUpdateForce=0
> #GroupUpdateTime=600
> #JobCheckpointDir=/var/slurm/checkpoint
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> #JobFileAppend=0
> #JobRequeue=1
> #JobSubmitPlugins=1
> #KillOnBadExit=0
> #LaunchType=launch/slurm
> #Licenses=foo*4,bar
> #MailProg=/bin/mail
> #MaxJobCount=5000
> #MaxStepCount=40000
> #MaxTasksPerNode=128
> MpiDefault=none
> #MpiParams=ports=#-#
> #PluginDir=
> #PlugStackConfig=
> #PrivateData=jobs
> ProctrackType=proctrack/pgid
> #Prolog=
> #PrologFlags=
> #PrologSlurmctld=
> #PropagatePrioProcess=0
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> #RebootProgram=
> ReturnToService=1
> #SallocDefaultCommand=
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> #SrunEpilog=
> #SrunProlog=
> StateSaveLocation=/var/spool
> SwitchType=switch/none
> #TaskEpilog=
> TaskPlugin=task/none
> #TaskPluginParam=
> #TaskProlog=
> #TopologyPlugin=topology/tree
> #TmpFS=/tmp
> #TrackWCKey=no
> #TreeWidth=
> #UnkillableStepProgram=
> #UsePAM=0
> #
> #
> # TIMERS
> #BatchStartTimeout=10
> #CompleteWait=0
> #EpilogMsgTime=2000
> #GetEnvTimeout=2
> #HealthCheckInterval=0
> #HealthCheckProgram=
> InactiveLimit=0
> KillWait=30
> #MessageTimeout=10
> #ResvOverRun=0
> MinJobAge=300
> #OverTimeLimit=0
> SlurmctldTimeout=120
> SlurmdTimeout=300
> #UnkillableStepTimeout=60
> #VSizeFactor=0
> Waittime=0
> #
> #
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=1
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/linear
> #SelectTypeParameters=
> #
> #
> # JOB PRIORITY
> #PriorityFlags=
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityCalcPeriod=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> #JobCompLoc=
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/none
> #JobCompUser=
> #JobContainerType=job_container/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> #SlurmctldLogFile=
> SlurmdDebug=3
> #SlurmdLogFile=
> #SlurmSchedLogFile=
> #SlurmSchedLogLevel=
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> NodeName=vm-qexec[1-3] CPUs=4 RealMemory=1000 State=UNKNOWN
> PartitionName=normal Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP
> PartitionName=batch Nodes=vm-qexec[1-3] Default=YES MaxTime=INFINITE State=UP