Hi Danny,
I downloaded slurm-14.11.6.tar.bz2 from
http://www.schedmd.com/#repos and built the RPMs using rpmbuild -ta
slurm-14.11.6.tar.bz2. I then installed the RPMs on the controller as
well as the compute nodes.
The weird thing is that it works perfectly with select/linear.
Is there any way to turn on more debugging features? I currently have
debug level 9.
Thanks,
David
On 05/07/2015 08:47 PM, Danny Auble wrote:
How did you install? My guess is it isn't a full install like Moe
said. I would remove the PluginDir option since it will default to
where you configured it to be. Based on you pointing to /usr/lib64 as
the location on your one node I'm surprised it didn't work.
On May 7, 2015 8:13:35 PM PDT, David Lin <[email protected]> wrote:
Hi Danny,
No, that doesn't work:
starting slurmd: slurmd: error: Couldn't find the specified plugin name for select/cons_res looking at all files
slurmd: error: cannot find select plugin for select/cons_res
slurmd: fatal: Can't find plugin for select/cons_res
David
On 05/07/2015 07:39 PM, Danny Auble wrote:
What happens if you set
PluginDir=/usr/lib64
On May 7, 2015 6:10:19 PM PDT, David Lin <[email protected]>
wrote:
Hi Moe,
I do have the Slurm plugins installed, and I do see the file
/usr/lib64/select_cons_res.so
My slurm.conf also has PluginDir=/usr/lib64/slurm
I've pasted my full slurm.conf below just in case.
Thanks!
David
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=rsg-master
ControlAddr=171.64.74.213
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
PluginDir=/usr/lib64/slurm
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=0
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=9
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=9
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=rsg[4-7] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
NodeName=rsg[12-15] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
NodeName=rsg[16-31] State=UNKNOWN CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
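For anyone hitting the same "cannot find select plugin" error, one quick check is whether the file slurmd is looking for actually exists under PluginDir on each node. The Python sketch below does that under two assumptions: that a plugin such as select/cons_res is loaded from a file named select_cons_res.so, and that PluginDir may hold a colon-separated list of directories. The function names (parse_slurm_conf, plugin_filename, find_plugin) are made up for this example and are not part of Slurm.

```python
import os
import sys


def parse_slurm_conf(text):
    """Return simple Key=Value settings from slurm.conf text.

    Comments are dropped, keys are lowercased, and lines that carry
    several settings (for example NodeName=...) are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        if " " in value:  # NodeName-style line, not needed here
            continue
        settings[key.strip().lower()] = value
    return settings


def plugin_filename(plugin_type):
    """Map 'select/cons_res' to 'select_cons_res.so' (assumed naming)."""
    return plugin_type.replace("/", "_") + ".so"


def find_plugin(settings, key="selecttype"):
    """Return the expected file name and the PluginDir entries holding it."""
    name = plugin_filename(settings[key])
    dirs = settings.get("plugindir", "").split(":")
    hits = [d for d in dirs if d and os.path.isfile(os.path.join(d, name))]
    return name, hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_plugin.py /path/to/slurm.conf
    with open(sys.argv[1]) as fh:
        conf = parse_slurm_conf(fh.read())
    name, hits = find_plugin(conf)
    if hits:
        print(name, "found in", ", ".join(hits))
    else:
        print(name, "not found under PluginDir =", conf.get("plugindir"))
```

In this thread the file was reported at /usr/lib64/select_cons_res.so while slurm.conf sets PluginDir=/usr/lib64/slurm, so running a check like this on the controller and on a compute node would show whether the two actually line up.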
On 05/07/2015 05:59 PM, Moe Jette wrote:
It looks like you didn't install the RPM with Slurm plugins.
Quoting David Lin <[email protected]>:
Hello,
I am having some issues with the select/cons_res mode of slurm. When I
tried to execute a job such as srun -N 2 -n 2 hostname, I get this:
$ srun -N 2 -n 2 -q RHEL6 hostname
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
and on the slurmctld log, I see this:
[2015-05-07T16:52:43.264] error: we don't have select plugin type 102
[2015-05-07T16:52:43.264] error: select_g_select_jobinfo_unpack: unpack error
[2015-05-07T16:52:43.264] error: Malformed RPC of type REQUEST_RESOURCE_ALLOCATION(4001) received
[2015-05-07T16:52:43.264] error: slurm_receive_msg: Header lengths are longer than data received
[2015-05-07T16:52:43.274] error: slurm_receive_msg: Header lengths are longer than data received
All of the nodes as well as the controller running slurmctld have the
exact same slurm.conf, and I've included the relevant section below.
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=0
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
Is there some configuration I'm missing?
Thank you!
David
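The original message states that the controller and all nodes have the exact same slurm.conf. One way to verify that rather than assume it is to compare checksums of the copies. The Python sketch below assumes the copies have already been collected into local files (how they are fetched is left out), and the helper names sha256_of and group_by_digest are hypothetical.

```python
import hashlib
from collections import defaultdict


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(paths_by_host):
    """Group host names by the digest of their slurm.conf copy.

    paths_by_host maps a host name to the local path of that host's copy.
    More than one group in the result means the copies differ.
    """
    groups = defaultdict(list)
    for host, path in sorted(paths_by_host.items()):
        groups[sha256_of(path)].append(host)
    return dict(groups)
```

If group_by_digest returns a single group, the files are byte-identical. Any additional group lists the hosts whose copy has drifted.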