What happens if you set PluginDir=/usr/lib64?
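A quick way to test that suggestion (a sketch only; it reads a conf fragment from a temp file rather than the live /etc/slurm/slurm.conf, and the paths are the ones quoted in this thread):

```shell
# Extract PluginDir from a slurm.conf-style file and report it, so it can
# be compared against where select_cons_res.so actually lives.
conf=$(mktemp)
printf 'SelectType=select/cons_res\nPluginDir=/usr/lib64/slurm\n' > "$conf"

plugindir=$(sed -n 's/^PluginDir=//p' "$conf")
echo "PluginDir is $plugindir"

# On the real controller you would then verify the plugin is present:
#   ls "$plugindir"/select_cons_res.so
# If the .so sits in /usr/lib64 instead, point PluginDir there.
rm -f "$conf"
```

If the file the controller actually loads disagrees with the directory the plugins were installed into, slurmctld cannot find select/cons_res even though the package is installed.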
On May 7, 2015 6:10:19 PM PDT, David Lin <[email protected]> wrote:
> Hi Moe,
> I do have the Slurm plugins installed, and I do see the file
> /usr/lib64/select_cons_res.so
> my slurm.conf also has PluginDir=/usr/lib64/slurm
> I've pasted my full slurm.conf below just in case.
>
> Thanks!
> David
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=rsg-master
> ControlAddr=171.64.74.213
> #BackupController=
> #BackupAddr=
> #
> AuthType=auth/munge
> CacheGroups=0
> #CheckpointType=checkpoint/none
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> #Epilog=
> #EpilogSlurmctld=
> #FirstJobId=1
> #MaxJobId=999999
> #GresTypes=
> #GroupUpdateForce=0
> #GroupUpdateTime=600
> #JobCheckpointDir=/var/slurm/checkpoint
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> #JobFileAppend=0
> #JobRequeue=1
> #JobSubmitPlugins=1
> #KillOnBadExit=0
> #LaunchType=launch/slurm
> #Licenses=foo*4,bar
> #MailProg=/bin/mail
> #MaxJobCount=5000
> #MaxStepCount=40000
> #MaxTasksPerNode=128
> MpiDefault=none
> #MpiParams=ports=#-#
> PluginDir=/usr/lib64/slurm
> #PlugStackConfig=
> #PrivateData=jobs
> ProctrackType=proctrack/pgid
> #Prolog=
> #PrologFlags=
> #PrologSlurmctld=
> #PropagatePrioProcess=0
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> #RebootProgram=
> ReturnToService=2
> #SallocDefaultCommand=
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> #SrunEpilog=
> #SrunProlog=
> StateSaveLocation=/var/spool
> SwitchType=switch/none
> #TaskEpilog=
> TaskPlugin=task/none
> #TaskPluginParam=
> #TaskProlog=
> #TopologyPlugin=topology/tree
> #TmpFS=/tmp
> #TrackWCKey=no
> #TreeWidth=
> #UnkillableStepProgram=
> #UsePAM=0
> #
> #
> # TIMERS
> #BatchStartTimeout=10
> #CompleteWait=0
> #EpilogMsgTime=2000
> #GetEnvTimeout=2
> #HealthCheckInterval=0
> #HealthCheckProgram=
> InactiveLimit=0
> KillWait=30
> #MessageTimeout=10
> #ResvOverRun=0
> MinJobAge=300
> #OverTimeLimit=0
> SlurmctldTimeout=120
> SlurmdTimeout=300
> #UnkillableStepTimeout=60
> #VSizeFactor=0
> Waittime=0
> #
> #
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=0
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> #
> #
> # JOB PRIORITY
> #PriorityFlags=
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityCalcPeriod=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> #JobCompLoc=
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/none
> #JobCompUser=
> #JobContainerType=job_container/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=9
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=9
> SlurmdLogFile=/var/log/slurmd.log
> #SlurmSchedLogFile=
> #SlurmSchedLogLevel=
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> NodeName=rsg[4-7] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
> NodeName=rsg[12-15] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
> NodeName=rsg[16-31] State=UNKNOWN CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
>
> On 05/07/2015 05:59 PM, Moe Jette wrote:
>> It looks like you didn't install the RPM with Slurm plugins.
>>
>> Quoting David Lin <[email protected]>:
>>> Hello,
>>>
>>> I am having some issues with the select/cons_res mode of slurm. When I
>>> tried to execute a job such as srun -N 2 -n 2 hostname, I get this
>>>
>>> $ srun -N 2 -n 2 -q RHEL6 hostname
>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>> srun: error: Unable to allocate resources: Zero Bytes were
>>> transmitted or received
>>>
>>> and on the slurmctld log, I see this
>>>
>>> [2015-05-07T16:52:43.264] error: we don't have select plugin type 102
>>> [2015-05-07T16:52:43.264] error: select_g_select_jobinfo_unpack: unpack error
>>> [2015-05-07T16:52:43.264] error: Malformed RPC of type
>>> REQUEST_RESOURCE_ALLOCATION(4001) received
>>> [2015-05-07T16:52:43.264] error: slurm_receive_msg: Header lengths
>>> are longer than data received
>>> [2015-05-07T16:52:43.274] error: slurm_receive_msg: Header lengths
>>> are longer than data received
>>>
>>> All of the nodes as well as the controller running slurmctld have the
>>> exact same slurm.conf, and I've included the relevant section below.
>>>
>>> # SCHEDULING
>>> #DefMemPerCPU=0
>>> FastSchedule=0
>>> #MaxMemPerCPU=0
>>> #SchedulerRootFilter=1
>>> #SchedulerTimeSlice=30
>>> SchedulerType=sched/backfill
>>> SchedulerPort=7321
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_Core_Memory
>>>
>>> Is there some configuration I'm missing?
>>>
>>> Thank you!
>>> David
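As an aside, the NodeName definitions in the quoted slurm.conf can be sanity-checked arithmetically: the declared CPUs value should equal Sockets * CoresPerSocket * ThreadsPerCore. A small sketch using the figures from this thread (the `check` helper is illustrative, not a Slurm tool):

```shell
# Verify that the advertised CPU count matches the declared topology:
#   CPUs = Sockets * CoresPerSocket * ThreadsPerCore
check() {
  cpus=$(( $1 * $2 * $3 ))
  echo "Sockets=$1 CoresPerSocket=$2 ThreadsPerCore=$3 -> CPUs=$cpus (conf declares $4)"
}
check 2 6 2 24   # rsg[4-7] and rsg[12-15]
check 2 8 2 32   # rsg[16-31]
```

Both node groups check out here (2*6*2 = 24 and 2*8*2 = 32), so the topology is not the cause of the "select plugin type" error; the plugin path is the more likely culprit.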
