What happens if you set PluginDir=/usr/lib64?
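A quick way to test that suggestion (a sketch only; it reads a conf fragment from a temp file rather than the live /etc/slurm/slurm.conf, and the paths are the ones quoted in this thread):

```shell
# Extract PluginDir from a slurm.conf-style file and report it, so it can
# be compared against where select_cons_res.so actually lives.
conf=$(mktemp)
printf 'SelectType=select/cons_res\nPluginDir=/usr/lib64/slurm\n' > "$conf"

plugindir=$(sed -n 's/^PluginDir=//p' "$conf")
echo "PluginDir is $plugindir"

# On the real controller you would then verify the plugin is present:
#   ls "$plugindir"/select_cons_res.so
# If the .so sits in /usr/lib64 instead, point PluginDir there.
rm -f "$conf"
```

If the file the controller actually loads disagrees with the directory the plugins were installed into, slurmctld cannot find select/cons_res even though the package is installed.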
On May 7, 2015 6:10:19 PM PDT, David Lin <[email protected]> wrote:
> Hi Moe,
> I do have the Slurm plugins installed, and I do see the file
> /usr/lib64/select_cons_res.so
> my slurm.conf also has PluginDir=/usr/lib64/slurm
> I've pasted my full slurm.conf below just in case.
>
> Thanks!
> David
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=rsg-master
> ControlAddr=171.64.74.213
> #BackupController=
> #BackupAddr=
> #
> AuthType=auth/munge
> CacheGroups=0
> #CheckpointType=checkpoint/none
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> #Epilog=
> #EpilogSlurmctld=
> #FirstJobId=1
> #MaxJobId=999999
> #GresTypes=
> #GroupUpdateForce=0
> #GroupUpdateTime=600
> #JobCheckpointDir=/var/slurm/checkpoint
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> #JobFileAppend=0
> #JobRequeue=1
> #JobSubmitPlugins=1
> #KillOnBadExit=0
> #LaunchType=launch/slurm
> #Licenses=foo*4,bar
> #MailProg=/bin/mail
> #MaxJobCount=5000
> #MaxStepCount=40000
> #MaxTasksPerNode=128
> MpiDefault=none
> #MpiParams=ports=#-#
> PluginDir=/usr/lib64/slurm
> #PlugStackConfig=
> #PrivateData=jobs
> ProctrackType=proctrack/pgid
> #Prolog=
> #PrologFlags=
> #PrologSlurmctld=
> #PropagatePrioProcess=0
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> #RebootProgram=
> ReturnToService=2
> #SallocDefaultCommand=
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> #SrunEpilog=
> #SrunProlog=
> StateSaveLocation=/var/spool
> SwitchType=switch/none
> #TaskEpilog=
> TaskPlugin=task/none
> #TaskPluginParam=
> #TaskProlog=
> #TopologyPlugin=topology/tree
> #TmpFS=/tmp
> #TrackWCKey=no
> #TreeWidth=
> #UnkillableStepProgram=
> #UsePAM=0
> #
> #
> # TIMERS
> #BatchStartTimeout=10
> #CompleteWait=0
> #EpilogMsgTime=2000
> #GetEnvTimeout=2
> #HealthCheckInterval=0
> #HealthCheckProgram=
> InactiveLimit=0
> KillWait=30
> #MessageTimeout=10
> #ResvOverRun=0
> MinJobAge=300
> #OverTimeLimit=0
> SlurmctldTimeout=120
> SlurmdTimeout=300
> #UnkillableStepTimeout=60
> #VSizeFactor=0
> Waittime=0
> #
> #
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=0
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> #
> #
> # JOB PRIORITY
> #PriorityFlags=
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityCalcPeriod=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> #JobCompLoc=
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/none
> #JobCompUser=
> #JobContainerType=job_container/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=9
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=9
> SlurmdLogFile=/var/log/slurmd.log
> #SlurmSchedLogFile=
> #SlurmSchedLogLevel=
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> NodeName=rsg[4-7] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
> NodeName=rsg[12-15] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
> NodeName=rsg[16-31] State=UNKNOWN CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
>
> On 05/07/2015 05:59 PM, Moe Jette wrote:
>> It looks like you didn't install the RPM with Slurm plugins.
>>
>> Quoting David Lin <[email protected]>:
>>> Hello,
>>>
>>> I am having some issues with the select/cons_res mode of slurm. When I
>>> tried to execute a job such as srun -N 2 -n 2 hostname, I get this
>>>
>>> $ srun -N 2 -n 2 -q RHEL6 hostname
>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>> srun: error: Unable to allocate resources: Zero Bytes were
>>> transmitted or received
>>>
>>> and on the slurmctld log, I see this
>>>
>>> [2015-05-07T16:52:43.264] error: we don't have select plugin type 102
>>> [2015-05-07T16:52:43.264] error: select_g_select_jobinfo_unpack: unpack error
>>> [2015-05-07T16:52:43.264] error: Malformed RPC of type
>>> REQUEST_RESOURCE_ALLOCATION(4001) received
>>> [2015-05-07T16:52:43.264] error: slurm_receive_msg: Header lengths
>>> are longer than data received
>>> [2015-05-07T16:52:43.274] error: slurm_receive_msg: Header lengths
>>> are longer than data received
>>>
>>> All of the nodes as well as the controller running slurmctld have the
>>> exact same slurm.conf, and I've included the relevant section below.
>>>
>>> # SCHEDULING
>>> #DefMemPerCPU=0
>>> FastSchedule=0
>>> #MaxMemPerCPU=0
>>> #SchedulerRootFilter=1
>>> #SchedulerTimeSlice=30
>>> SchedulerType=sched/backfill
>>> SchedulerPort=7321
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_Core_Memory
>>>
>>> Is there some configuration I'm missing?
>>>
>>> Thank you!
>>> David
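As an aside, the NodeName definitions in the quoted slurm.conf can be sanity-checked arithmetically: the declared CPUs value should equal Sockets * CoresPerSocket * ThreadsPerCore. A small sketch using the figures from this thread (the `check` helper is illustrative, not a Slurm tool):

```shell
# Verify that the advertised CPU count matches the declared topology:
#   CPUs = Sockets * CoresPerSocket * ThreadsPerCore
check() {
  cpus=$(( $1 * $2 * $3 ))
  echo "Sockets=$1 CoresPerSocket=$2 ThreadsPerCore=$3 -> CPUs=$cpus (conf declares $4)"
}
check 2 6 2 24   # rsg[4-7] and rsg[12-15]
check 2 8 2 32   # rsg[16-31]
```

Both node groups check out here (2*6*2 = 24 and 2*8*2 = 32), so the topology is not the cause of the "select plugin type" error; the plugin path is the more likely culprit.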
