I don't think you'll get more logging info. The configuration looks fine now; 
this seems like a build/install issue. I'd check file dates for consistency (in 
case any old files are lingering), file permissions, and the relevant bits of 
the slurmctld log when you start it up.
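Those checks can be sketched as a quick shell pass (a minimal sketch; the directory and `check_plugin` helper name are assumptions, and the path should match the PluginDir value in your slurm.conf):

```shell
# Verify the select/cons_res plugin file is present and readable in the
# configured PluginDir (adjust the path to match your slurm.conf).
check_plugin() {
    dir="$1"
    if [ -r "$dir/select_cons_res.so" ]; then
        echo "found: $dir/select_cons_res.so"
    else
        echo "MISSING: $dir/select_cons_res.so"
    fi
}

check_plugin /usr/lib64/slurm
# Also worth comparing build dates to spot stale files, e.g.:
#   ls -lt /usr/lib64/slurm/*.so | head
#   slurmd -V; slurmctld -V    # confirm both report the new version
```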

On May 7, 2015 8:53:16 PM PDT, David Lin <[email protected]> wrote:
>Hi Danny,
>I downloaded slurm-14.11.6.tar.bz2 from
>http://www.schedmd.com/#repos, and built the RPMs using rpmbuild -ta 
>slurm-14.11.6.tar.bz2.  Then installed the RPMS on the controller as 
>well as the compute nodes.
>
>The weird thing is that it works perfectly with select/linear.
>
>Is there any way to turn on more debugging output?  I currently have 
>debug level 9.
>
>Thanks,
>David
>
>
>On 05/07/2015 08:47 PM, Danny Auble wrote:
>> How did you install? My guess is it isn't a full install, like Moe 
>> said. I would remove the PluginDir option, since it will default to 
>> where you configured it to be. Given that you pointed to /usr/lib64 
>> as the location on your one node, I'm surprised it didn't work.
>>
>> On May 7, 2015 8:13:35 PM PDT, David Lin <[email protected]>
>wrote:
>>
>>     Hi Danny,
>>     No that doesn't work,
>>
>>     starting slurmd: slurmd: error: Couldn't find the specified plugin
>>     name for select/cons_res looking at all files
>>     slurmd: error: cannot find select plugin for select/cons_res
>>     slurmd: fatal: Can't find plugin for select/cons_res
>>
>>     David
>>
>>     On 05/07/2015 07:39 PM, Danny Auble wrote:
>>>     What happens if you set
>>>
>>>     PluginDir=/usr/lib64
>>>
>>>
>>>
>>>     On May 7, 2015 6:10:19 PM PDT, David Lin <[email protected]>
>>>     wrote:
>>>
>>>         Hi Moe,
>>>         I do have the Slurm plugins installed, and I do see the file
>>>         /usr/lib64/select_cons_res.so
>>>         my slurm.conf also has PluginDir=/usr/lib64/slurm
>>>         I've pasted my full slurm.conf below just in case.
>>>
>>>         Thanks!
>>>         David
>>>
>>>
>>>         # slurm.conf file generated by configurator.html.
>>>         # Put this file on all nodes of your cluster.
>>>         # See the slurm.conf man page for more information.
>>>         #
>>>         ControlMachine=rsg-master
>>>         ControlAddr=171.64.74.213
>>>         #BackupController=
>>>         #BackupAddr=
>>>         #
>>>         AuthType=auth/munge
>>>         CacheGroups=0
>>>         #CheckpointType=checkpoint/none
>>>         CryptoType=crypto/munge
>>>         #DisableRootJobs=NO
>>>         #EnforcePartLimits=NO
>>>         #Epilog=
>>>         #EpilogSlurmctld=
>>>         #FirstJobId=1
>>>         #MaxJobId=999999
>>>         #GresTypes=
>>>         #GroupUpdateForce=0
>>>         #GroupUpdateTime=600
>>>         #JobCheckpointDir=/var/slurm/checkpoint
>>>         #JobCredentialPrivateKey=
>>>         #JobCredentialPublicCertificate=
>>>         #JobFileAppend=0
>>>         #JobRequeue=1
>>>         #JobSubmitPlugins=1
>>>         #KillOnBadExit=0
>>>         #LaunchType=launch/slurm
>>>         #Licenses=foo*4,bar
>>>         #MailProg=/bin/mail
>>>         #MaxJobCount=5000
>>>         #MaxStepCount=40000
>>>         #MaxTasksPerNode=128
>>>         MpiDefault=none
>>>         #MpiParams=ports=#-#
>>>         PluginDir=/usr/lib64/slurm
>>>         #PlugStackConfig=
>>>         #PrivateData=jobs
>>>         ProctrackType=proctrack/pgid
>>>         #Prolog=
>>>         #PrologFlags=
>>>         #PrologSlurmctld=
>>>         #PropagatePrioProcess=0
>>>         #PropagateResourceLimits=
>>>         #PropagateResourceLimitsExcept=
>>>         #RebootProgram=
>>>         ReturnToService=2
>>>         #SallocDefaultCommand=
>>>         SlurmctldPidFile=/var/run/slurmctld.pid
>>>         SlurmctldPort=6817
>>>         SlurmdPidFile=/var/run/slurmd.pid
>>>         SlurmdPort=6818
>>>         SlurmdSpoolDir=/var/spool/slurmd
>>>         SlurmUser=slurm
>>>         #SlurmdUser=root
>>>         #SrunEpilog=
>>>         #SrunProlog=
>>>         StateSaveLocation=/var/spool
>>>         SwitchType=switch/none
>>>         #TaskEpilog=
>>>         TaskPlugin=task/none
>>>         #TaskPluginParam=
>>>         #TaskProlog=
>>>         #TopologyPlugin=topology/tree
>>>         #TmpFS=/tmp
>>>         #TrackWCKey=no
>>>         #TreeWidth=
>>>         #UnkillableStepProgram=
>>>         #UsePAM=0
>>>         #
>>>         #
>>>         # TIMERS
>>>         #BatchStartTimeout=10
>>>         #CompleteWait=0
>>>         #EpilogMsgTime=2000
>>>         #GetEnvTimeout=2
>>>         #HealthCheckInterval=0
>>>         #HealthCheckProgram=
>>>         InactiveLimit=0
>>>         KillWait=30
>>>         #MessageTimeout=10
>>>         #ResvOverRun=0
>>>         MinJobAge=300
>>>         #OverTimeLimit=0
>>>         SlurmctldTimeout=120
>>>         SlurmdTimeout=300
>>>         #UnkillableStepTimeout=60
>>>         #VSizeFactor=0
>>>         Waittime=0
>>>         #
>>>         #
>>>         # SCHEDULING
>>>         #DefMemPerCPU=0
>>>         FastSchedule=0
>>>         #MaxMemPerCPU=0
>>>         #SchedulerRootFilter=1
>>>         #SchedulerTimeSlice=30
>>>         SchedulerType=sched/backfill
>>>         SchedulerPort=7321
>>>         SelectType=select/cons_res
>>>         SelectTypeParameters=CR_Core_Memory
>>>         #
>>>         #
>>>         # JOB PRIORITY
>>>         #PriorityFlags=
>>>         #PriorityType=priority/basic
>>>         #PriorityDecayHalfLife=
>>>         #PriorityCalcPeriod=
>>>         #PriorityFavorSmall=
>>>         #PriorityMaxAge=
>>>         #PriorityUsageResetPeriod=
>>>         #PriorityWeightAge=
>>>         #PriorityWeightFairshare=
>>>         #PriorityWeightJobSize=
>>>         #PriorityWeightPartition=
>>>         #PriorityWeightQOS=
>>>         #
>>>         #
>>>         # LOGGING AND ACCOUNTING
>>>         #AccountingStorageEnforce=0
>>>         #AccountingStorageHost=
>>>         #AccountingStorageLoc=
>>>         #AccountingStoragePass=
>>>         #AccountingStoragePort=
>>>         AccountingStorageType=accounting_storage/none
>>>         #AccountingStorageUser=
>>>         AccountingStoreJobComment=YES
>>>         ClusterName=cluster
>>>         #DebugFlags=
>>>         #JobCompHost=
>>>         #JobCompLoc=
>>>         #JobCompPass=
>>>         #JobCompPort=
>>>         JobCompType=jobcomp/none
>>>         #JobCompUser=
>>>         #JobContainerType=job_container/none
>>>         JobAcctGatherFrequency=30
>>>         JobAcctGatherType=jobacct_gather/none
>>>         SlurmctldDebug=9
>>>         SlurmctldLogFile=/var/log/slurmctld.log
>>>         SlurmdDebug=9
>>>         SlurmdLogFile=/var/log/slurmd.log
>>>         #SlurmSchedLogFile=
>>>         #SlurmSchedLogLevel=
>>>         #
>>>         #
>>>         # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>>         #SuspendProgram=
>>>         #ResumeProgram=
>>>         #SuspendTimeout=
>>>         #ResumeTimeout=
>>>         #ResumeRate=
>>>         #SuspendExcNodes=
>>>         #SuspendExcParts=
>>>         #SuspendRate=
>>>         #SuspendTime=
>>>         #
>>>         #
>>>         # COMPUTE NODES
>>>         NodeName=rsg[4-7] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
>>>         NodeName=rsg[12-15] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
>>>         NodeName=rsg[16-31] State=UNKNOWN CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
>>>
>>>
>>>
>>>
>>>
>>>         On 05/07/2015 05:59 PM, Moe Jette wrote:
>>>
>>>             It looks like you didn't install the RPM with Slurm
>>>             plugins.
>>>
>>>             Quoting David Lin <[email protected]>:
>>>
>>>                 Hello, I am having some issues with the
>>>                 select/cons_res mode of slurm. When I tried to
>>>                 execute a job such as srun -N 2 -n 2 hostname, I get
>>>                 this:
>>>
>>>                 $ srun -N 2 -n 2 -q RHEL6 hostname
>>>                 srun: error: slurm_receive_msg: Zero Bytes were
>>>                 transmitted or received
>>>                 srun: error: Unable to allocate resources: Zero
>>>                 Bytes were transmitted or received
>>>
>>>                 and in the slurmctld log, I see this:
>>>
>>>                 [2015-05-07T16:52:43.264] error: we don't have select plugin type 102
>>>                 [2015-05-07T16:52:43.264] error: select_g_select_jobinfo_unpack: unpack error
>>>                 [2015-05-07T16:52:43.264] error: Malformed RPC of type REQUEST_RESOURCE_ALLOCATION(4001) received
>>>                 [2015-05-07T16:52:43.264] error: slurm_receive_msg: Header lengths are longer than data received
>>>                 [2015-05-07T16:52:43.274] error: slurm_receive_msg: Header lengths are longer than data received
>>>
>>>                 All of the nodes as well as the controller running
>>>                 slurmctld have the exact same slurm.conf, and I've
>>>                 included the relevant section below.
>>>
>>>                 # SCHEDULING
>>>                 #DefMemPerCPU=0
>>>                 FastSchedule=0
>>>                 #MaxMemPerCPU=0
>>>                 #SchedulerRootFilter=1
>>>                 #SchedulerTimeSlice=30
>>>                 SchedulerType=sched/backfill
>>>                 SchedulerPort=7321
>>>                 SelectType=select/cons_res
>>>                 SelectTypeParameters=CR_Core_Memory
>>>
>>>                 Is there some configuration I'm missing? Thank you!
>>>                 David
>>>
>>>
>>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.