Hi Danny,
I downloaded slurm-14.11.6.tar.bz2 from http://www.schedmd.com/#repos and built the RPMs using rpmbuild -ta slurm-14.11.6.tar.bz2. I then installed the RPMs on the controller as well as the compute nodes.

The weird thing is that it works perfectly with select/linear.

Is there any way to turn on more debugging features? I currently have the debug level set to 9.

Thanks,
David


On 05/07/2015 08:47 PM, Danny Auble wrote:
How did you install? My guess is it isn't a full install, like Moe said. I would remove the PluginDir option, since it will default to where you configured it to be. Given that you pointed to /usr/lib64 as the location on your one node, I'm surprised it didn't work.
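
For anyone following along, the mismatch discussed here (where slurm.conf says the plugins are versus where the .so file actually sits) can be checked mechanically. Below is a minimal sketch in Python, not anything shipped with Slurm. It assumes the usual naming where a plugin such as select/cons_res is loaded from a file named select_cons_res.so inside one of the colon-separated PluginDir directories. The function names parse_slurm_conf and expected_plugin_paths are hypothetical helpers made up for this illustration, and the sample settings are the ones quoted further down in this thread.

```python
import os


def parse_slurm_conf(text):
    """Return a dict of Key=Value settings from slurm.conf text.

    Keys are lower-cased because slurm.conf parameter names are case
    insensitive. Comments (everything after '#') are dropped. Repeated
    keys such as NodeName simply overwrite each other, which is fine for
    the single-valued settings this sketch looks at.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def expected_plugin_paths(settings, key="selecttype"):
    """List the files a plugin setting such as SelectType should map to.

    Assumption: "select/cons_res" corresponds to "select_cons_res.so" in
    each directory listed in PluginDir. If PluginDir is not set, the
    compiled-in default applies and this sketch returns nothing.
    """
    plugin = settings.get(key)
    plugin_dir = settings.get("plugindir")
    if not plugin or not plugin_dir:
        return []
    filename = plugin.replace("/", "_") + ".so"
    return [d.rstrip("/") + "/" + filename for d in plugin_dir.split(":") if d]


# Settings as quoted in this thread; on a real node, read the local
# slurm.conf instead (its location depends on how Slurm was built).
sample = "PluginDir=/usr/lib64/slurm\nSelectType=select/cons_res\n"
for path in expected_plugin_paths(parse_slurm_conf(sample)):
    print(path, "found" if os.path.exists(path) else "MISSING")
```

Running this on the controller and on each compute node would show whether every host resolves the select plugin to a file that exists. It says nothing about whether the file matches the installed Slurm version.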

On May 7, 2015 8:13:35 PM PDT, David Lin <[email protected]> wrote:

    Hi Danny,
    No that doesn't work,

    starting slurmd: slurmd: error: Couldn't find the specified plugin name for select/cons_res looking at all files
    slurmd: error: cannot find select plugin for select/cons_res
    slurmd: fatal: Can't find plugin for select/cons_res

    David

    On 05/07/2015 07:39 PM, Danny Auble wrote:
    What happens if you set

    PluginDir=/usr/lib64



    On May 7, 2015 6:10:19 PM PDT, David Lin <[email protected]>
    wrote:

        Hi Moe,
        I do have the Slurm plugins installed, and I do see the file
        /usr/lib64/select_cons_res.so
        my slurm.conf also has PluginDir=/usr/lib64/slurm
        I've pasted my full slurm.conf below just in case.

        Thanks!
        David


        # slurm.conf file generated by configurator.html.
        # Put this file on all nodes of your cluster.
        # See the slurm.conf man page for more information.
        #
        ControlMachine=rsg-master
        ControlAddr=171.64.74.213
        #BackupController=
        #BackupAddr=
        #
        AuthType=auth/munge
        CacheGroups=0
        #CheckpointType=checkpoint/none
        CryptoType=crypto/munge
        #DisableRootJobs=NO
        #EnforcePartLimits=NO
        #Epilog=
        #EpilogSlurmctld=
        #FirstJobId=1
        #MaxJobId=999999
        #GresTypes=
        #GroupUpdateForce=0
        #GroupUpdateTime=600
        #JobCheckpointDir=/var/slurm/checkpoint
        #JobCredentialPrivateKey=
        #JobCredentialPublicCertificate=
        #JobFileAppend=0
        #JobRequeue=1
        #JobSubmitPlugins=1
        #KillOnBadExit=0
        #LaunchType=launch/slurm
        #Licenses=foo*4,bar
        #MailProg=/bin/mail
        #MaxJobCount=5000
        #MaxStepCount=40000
        #MaxTasksPerNode=128
        MpiDefault=none
        #MpiParams=ports=#-#
        PluginDir=/usr/lib64/slurm
        #PlugStackConfig=
        #PrivateData=jobs
        ProctrackType=proctrack/pgid
        #Prolog=
        #PrologFlags=
        #PrologSlurmctld=
        #PropagatePrioProcess=0
        #PropagateResourceLimits=
        #PropagateResourceLimitsExcept=
        #RebootProgram=
        ReturnToService=2
        #SallocDefaultCommand=
        SlurmctldPidFile=/var/run/slurmctld.pid
        SlurmctldPort=6817
        SlurmdPidFile=/var/run/slurmd.pid
        SlurmdPort=6818
        SlurmdSpoolDir=/var/spool/slurmd
        SlurmUser=slurm
        #SlurmdUser=root
        #SrunEpilog=
        #SrunProlog=
        StateSaveLocation=/var/spool
        SwitchType=switch/none
        #TaskEpilog=
        TaskPlugin=task/none
        #TaskPluginParam=
        #TaskProlog=
        #TopologyPlugin=topology/tree
        #TmpFS=/tmp
        #TrackWCKey=no
        #TreeWidth=
        #UnkillableStepProgram=
        #UsePAM=0
        #
        #
        # TIMERS
        #BatchStartTimeout=10
        #CompleteWait=0
        #EpilogMsgTime=2000
        #GetEnvTimeout=2
        #HealthCheckInterval=0
        #HealthCheckProgram=
        InactiveLimit=0
        KillWait=30
        #MessageTimeout=10
        #ResvOverRun=0
        MinJobAge=300
        #OverTimeLimit=0
        SlurmctldTimeout=120
        SlurmdTimeout=300
        #UnkillableStepTimeout=60
        #VSizeFactor=0
        Waittime=0
        #
        #
        # SCHEDULING
        #DefMemPerCPU=0
        FastSchedule=0
        #MaxMemPerCPU=0
        #SchedulerRootFilter=1
        #SchedulerTimeSlice=30
        SchedulerType=sched/backfill
        SchedulerPort=7321
        SelectType=select/cons_res
        SelectTypeParameters=CR_Core_Memory
        #
        #
        # JOB PRIORITY
        #PriorityFlags=
        #PriorityType=priority/basic
        #PriorityDecayHalfLife=
        #PriorityCalcPeriod=
        #PriorityFavorSmall=
        #PriorityMaxAge=
        #PriorityUsageResetPeriod=
        #PriorityWeightAge=
        #PriorityWeightFairshare=
        #PriorityWeightJobSize=
        #PriorityWeightPartition=
        #PriorityWeightQOS=
        #
        #
        # LOGGING AND ACCOUNTING
        #AccountingStorageEnforce=0
        #AccountingStorageHost=
        #AccountingStorageLoc=
        #AccountingStoragePass=
        #AccountingStoragePort=
        AccountingStorageType=accounting_storage/none
        #AccountingStorageUser=
        AccountingStoreJobComment=YES
        ClusterName=cluster
        #DebugFlags=
        #JobCompHost=
        #JobCompLoc=
        #JobCompPass=
        #JobCompPort=
        JobCompType=jobcomp/none
        #JobCompUser=
        #JobContainerType=job_container/none
        JobAcctGatherFrequency=30
        JobAcctGatherType=jobacct_gather/none
        SlurmctldDebug=9
        SlurmctldLogFile=/var/log/slurmctld.log
        SlurmdDebug=9
        SlurmdLogFile=/var/log/slurmd.log
        #SlurmSchedLogFile=
        #SlurmSchedLogLevel=
        #
        #
        # POWER SAVE SUPPORT FOR IDLE NODES (optional)
        #SuspendProgram=
        #ResumeProgram=
        #SuspendTimeout=
        #ResumeTimeout=
        #ResumeRate=
        #SuspendExcNodes=
        #SuspendExcParts=
        #SuspendRate=
        #SuspendTime=
        #
        #
        # COMPUTE NODES
        NodeName=rsg[4-7] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
        NodeName=rsg[12-15] State=UNKNOWN CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
        NodeName=rsg[16-31] State=UNKNOWN CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2





        On 05/07/2015 05:59 PM, Moe Jette wrote:

            It looks like you didn't install the RPM with Slurm
            plugins.

            Quoting David Lin <[email protected]>:

                Hello,
                I am having some issues with the select/cons_res mode of
                slurm. When I tried to execute a job such as
                srun -N 2 -n 2 hostname, I get this:

                $ srun -N 2 -n 2 -q RHEL6 hostname
                srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
                srun: error: Unable to allocate resources: Zero Bytes were transmitted or received

                and on the slurmctld log, I see this:

                [2015-05-07T16:52:43.264] error: we don't have select plugin type 102
                [2015-05-07T16:52:43.264] error: select_g_select_jobinfo_unpack: unpack error
                [2015-05-07T16:52:43.264] error: Malformed RPC of type REQUEST_RESOURCE_ALLOCATION(4001) received
                [2015-05-07T16:52:43.264] error: slurm_receive_msg: Header lengths are longer than data received
                [2015-05-07T16:52:43.274] error: slurm_receive_msg: Header lengths are longer than data received

                All of the nodes as well as the controller running slurmctld
                have the exact same slurm.conf, and I've included the
                relevant section below.

                # SCHEDULING
                #DefMemPerCPU=0
                FastSchedule=0
                #MaxMemPerCPU=0
                #SchedulerRootFilter=1
                #SchedulerTimeSlice=30
                SchedulerType=sched/backfill
                SchedulerPort=7321
                SelectType=select/cons_res
                SelectTypeParameters=CR_Core_Memory

                Is there some configuration I'm missing?

                Thank you!
                David


