Hello,

I'm trying to set up SLURM 15.08.1 on a single multi-core node to
manage multi-threaded jobs. The machine has 16 physical cores
(2 sockets with 8 cores each) with HyperThreading enabled. I'm using
the EASY scheduling algorithm with backfilling (sched/backfill). The
goal is to keep all available cores busy at all times.

Given three jobs that need 8, 2, and 4 cores respectively, I expect
them to be co-scheduled so that 14 of the 16 available cores are in
use. However, SLURM does not behave that way: it runs the latter two
jobs together but refuses to schedule the first job until they
finish. (Is this the expected behavior of the EASY-backfilling
algorithm?)
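To make the expectation concrete, here is a toy first-fit check in
bash. It assumes each job must fit within one socket (as the
--sockets-per-node=1 requests in the scripts below suggest) and it
ignores hyperthreads. It is only an illustration, not how
select/cons_res actually chooses cores, and the variable names are
made up. Note that the outcome depends on placement: in this order all
three jobs fit (8 on socket 0, then 2 + 4 on socket 1), whereas if the
2-core and 4-core jobs had landed on different sockets, neither socket
would have 8 free cores left for the 8-core job.

```shell
#!/bin/bash
# Toy model only: first-fit placement of whole jobs onto sockets.
# Not SLURM's select/cons_res logic; hyperthreads are ignored.
SOCKETS=2
CORES_PER_SOCKET=8
JOBS=(8 2 4)   # cores requested by job1, job2, job3

# free[s] = idle physical cores left on socket s
free=()
for ((s = 0; s < SOCKETS; s++)); do
  free[s]=$CORES_PER_SOCKET
done

placed_all=1
total=0
for need in "${JOBS[@]}"; do
  placed=0
  # Each job must fit entirely on one socket (--sockets-per-node=1).
  for ((s = 0; s < SOCKETS; s++)); do
    if (( free[s] >= need )); then
      free[s]=$(( free[s] - need ))
      total=$(( total + need ))
      echo "job needing $need cores -> socket $s"
      placed=1
      break
    fi
  done
  if (( placed == 0 )); then
    echo "job needing $need cores does not fit"
    placed_all=0
  fi
done

echo "cores in use: $total of $(( SOCKETS * CORES_PER_SOCKET ))"
```

Run as-is, it places the 8-core job on socket 0, both smaller jobs on
socket 1, and reports 14 of 16 cores in use.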

Here's the list of jobs:

  $ cat job1.batch

    #!/bin/bash
    #SBATCH --sockets-per-node=1
    #SBATCH --cores-per-socket=8
    #SBATCH --threads-per-core=1
    srun /path/to/application1
  
  $ cat job2.batch
  
    #!/bin/bash
    #SBATCH --sockets-per-node=1
    #SBATCH --cores-per-socket=2
    #SBATCH --threads-per-core=1
    srun /path/to/application2
  
  $ cat job3.batch
  
    #!/bin/bash
    #SBATCH --sockets-per-node=1
    #SBATCH --cores-per-socket=4
    #SBATCH --threads-per-core=1
    srun /path/to/application3

Here's my SLURM config:

  $ cat /path/to/slurm.conf

    ControlMachine=localhost
    ControlAddr=127.0.0.1
    AuthType=auth/none
    CacheGroups=0
    CryptoType=crypto/munge
    MpiDefault=none
    ProctrackType=proctrack/linuxproc
    ReturnToService=1
    SlurmctldPidFile=/path/to/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/path/to/pids/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/path/to/slurmdspooldir
    SlurmUser=myuserid
    SlurmdUser=myuserid
    StateSaveLocation=/path/to/states
    SwitchType=switch/none
    InactiveLimit=0
    KillWait=30
    MinJobAge=300
    SlurmctldTimeout=120
    SlurmdTimeout=300
    Waittime=0
    FastSchedule=1
    SchedulerType=sched/backfill
    SchedulerPort=7321
    SelectType=select/cons_res
    SelectTypeParameters=CR_CORE
    AccountingStorageLoc=/path/to/accounting.log
    AccountingStorageType=accounting_storage/filetxt
    AccountingStoreJobComment=YES
    ClusterName=cluster
    JobCompLoc=/path/to/completion.log
    JobCompType=jobcomp/filetxt
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/linux
    SlurmctldDebug=3
    SlurmdDebug=3
    NodeName=localhost NodeAddr=127.0.0.1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
    PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
    DebugFlags=Backfill,CPU_Bind,Priority,SelectType
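For reference, the NodeName line above describes 2 x 8 = 16 physical
cores and, with ThreadsPerCore=2, 32 logical CPUs (when CPUs= is
omitted, SLURM defaults it to the product of sockets, cores per socket
and threads per core). With SelectTypeParameters=CR_CORE the core is
the consumable unit, so separate jobs are not given hyperthreads of the
same core. A small bash sketch that pulls those values out of the node
line; the line is pasted inline here as an assumption rather than read
from the real slurm.conf:

```shell
#!/bin/bash
# Node definition copied from slurm.conf (inlined for illustration).
node_line='NodeName=localhost NodeAddr=127.0.0.1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN'

# Split on whitespace and pick out the topology keys.
for kv in $node_line; do
  case $kv in
    Sockets=*)        sockets=${kv#*=} ;;
    CoresPerSocket=*) cores_per_socket=${kv#*=} ;;
    ThreadsPerCore=*) threads_per_core=${kv#*=} ;;
  esac
done

cores=$(( sockets * cores_per_socket ))   # physical cores (CR_CORE units)
cpus=$(( cores * threads_per_core ))      # logical CPUs SLURM counts
echo "physical cores: $cores"
echo "logical CPUs:   $cpus"
```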

I'm a SLURM newbie, so I might be missing something obvious. I'd
appreciate any help.

Thanks,
-Rohan
