I believe it is still the case, but I haven't tested it.  I put this in way back when partition_job_depth was first introduced (which was eons ago now).  We run about 100 or so partitions, so this has served us well as a general rule.  What happens is that if you set the partition job depth too deep, the scheduler may not get through all the partitions before it has to give up and start again.  This led to partition starvation in the past, where jobs were waiting to be scheduled in a partition that had space but never started because the main loop never got to them.  The backfill loop took too long to clean up, so those jobs took forever to schedule.

With the various improvements to the scheduler this may no longer be the case, but I haven't taken the time to test it on our cluster, as our current setup has worked well.
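To make the rule of thumb concrete with our numbers: at roughly 100 partitions and partition_job_depth=10, the ideal default_queue_depth comes out around 100 * 10 = 1000, which is why the config below runs:

default_queue_depth=1150
partition_job_depth=10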

-Paul Edmon-

On 5/29/19 11:04 AM, Kilian Cavalotti wrote:
Hi Paul,

I'm wondering about this part in your SchedulerParameters:

### default_queue_depth should be some multiple of the partition_job_depth,
### ideally number_of_partitions * partition_job_depth, but typically the main
### loop exits prematurely if you go over about 400. A partition_job_depth of
### 10 seems to work well.

Do you remember if that's still the case, or if it was related to a reported issue? That sure sounds like something that would need to be fixed if it hasn't been already.

Cheers,
--
Kilian

On Wed, May 29, 2019 at 7:42 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:

    For reference we are running 18.08.7

    -Paul Edmon-

    On 5/29/19 10:39 AM, Paul Edmon wrote:

    Sure.  Here is what we have:

    ########################## Scheduling #####################################
    ### This section is specific to scheduling

    ### Tells the scheduler to enforce limits for all partitions
    ### that a job submits to.
    EnforcePartLimits=ALL
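
    ### For example, with EnforcePartLimits=ALL a multi-partition
    ### submission must satisfy the limits of every partition listed,
    ### not just the one it eventually runs in (partition names here
    ### are hypothetical):
    ###   sbatch -p short,long --time=2:00:00 job.sh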

    ### Lets slurm know that we have a jobsubmit.lua script
    JobSubmitPlugins=lua

    ### When a job is launched this has slurmctld send the user information
    ### instead of having AD do the lookup on the node itself.
    LaunchParameters=send_gids

    ### Maximum sizes for Jobs.
    MaxJobCount=200000
    MaxArraySize=10000
    DefMemPerCPU=100
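
    ### For illustration: valid array indexes run from 0 to
    ### MaxArraySize-1, so with the setting above the largest accepted
    ### range is 0-9999 (script name is just a stand-in):
    ###   sbatch --array=0-9999 array_job.sh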

    ### Job Timers
    CompleteWait=0

    ### We set the EpilogMsgTime long so that Epilog Messages don't pile up
    ### all at one time due to forced exit, which can cause problems for the
    ### master.
    EpilogMsgTime=3000000
    InactiveLimit=0
    KillWait=30

    ### This only applies to the reservation time limit; the job must still
    ### obey the partition time limit.
    ResvOverRun=UNLIMITED
    MinJobAge=600
    Waittime=0

    ### Scheduling parameters
    ### FastSchedule 2 lets slurm know not to auto-detect the node config
    ### but rather follow our definition.  We also use setting 2 because,
    ### due to our geographic size, nodes may drop out of slurm and then
    ### reconnect.  If we had 1 they would be set to drain when they
    ### reconnect.  Setting it to 2 allows them to rejoin without issue.
    FastSchedule=2
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
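
    ### With FastSchedule=2 slurmctld trusts the static node definitions
    ### rather than what each slurmd reports.  A sketch of such a
    ### definition (hypothetical node names and sizes):
    ###   NodeName=compute[001-100] CPUs=32 RealMemory=128000 State=UNKNOWN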

    ### Governs default preemption behavior
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
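
    ### With preempt/partition_prio, preemption is decided by each
    ### partition's PriorityTier: jobs in a higher-tier partition can
    ### requeue jobs from a lower-tier partition sharing the same nodes.
    ### A sketch with hypothetical partition names:
    ###   PartitionName=serial_requeue PriorityTier=1 Nodes=compute[001-100]
    ###   PartitionName=priority       PriorityTier=2 Nodes=compute[001-100]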

    ### default_queue_depth should be some multiple of the partition_job_depth,
    ### ideally number_of_partitions * partition_job_depth, but typically the
    ### main loop exits prematurely if you go over about 400. A
    ### partition_job_depth of 10 seems to work well.
    SchedulerParameters=\
    default_queue_depth=1150,\
    partition_job_depth=10,\
    max_sched_time=50,\
    bf_continue,\
    bf_interval=30,\
    bf_resolution=600,\
    bf_window=11520,\
    bf_max_job_part=0,\
    bf_max_job_user=10,\
    bf_max_job_test=10000,\
    bf_max_job_start=1000,\
    bf_ignore_newly_avail_nodes,\
    kill_invalid_depend,\
    pack_serial_at_end,\
    nohold_on_prolog_fail,\
    preempt_strict_order,\
    preempt_youngest_first,\
    max_rpc_cnt=8
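
    ### A useful way to check whether these depths and intervals are
    ### keeping up is sdiag, which reports main and backfill scheduler
    ### cycle times and depths:
    ###   sdiag          # show scheduler statistics
    ###   sdiag -r       # reset the counters after tuning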

    ################################ Fairshare ################################
    ### This section sets the fairshare calculations

    PriorityType=priority/multifactor

    ### Settings for fairshare calculation frequency and shape.
    FairShareDampeningFactor=1
    PriorityDecayHalfLife=28-0
    PriorityCalcPeriod=1
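
    ### With a 28-day half-life, recorded usage decays roughly as
    ###   usage(t) = usage(0) * 0.5^(t / 28 days)
    ### so past consumption stops dominating fairshare after a couple
    ### of months.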

    ### Settings for fairshare weighting.
    PriorityMaxAge=7-0
    PriorityWeightAge=10000000
    PriorityWeightFairshare=20000000
    PriorityWeightJobSize=0
    PriorityWeightPartition=0
    PriorityWeightQOS=1000000000
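
    ### The multifactor plugin computes a weighted sum of factors, each
    ### normalized to [0,1], so with the weights above a job's priority
    ### is roughly:
    ###   priority = 10000000*age + 20000000*fairshare + 1000000000*qos
    ### (job size and partition are zeroed out here).  sshare -a shows
    ### the resulting fairshare factor per account and user.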

    I'm happy to chat about any of the settings if you want, or share
    our full config.

    -Paul Edmon-

    On 5/29/19 10:17 AM, Julius, Chad wrote:

    All,

    We rushed our Slurm install due to a short timeframe and missed
    some important items.  We are now looking to implement a better
    system than the first-in, first-out we have now.  My question: are
    the defaults listed in the slurm.conf file a good start?  Would
    anyone be willing to share the Scheduling section of their .conf?
    Also, we are looking to increase the maximum array size, but I
    don't see that in the slurm.conf in version 17.  Am I looking at an
    upgrade of Slurm in the near future, or can I just add
    MaxArraySize=somenumber?
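
    (For illustration only, assuming the running version supports the
    parameter: the change being asked about would amount to adding a
    line such as MaxArraySize=10000 to slurm.conf and restarting
    slurmctld; verify against the slurm.conf man page for your release.)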

    The defaults as of 17.11.8 are:

    # SCHEDULING
    #SchedulerAuth=
    #SchedulerPort=
    #SchedulerRootFilter=
    #PriorityType=priority/multifactor
    #PriorityDecayHalfLife=14-0
    #PriorityUsageResetPeriod=14-0
    #PriorityWeightFairshare=100000
    #PriorityWeightAge=1000
    #PriorityWeightPartition=10000
    #PriorityWeightJobSize=1000
    #PriorityMaxAge=1-0

    Chad Julius
    Cyberinfrastructure Engineer Specialist
    Division of Technology & Security
    SOHO 207, Box 2231
    Brookings, SD 57007
    Phone: 605-688-5767
    www.sdstate.edu




--
Kilian
