I believe it is still the case, but I haven't tested it. I put this in
way back when partition_job_depth was first introduced (which was eons
ago now). We run about 100 or so partitions, so this has served us well
as a general rule. What happens is that if you set partition_job_depth
too deep, the scheduler may not get through all the partitions before it
has to give up and start again. This led to partition starvation in the
past, where jobs were waiting to be scheduled in a partition that had
space but never started because the main loop never got to them. The
backfill loop took too long to clean up, so those jobs took forever to
schedule.
With the various improvements to the scheduler this may no longer be the
case, but I haven't taken the time to test it on our cluster as our
current setup has worked well.
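As a rough sketch of that rule of thumb (hypothetical numbers, assuming a
cluster with about 100 partitions; not our exact values):

```
# Hypothetical sizing: default_queue_depth ~ number_of_partitions * partition_job_depth.
# With ~100 partitions and a depth of 10, the main scheduling loop considers
# up to 10 jobs per partition per pass, ~1000 jobs in total per pass.
SchedulerParameters=default_queue_depth=1000,partition_job_depth=10
```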
-Paul Edmon-
On 5/29/19 11:04 AM, Kilian Cavalotti wrote:
Hi Paul,
I'm wondering about this part in your SchedulerParameters:
### default_queue_depth should be some multiple of the partition_job_depth,
### ideally number_of_partitions * partition_job_depth, but typically the main
### loop exits prematurely if you go over about 400. A partition_job_depth of
### 10 seems to work well.
Do you remember if that's still the case, or if it's in relation with
a reported issue? That sure sounds like something that would need to
be fixed if it hasn't been already.
Cheers,
--
Kilian
On Wed, May 29, 2019 at 7:42 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
For reference we are running 18.08.7
-Paul Edmon-
On 5/29/19 10:39 AM, Paul Edmon wrote:
Sure. Here is what we have:
########################## Scheduling #####################################
### This section is specific to scheduling
### Tells the scheduler to enforce limits for all partitions
### that a job submits to.
EnforcePartLimits=ALL
### Lets slurm know that we have a jobsubmit.lua script
JobSubmitPlugins=lua
### When a job is launched this has slurmctld send the user information
### instead of having AD do the lookup on the node itself.
LaunchParameters=send_gids
### Maximum sizes for Jobs.
MaxJobCount=200000
MaxArraySize=10000
DefMemPerCPU=100
### Job Timers
CompleteWait=0
### We set the EpilogMsgTime long so that Epilog Messages don't pile up
### all at one time due to forced exit, which can cause problems for the
### master.
EpilogMsgTime=3000000
InactiveLimit=0
KillWait=30
### This only applies to the reservation time limit; the job must still
### obey the partition time limit.
ResvOverRun=UNLIMITED
MinJobAge=600
Waittime=0
### Scheduling parameters
### FastSchedule 2 lets slurm know not to auto detect the node config
### but rather follow our definition. We also use setting 2 because, due
### to our geographic size, nodes may drop out of slurm and then
### reconnect. If we had 1 they would be set to drain when they
### reconnect. Setting it to 2 allows them to rejoin without issue.
FastSchedule=2
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
### Governs default preemption behavior
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
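With preempt/partition_prio, preemption order comes from the partitions'
PriorityTier values. A hypothetical pair of partition definitions (names
and node list made up, not from our config) to illustrate:

```
# Jobs in "priority" (higher PriorityTier) can requeue jobs from "shared"
# when they compete for the same nodes, given PreemptMode=REQUEUE above.
PartitionName=shared   Nodes=node[01-10] PriorityTier=1  State=UP
PartitionName=priority Nodes=node[01-10] PriorityTier=10 State=UP
```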
### default_queue_depth should be some multiple of the partition_job_depth,
### ideally number_of_partitions * partition_job_depth, but typically the main
### loop exits prematurely if you go over about 400. A partition_job_depth of
### 10 seems to work well.
SchedulerParameters=\
default_queue_depth=1150,\
partition_job_depth=10,\
max_sched_time=50,\
bf_continue,\
bf_interval=30,\
bf_resolution=600,\
bf_window=11520,\
bf_max_job_part=0,\
bf_max_job_user=10,\
bf_max_job_test=10000,\
bf_max_job_start=1000,\
bf_ignore_newly_avail_nodes,\
kill_invalid_depend,\
pack_serial_at_end,\
nohold_on_prolog_fail,\
preempt_strict_order,\
preempt_youngest_first,\
max_rpc_cnt=8
################################ Fairshare ################################
### This section sets the fairshare calculations
PriorityType=priority/multifactor
### Settings for fairshare calculation frequency and shape.
FairShareDampeningFactor=1
PriorityDecayHalfLife=28-0
PriorityCalcPeriod=1
### Settings for fairshare weighting.
PriorityMaxAge=7-0
PriorityWeightAge=10000000
PriorityWeightFairshare=20000000
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=1000000000
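Roughly, the multifactor plugin combines the weights above like this (each
factor normalized to 0..1; a sketch of the weighting, not the exact
implementation):

```
# Priority ~ PriorityWeightAge       * age_factor        (10000000)
#          + PriorityWeightFairshare * fairshare_factor  (20000000)
#          + PriorityWeightQOS       * qos_factor        (1000000000)
# QOS dominates, then fairshare, then age; job size and partition are
# zero-weighted and drop out.
```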
I'm happy to chat about any of the settings if you want, or share
our full config.
-Paul Edmon-
On 5/29/19 10:17 AM, Julius, Chad wrote:
All,
We rushed our Slurm install due to a short timeframe and missed
some important items. We are now looking to implement a better
system than the first-in, first-out we have now. My question: are
the defaults listed in the slurm.conf file a good start? Would
anyone be willing to share the Scheduling section of their .conf?
Also, we are looking to increase the maximum array size, but I
don't see that in the slurm.conf in version 17. Am I looking at an
upgrade of Slurm in the near future or can I just add
MaxArraySize=somenumber?
The defaults as of 17.11.8 are:
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
*Chad Julius*
Cyberinfrastructure Engineer Specialist
*Division of Technology & Security*
SOHO 207, Box 2231
Brookings, SD 57007
Phone: 605-688-5767
www.sdstate.edu <http://www.sdstate.edu/>