There are also a couple of DebugFlags that show in great (i.e. very
verbose) detail what the backfill scheduler is doing.
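
For reference, a minimal sketch of switching such flags on at runtime. It assumes the flags meant here are Backfill and BackfillMap (the message does not name them) and that "scontrol setdebugflags" is available on the controller host. The helper names are hypothetical.

```python
import shutil
import subprocess

# Assumption: these are the two backfill-related DebugFlags; the message
# above does not name them.
BACKFILL_FLAGS = ("Backfill", "BackfillMap")


def debugflag_command(flag, enable=True):
    """Build the scontrol argument list that turns one DebugFlag on or off."""
    sign = "+" if enable else "-"
    return ["scontrol", "setdebugflags", sign + flag]


def set_backfill_debug(enable=True):
    """Run the commands if scontrol is on PATH; always return the commands built."""
    commands = [debugflag_command(flag, enable) for flag in BACKFILL_FLAGS]
    if shutil.which("scontrol") is not None:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Prints the commands whether or not scontrol was found and executed.
    for cmd in set_backfill_debug(enable=True):
        print(" ".join(cmd))
```

Because the resulting slurmctld log output is very verbose, it is probably worth calling set_backfill_debug(enable=False) again once enough scheduling cycles have been captured.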
Quoting Christopher Samuel <[email protected]>:
On 22/11/14 05:39, Trey Dockendorf wrote:
Currently one user has the 1500 pending jobs, and the reason shown in
squeue is (Resources), (Priority) or, for the vast majority, (None).
To me that sounds like the backfill scheduler is not getting to the ones
labelled "None".
This is our current SchedulerParameters:
This is what we use on our clusters and our BlueGene/Q, all of which can
have many thousands of jobs queued waiting to run - for example one of
our Intel clusters currently has over 1,400 jobs waiting and none are
labelled as "None".
SchedulerParameters=bf_window=43200,bf_resolution=600,bf_max_job_user=5,max_job_bf=10000,bf_continue,defer
Everything seems to perform well with those settings; for instance,
slurmctld is at around 8GB virtual and only ~35MB RSS.
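
To make the settings above easier to read, here is a small sketch that splits the SchedulerParameters string and converts two of the values into friendlier units. The units are as I understand them from the slurm.conf man page (bf_window in minutes, bf_resolution in seconds), so treat them as assumptions and check the documentation for your Slurm version. The function name is hypothetical.

```python
# Settings string copied from the message above.
SCHEDULER_PARAMETERS = (
    "bf_window=43200,bf_resolution=600,bf_max_job_user=5,"
    "max_job_bf=10000,bf_continue,defer"
)


def parse_scheduler_parameters(value):
    """Split a SchedulerParameters value into a dict.

    key=value items become integer entries; bare flags become True.
    """
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, raw = item.partition("=")
        params[key] = int(raw) if sep else True
    return params


params = parse_scheduler_parameters(SCHEDULER_PARAMETERS)

# Assumption: bf_window is expressed in minutes.
window_days = params["bf_window"] / (60 * 24)
# Assumption: bf_resolution is expressed in seconds.
resolution_minutes = params["bf_resolution"] / 60
# Number of time steps the backfill window spans at that resolution.
time_slots = params["bf_window"] * 60 // params["bf_resolution"]

print(f"backfill looks {window_days:g} days ahead "
      f"in {resolution_minutes:g}-minute steps ({time_slots} slots)")
print(f"jobs considered per user: {params['bf_max_job_user']}")
print(f"bare flags: {sorted(k for k, v in params.items() if v is True)}")
```

As far as I know, max_job_bf is an older name for bf_max_job_test, so newer configurations may need the latter spelling.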
Best of luck!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
--
Morris "Moe" Jette
CTO, SchedMD LLC