The job_req_node_filter(job_ptr, *avail_bitmap) call should take almost
no time to execute. I still haven't been able to see any significant delays
in backfill scheduling. I have attached a patch which might help you. If you
do give it a try, please let me know what the results are. The patch is
well documented and you can experiment with the parameters. It can
be used to decrease the time consumed by backfill scheduling by
reducing its resolution.
Index: src/plugins/select/cons_res/select_cons_res.c
===================================================================
--- src/plugins/select/cons_res/select_cons_res.c (revision 22492)
+++ src/plugins/select/cons_res/select_cons_res.c (working copy)
@@ -109,6 +109,23 @@
#define NODEINFO_MAGIC 0x82aa
+/* The following parameters can be used to speed up the logic used to determine
+ * when and where a pending job will be able to start, but it will reduce
+ * the accuracy with which backfill scheduling is performed. It may improve
+ * performance of sched/backfill considerably if there are many running jobs.
+ * The original logic would simulate removing running jobs one at a time from
+ * an emulated system and test if the pending job could start on the resources
+ * then available. The new logic will remove jobs individually until
+ * SCHED_SKIP_START jobs have been removed, then remove SCHED_SKIP_COUNT jobs
+ * at each iteration until the job can be scheduled. */
+#ifndef SCHED_SKIP_START
+#define SCHED_SKIP_START 50
+#endif
+
+#ifndef SCHED_SKIP_COUNT
+#define SCHED_SKIP_COUNT 10
+#endif
+
/* These are defined here so when we link with something other than
* the slurmctld we will have these symbols defined. They will get
* overwritten when linking with the slurmctld.
@@ -1467,8 +1484,12 @@
}
/* Remove the running jobs one at a time from exp_node_cr and try
- * scheduling the pending job after each one */
+ * scheduling the pending job after each one. For larger job counts,
+ * remove multiple jobs between tests to reduce overhead. */
if (rc != SLURM_SUCCESS) {
+ int jobs_rm_last_test = 0;
+ int jobs_rm_total = 0;
+ int jobs_run_total = list_count(cr_job_list);
list_sort(cr_job_list, _cr_job_list_sort);
job_iterator = list_iterator_create(cr_job_list);
if (job_iterator == NULL)
@@ -1476,6 +1497,13 @@
while ((tmp_job_ptr = list_next(job_iterator))) {
_rm_job_from_res(future_part, future_usage,
tmp_job_ptr, 0);
+ jobs_rm_total++;
+ jobs_rm_last_test++;
+ if ((jobs_rm_total > SCHED_SKIP_START) &&
+ (jobs_rm_total < jobs_run_total) &&
+ (jobs_rm_last_test < SCHED_SKIP_COUNT))
+ continue;
+ jobs_rm_last_test = 0;
bit_or(bitmap, orig_map);
rc = cr_job_test(job_ptr, bitmap, min_nodes,
max_nodes, req_nodes,
________________________________________
From: [email protected] [[email protected]] On Behalf
Of Bjørn-Helge Mevik [[email protected]]
Sent: Wednesday, February 16, 2011 6:22 AM
To: [email protected]
Subject: Re: [slurm-dev] Slow backfill testing of some jobs.
"Jette, Moe" <[email protected]> writes:
> My tests of this show try_sched() completing in a few milliseconds
> and I don't see how the existence of a constraint would measurably
> impact performance.
The only thing I can see (from the code, but I don't know whether it makes
any difference in practice), is that if the job has constraints with
counts, _try_sched() runs job_req_node_filter(job_ptr, *avail_bitmap) to
reduce the nodes to test, but not if the job only has constraints
without counts.
I'm sorry for the lack of details.
> What version of SLURM are you using?
2.2.1
> What is your configuration?
Rough overview: ~ 650 nodes, ~ 4400 cpus, priority/multifactor, sched/backfill,
select/cons_res with CR_CPU_Memory, PreemptMode requeue, preempt/qos.
Details: see attached slurm.conf
> Do you have job preemption configured and if so, how?
Yes, preemption by requeueing, PreemptType=preempt/qos. Basically, all
QoS'es can preempt jobs in the lowpri QoS.
> How many active and queued jobs are there?
At the time, about 1000 running jobs, and about 1000 queued jobs.
The problem is most likely related to the load of the cluster, so it is
hard to investigate this on our test cluster. Is there some
debug/logging output that would help us figure out what happens?