"Jette, Moe" <[email protected]> writes:

> I still haven't been able to see any significant delays in backfill
> scheduling. I have attached a patch which might help you. If you do
> give it a try, please let me know what the results are.

We tried it, but it did not give any speedup for the problematic jobs.

After much code-reading, log-forensics (and a bit of statistics :-), we
found out that the slow backfill tests happen for jobs that ask for
features or resources that only very few nodes have.  In our case, that
means asking for the feature hugemem (5 nodes out of 680), for a
specific rack, or for a single node.

For jobs like that, typically very many jobs have to be removed with
_rm_job_from_res() before nodes that the job can use become available.
We also discovered that it was not only the resulting many calls to
cr_job_test() that took time; the calls to _rm_job_from_res() could
also add up to several seconds for one backfill test.

We've thus created a patch that we are now using on our production
cluster.  It reduces the backfill time from about 6 seconds to typically
less than 1 second for these jobs.  We've run it for about 30 hours now,
and virtually all jobs are tested in at most 1 second (see attached graph).

<<attachment: backfilltimes.png>>

The patch does two things:

1) In _attempt_backfill() in backfill.c, if a job is waiting due to an
association resource limit, it is skipped.  (The idea is that the
association has spent its resources, so it should not compete with other
jobs for starting, and it will not start anyway.  We can sometimes have
up to 2000-3000 jobs in this state, so this removes a lot of unneeded
work from the backfiller.)

2) In _will_run_test() in select_cons_res.c, if a running/suspended job
has no bitmap overlap with the avail bitmap of the job we are testing,
we skip it without calling _rm_job_from_res() or cr_job_test().  (The
idea is that it will not help to remove it because it does not occupy
resources our job can use.)
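
To illustrate the idea behind 2) outside of SLURM, here is a minimal
sketch in C.  The type node_mask_t, the struct toy_job and the toy_*
functions are made up for this example and are not part of SLURM; a
64-bit integer stands in for the real node bitmaps.  Only the idea of
testing for overlap before doing the expensive work is taken from the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a node bitmap: bit i set means node i.
 * This toy version handles at most 64 nodes. */
typedef uint64_t node_mask_t;

/* Hypothetical running job: an id and the nodes it occupies. */
struct toy_job {
	unsigned job_id;
	node_mask_t nodes;
};

/* Number of nodes the two masks have in common (similar in spirit
 * to what the patch uses bit_overlap() for). */
static int toy_overlap(node_mask_t a, node_mask_t b)
{
	node_mask_t common = a & b;
	int count = 0;

	while (common) {
		common &= common - 1;	/* clear the lowest set bit */
		count++;
	}
	return count;
}

/* Count how many running jobs would still need the expensive
 * "remove job, then re-test" step for a candidate job that can only
 * use the nodes in `usable`.  Jobs without overlap are skipped. */
static size_t toy_count_expensive_steps(const struct toy_job *running,
					size_t n_running,
					node_mask_t usable)
{
	size_t steps = 0;

	assert(running != NULL || n_running == 0);
	for (size_t i = 0; i < n_running; i++) {
		if (toy_overlap(usable, running[i].nodes) == 0)
			continue;	/* removing it cannot help */
		steps++;		/* expensive work would go here */
	}
	return steps;
}
```

For a candidate job that can only use two nodes, only the running jobs
that touch those two nodes are counted; the rest are skipped without
any further work, which is where the saving comes from when the usable
set is small compared to the cluster.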

There are also a couple of debug messages that you might wish to remove
or change.

Here is the patch:

diff -B -c -r slurm-2.2.1/src/plugins/sched/backfill/backfill.c slurm-2.2.1.patched/src/plugins/sched/backfill/backfill.c
*** slurm-2.2.1/src/plugins/sched/backfill/backfill.c	Fri Jan 21 21:30:17 2011
--- slurm-2.2.1.patched/src/plugins/sched/backfill/backfill.c	Wed Feb 23 15:45:45 2011
***************
*** 511,516 ****
--- 511,528 ----
  		if (debug_flags & DEBUG_FLAG_BACKFILL)
  			info("backfill test for job %u", job_ptr->job_id);
  
+ 		if ((job_ptr->state_reason == WAIT_ASSOC_JOB_LIMIT) ||
+ 		    (job_ptr->state_reason == WAIT_ASSOC_RESOURCE_LIMIT) ||
+ 		    (job_ptr->state_reason == WAIT_ASSOC_TIME_LIMIT)) {
+ 			debug2("backfill: job %u is not allowed to run now. "
+ 			      "Skipping it. State=%s. Reason=%s. Priority=%u.",
+ 			      job_ptr->job_id,
+ 			      job_state_string(job_ptr->job_state),
+ 			      job_reason_string(job_ptr->state_reason),
+ 			      job_ptr->priority);
+ 			continue;
+ 		}
+ 
  		if (((part_ptr->state_up & PARTITION_SCHED) == 0) ||
  		    (part_ptr->node_bitmap == NULL))
  		 	continue;
***************
*** 628,635 ****
--- 640,651 ----
  			}
  		}
  		/* this is the time consuming operation */
+ 		debug2("backfill: entering _try_sched for job %u.",
+ 		       job_ptr->job_id);
  		j = _try_sched(job_ptr, &avail_bitmap,
  			       min_nodes, max_nodes, req_nodes);
+ 		debug2("backfill: finished _try_sched for job %u.",
+ 		       job_ptr->job_id);
  		now = time(NULL);
  		if (j != SLURM_SUCCESS) {
  			job_ptr->time_limit = orig_time_limit;
diff -B -c -r slurm-2.2.1/src/plugins/select/cons_res/select_cons_res.c slurm-2.2.1.patched/src/plugins/select/cons_res/select_cons_res.c
*** slurm-2.2.1/src/plugins/select/cons_res/select_cons_res.c	Tue Jan 25 19:42:20 2011
--- slurm-2.2.1.patched/src/plugins/select/cons_res/select_cons_res.c	Wed Feb 23 15:48:36 2011
***************
*** 1474,1482 ****
  		if (job_iterator == NULL)
  			fatal ("memory allocation failure");
  		while ((tmp_job_ptr = list_next(job_iterator))) {
  			_rm_job_from_res(future_part, future_usage,
  					 tmp_job_ptr, 0);
- 			bit_or(bitmap, orig_map);
  			rc = cr_job_test(job_ptr, bitmap, min_nodes,
  					 max_nodes, req_nodes,
  					 SELECT_MODE_WILL_RUN, cr_type,
--- 1474,1488 ----
  		if (job_iterator == NULL)
  			fatal ("memory allocation failure");
  		while ((tmp_job_ptr = list_next(job_iterator))) {
+ 		        int ovrlap;
+ 			bit_or(bitmap, orig_map);
+ 			ovrlap = bit_overlap(bitmap, tmp_job_ptr->node_bitmap);
+ 			if (ovrlap == 0)
+ 				continue;
+ 			debug2("cons_res: _will_run_test, job %u: overlap=%d",
+ 				tmp_job_ptr->job_id, ovrlap);
  			_rm_job_from_res(future_part, future_usage,
  					 tmp_job_ptr, 0);
  			rc = cr_job_test(job_ptr, bitmap, min_nodes,
  					 max_nodes, req_nodes,
  					 SELECT_MODE_WILL_RUN, cr_type,

I hope you will find it usable.  It seems to work fine for us.

(We'd also like to know if we are doing something stupid here. :-)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
