Hi,

Running the simulator with a long trace exposes a bug in the backfill code.
I am using version 2.2.6, but the bug still seems to be present in 2.3.

Line 570 of plugins/sched/backfill/backfill.c checks whether a job
belongs to a QOS with the NoReserve flag set, but the qos_ptr variable is
only updated at the end of the loop, so when the check runs it still refers
to the wrong job. The attached patch does not include a fix for this.
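
To make the stale-pointer problem concrete, here is a minimal sketch. It does not use the real Slurm types: struct qos_rec, struct job_rec, the flag value and both function names are simplified, hypothetical stand-ins, and the loop shape is only my reading of the code around line 570.

```c
#include <stddef.h>
#include <stdint.h>

#define QOS_FLAG_NO_RESERVE 0x1 /* illustrative value, not Slurm's */

/* Simplified, hypothetical stand-ins for the Slurm records. */
struct qos_rec { uint32_t flags; };
struct job_rec {
	uint32_t job_id;
	uint32_t time_limit;     /* minutes */
	struct qos_rec *qos_ptr; /* NULL if the job has no QOS */
};

/* Stale pattern: qos_ptr is refreshed only at the end of each
 * iteration, so the check sees the QOS of the previous job. */
void shorten_stale(struct job_rec *jobs, size_t n)
{
	struct qos_rec *qos_ptr = NULL;
	for (size_t i = 0; i < n; i++) {
		struct job_rec *job_ptr = &jobs[i];
		if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE))
			job_ptr->time_limit = 1;
		qos_ptr = job_ptr->qos_ptr; /* updated too late */
	}
}

/* Possible fix: refresh qos_ptr before it is tested. */
void shorten_fresh(struct job_rec *jobs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct job_rec *job_ptr = &jobs[i];
		struct qos_rec *qos_ptr = job_ptr->qos_ptr;
		if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE))
			job_ptr->time_limit = 1;
	}
}
```

With a NoReserve job followed by a job without any QOS, shorten_stale() leaves the first job untouched and shortens the second one, while shorten_fresh() shortens only the first.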

Line 571 then sets the job's time_limit to 1 minute. I cannot
understand why this is done, since it can let a job from a NoReserve
QOS overtake a higher-priority job. Maybe there is a reason for this, but
I cannot see it.

This change has to be reverted so that the job does not run with
the wrong time_limit value, but that is not done in every code path.
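
A rough sketch of the restore pattern follows. The struct, the yield callbacks and the function names are hypothetical and simplified, not the real Slurm code. Putting the shortened limit back after an uneventful yield is an assumption about the intended behaviour.

```c
#include <stdint.h>
#include <time.h>

/* Simplified, hypothetical job record. */
struct job_rec {
	uint32_t time_limit; /* minutes */
	time_t start_time;
	time_t end_time;
};

/* Records the limit that was visible while the locks were released. */
uint32_t limit_seen_during_yield;

/* Hypothetical yield stubs: nonzero means the system state changed. */
int yield_state_unchanged(struct job_rec *job_ptr)
{
	limit_seen_during_yield = job_ptr->time_limit;
	return 0;
}

int yield_state_changed(struct job_rec *job_ptr)
{
	limit_seen_during_yield = job_ptr->time_limit;
	return 1;
}

/* Expose the real limit while the locks are released. If the state
 * changed, keep the real limit, since the caller leaves the loop. */
int yield_with_real_limit(struct job_rec *job_ptr,
			  uint32_t orig_time_limit,
			  int (*yield_locks)(struct job_rec *))
{
	uint32_t save_time_limit = job_ptr->time_limit;

	job_ptr->time_limit = orig_time_limit;
	if (yield_locks(job_ptr))
		return 1;
	job_ptr->time_limit = save_time_limit; /* assumed intent */
	return 0;
}

/* After a NoReserve job starts, restore the limit and end time. */
void restore_after_start(struct job_rec *job_ptr,
			 uint32_t orig_time_limit)
{
	job_ptr->time_limit = orig_time_limit;
	job_ptr->end_time = job_ptr->start_time +
			    (time_t)orig_time_limit * 60;
}
```

The point is that every exit from the section that uses the 1 minute value either restores the original limit or deliberately keeps the temporary one, and that end_time is recomputed from the original limit once the job has started.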

The attached patch fixes this.

--- backfill.c.orig	2011-09-29 17:35:54.000000000 +0200
+++ backfill.c	2011-09-29 17:44:24.000000000 +0200
@@ -634,15 +634,21 @@
 		bit_not(resv_bitmap);
 
 		if ((time(NULL) - sched_start) >= this_sched_timeout) {
+			int save_time_limit;
+
 			debug("backfill: loop taking too long, yielding locks");
+			save_time_limit = job_ptr->time_limit;
+			job_ptr->time_limit = orig_time_limit;
 			if (_yield_locks()) {
 				debug("backfill: system state changed, "
 				      "breaking out");
 				rc = 1;
+				job_ptr->time_limit = orig_time_limit;
 				break;
 			} else {
 				this_sched_timeout += sched_timeout;
 			}
+			job_ptr->time_limit = save_time_limit;
 		}
 		/* this is the time consuming operation */
 		debug2("backfill: entering _try_sched for job %u.",
@@ -664,8 +670,10 @@
 		}
 		if (job_ptr->start_time <= now) {
 			int rc = _start_job(job_ptr, resv_bitmap);
-			if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE))
+			if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
 				job_ptr->time_limit = orig_time_limit;
+				job_ptr->end_time = job_ptr->start_time + (orig_time_limit * 60);
+			}
 			else if ((rc == SLURM_SUCCESS) && job_ptr->time_min) {
 				/* Set time limit as high as possible */
 				job_ptr->time_limit = comp_time_limit;