I found this in version 2.6.9. I am not sure whether it still exists in later versions. The patch is against version 16.05.
The root cause of this bug is that the time limit of a job is sometimes INFINITE. In that case backfill fails to start the job, because some drained nodes are not excluded when backfill tests whether the job is runnable.
From c4b58603301f3c885499ba2663fa6d09755fa881 Mon Sep 17 00:00:00 2001
From: Hongjia Cao <[email protected]>
Date: Fri, 16 Oct 2015 12:53:51 +0800
Subject: [PATCH] fix bug in sched/backfill

---
 src/plugins/sched/backfill/backfill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/plugins/sched/backfill/backfill.c b/src/plugins/sched/backfill/backfill.c
index 4aeabc2..23ee1f4 100644
--- a/src/plugins/sched/backfill/backfill.c
+++ b/src/plugins/sched/backfill/backfill.c
@@ -1069,7 +1069,7 @@ next_task:
 			part_time_limit = YEAR_MINUTES;
 		else
 			part_time_limit = part_ptr->max_time;
-		if (job_ptr->time_limit == NO_VAL) {
+		if (job_ptr->time_limit == NO_VAL || job_ptr->time_limit == INFINITE) {
 			time_limit = part_time_limit;
 			job_ptr->limit_set.time = 1;
 		} else {
@@ -1163,6 +1163,8 @@ next_task:
 			end_time = (time_limit * 60) + start_res;
 		else
 			end_time = (time_limit * 60) + now;
+		if (end_time < now)
+			end_time = INFINITE;
 		resv_end = find_resv_end(start_res);
 		/* Identify usable nodes for this job */
 		bit_and(avail_bitmap, part_ptr->node_bitmap);
--
2.6.1
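To make the intent of the two hunks easier to follow, here is a minimal standalone sketch in C. It is not the Slurm code: the DEMO_* constants and both demo_* helpers are hypothetical stand-ins, the sentinel values are assumed for illustration, and all arithmetic is assumed to be 32-bit unsigned so that the wrap-around is visible. The real types in backfill.c may differ, and the else branch of the first hunk is not shown in the patch, so the sketch simply returns the job's own limit there.

```c
#include <stdint.h>

/* Assumed stand-ins for Slurm's sentinels; values chosen for illustration. */
#define DEMO_INFINITE     0xffffffffu
#define DEMO_NO_VAL       0xfffffffeu
#define DEMO_YEAR_MINUTES (365u * 24u * 60u)

/*
 * Mirrors the first hunk: a job whose limit is unset (NO_VAL) or
 * INFINITE falls back to the partition limit, and an INFINITE
 * partition limit is replaced by one year.
 */
static uint32_t demo_effective_limit(uint32_t job_limit, uint32_t part_max_time)
{
	uint32_t part_limit;

	if (part_max_time == DEMO_INFINITE)
		part_limit = DEMO_YEAR_MINUTES;
	else
		part_limit = part_max_time;

	if (job_limit == DEMO_NO_VAL || job_limit == DEMO_INFINITE)
		return part_limit;

	/* Simplification: the patch does not show this branch. */
	return job_limit;
}

/*
 * Mirrors the second hunk: convert minutes to seconds, add the start
 * time, and clamp to INFINITE if the 32-bit sum wrapped below start.
 */
static uint32_t demo_end_time(uint32_t limit_minutes, uint32_t start)
{
	uint32_t end = (uint32_t)((uint32_t)(limit_minutes * 60u) + start);

	if (end < start)
		end = DEMO_INFINITE;
	return end;
}
```

Under these assumptions, demo_end_time(DEMO_INFINITE, 1445000000u) wraps to a value 60 seconds before the start and is clamped to DEMO_INFINITE, while demo_effective_limit(DEMO_INFINITE, 120u) returns the partition limit of 120 minutes, so the wrap never occurs once the first hunk is applied.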
