After giving this some more thought, treating any job with a reservation as
having a higher priority will result in better scheduling performance, with or
without the sched/backfill plugin. The attached patch, which is just a few
lines of code, does not change any job's priority value. Instead, it changes
the sort algorithm to consider job reservations in addition to job priority
and preemptability, so a job with a reservation is ordered as if it had a
higher priority than jobs without reservations.
This will be included in SLURM v2.2.4.
________________________________________
From: [email protected] [[email protected]] On Behalf
Of [email protected] [[email protected]]
Sent: Wednesday, March 16, 2011 8:25 AM
To: [email protected]
Subject: [slurm-dev] Jobs requesting a reservation prevented from running
Here is a very simple scenario on SLURM 2.2.3 that prevents a job requesting
use of a reservation from running.
1. A reservation is created for user 'A' for some number of nodes.
2. Prior to the start time of the reservation, user 'B' requests a job that
needs one or more of the nodes in the reservation, but the time limit on his
job would run into the reservation time so his job is PENDING for reason
'Resources'.
3. Still prior to the start time of the reservation, user A requests a job
using the reservation. Because the start time has not arrived, his job is also
PENDING, with reason 'Reservation'.
4. When the start time of the reservation arrives, user A's job now goes to
reason 'Resources'. But because user A's job is in the queue behind user B's
job, it can't start.
5. When the reservation reaches its end time, the reservation expires. Now
B's job can run, so it does. When B's job finishes, A's job is the next in
the queue, but it can't run because the reservation has expired. And the
reservation can't be cleaned up because there is a job attached. So you end
up with a job that won't start and an expired reservation that won't go away.
This only occurs on a job that is queued up on the reservation before the
StartTime. A job that is submitted by A after the reservation StartTime
bypasses the waiting B's job on the same partition and runs. However, if a
second job by A is requested, and gets set to PENDING with reason 'Resources'
because it has to wait for the first 'A' job, then it also ends up waiting
behind user B's job.
I am not sure what is supposed to happen here. The "Resource Reservation
Guide" at "https://computing.llnl.gov/linux/slurm/reservations.html" does not
state what occurs when a user that is not on the reservation tries to run a job
that uses some of the resources in the reservation. Should user B's job have
been rejected instead of going to the PENDING state? Should the priority of
user A's jobs have been increased to allow them to go to the head of the queue? Or
am I just misunderstanding something about how the reservations work?
-Don Albert-
Index: src/slurmctld/job_scheduler.c
===================================================================
--- src/slurmctld/job_scheduler.c (revision 22791)
+++ src/slurmctld/job_scheduler.c (working copy)
@@ -605,12 +605,20 @@
 {
 	job_queue_rec_t *job_rec1 = (job_queue_rec_t *) x;
 	job_queue_rec_t *job_rec2 = (job_queue_rec_t *) y;
+	bool has_resv1, has_resv2;
 
 	if (slurm_job_preempt_check(job_rec1, job_rec2))
 		return -1;
 	if (slurm_job_preempt_check(job_rec2, job_rec1))
 		return 1;
 
+	has_resv1 = (job_rec1->job_ptr->resv_id != 0);
+	has_resv2 = (job_rec2->job_ptr->resv_id != 0);
+	if (has_resv1 && !has_resv2)
+		return -1;
+	if (!has_resv1 && has_resv2)
+		return 1;
+
 	if (job_rec1->job_ptr->priority < job_rec2->job_ptr->priority)
 		return 1;
 	if (job_rec1->job_ptr->priority > job_rec2->job_ptr->priority)
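
For anyone who wants to try the ordering outside of slurmctld, here is a
minimal stand-alone sketch of the same comparison logic. It is an illustration
only, not the code in the patch: the struct demo_job type and the
demo_job_cmp() and demo_sort_queue() names are hypothetical, the preemption
checks are left out, and the standard qsort() is assumed in place of
slurmctld's own list sorting.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued job; only the fields the comparison
 * reads are modeled. This is NOT the real slurmctld job record. */
struct demo_job {
	uint32_t job_id;
	uint32_t priority;
	uint32_t resv_id;	/* 0 means the job has no reservation */
};

/* qsort() comparator: jobs holding a reservation sort first, then jobs are
 * ordered by descending priority. The preemption checks performed by the
 * real function are omitted in this sketch. */
static int demo_job_cmp(const void *x, const void *y)
{
	const struct demo_job *a = x;
	const struct demo_job *b = y;
	bool has_resv_a = (a->resv_id != 0);
	bool has_resv_b = (b->resv_id != 0);

	if (has_resv_a && !has_resv_b)
		return -1;
	if (!has_resv_a && has_resv_b)
		return 1;

	if (a->priority < b->priority)
		return 1;
	if (a->priority > b->priority)
		return -1;
	return 0;
}

/* Sort a job queue in place using the comparator above. */
static void demo_sort_queue(struct demo_job *queue, size_t count)
{
	qsort(queue, count, sizeof(queue[0]), demo_job_cmp);
}
```

In terms of the scenario reported below the separator, user B's job would be
an entry with resv_id 0 and user A's job an entry with a non-zero resv_id.
After demo_sort_queue(), A's job sorts ahead of B's even if B's job has the
larger priority value, and neither priority value is modified.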