I'm seeing what I think is some odd reservation behavior, and I'm
wondering whether SLURM is working as designed, whether this is a bug, or
whether it's something else. This happens with SLURM 2.6.3. Here's a
simplified version of the scenario:
* On a full system with two partitions (a and b) that overlap in their
node definitions, create a reservation:
scontrol create reservationname=test users=bbarth starttime=now
duration=5:00:00 nodecnt=500 partition=a flags=IGNORE_JOBS
* Wait and watch
sinfo -T | grep test | awk "{print \$6}" | xargs sinfo -p a -h -n
* If the system was full-ish when you started, only a handful of the
nodes in the reservation will be idle; most will still be allocated to
other jobs. As time goes on, jobs will finish on the system, but their
newly freed nodes will not be swapped into the reservation. Instead, you
must wait until the specific nodes selected for the reservation at
creation time become free.
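For what it's worth, here is a sketch of the watch step as a standalone snippet. It assumes the reservation is named "test" and that the NODELIST is the sixth column of sinfo -T output (true for this SLURM version, but worth double-checking on yours); the "%N %T" format string is just one convenient way to see node-by-state.

```shell
# Pull the nodelist for reservation "test" and show the state of each
# reserved node in partition a. Assumes NODELIST is column 6 of sinfo -T.
nodes=$(sinfo -T -h | awk '/test/ {print $6}')
sinfo -p a -h -n "$nodes" -o "%N %T"
```

Running this periodically is what shows the reserved set staying pinned to the originally chosen (mostly allocated) nodes rather than absorbing nodes as they go idle.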
This is easiest to test by first running a job on, say, 50 nodes, then
creating the reservation, and then killing the job. Once the nodes from
the job finish completing, they are not picked up by the reservation
unless they happen to have been reserved by it already.
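The reproduction steps above can be sketched as a sequence of commands. This is only an outline, not something I've scripted end to end: the --wrap job, the node count, and the sleep duration are placeholders, and the scontrol/scancel invocations match what I described.

```shell
# 1. Occupy 50 nodes in partition a with a throwaway job.
jobid=$(sbatch --parsable -p a -N 50 --wrap "sleep 7200")

# 2. While those nodes are busy, create the reservation.
scontrol create reservationname=test users=bbarth starttime=now \
    duration=5:00:00 nodecnt=500 partition=a flags=IGNORE_JOBS

# 3. Kill the job. One might expect its freed nodes to be swapped into
#    the reservation, but they are not (unless already part of it).
scancel "$jobid"

# 4. Inspect which nodes the reservation actually holds.
scontrol show reservationname=test
```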
The dual, overlapping-partition nature of this system might be a red
herring, but I thought I'd include it in the description in case it rings
a bell for someone. We use the overlapping partition definitions for
another purpose; that may be suboptimal, but it works for us for now.
Any thoughts on the matter would be helpful.
Thanks,
Bill.
--
Bill Barth, Ph.D., Director, HPC
[email protected] | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445