Hi Bill,

I was also able to reproduce this behavior on 14.11.5. The conflicting
reservation time rolled around, and with -vvv on slurmctld, no errors were
logged. However, a job submitted to the one-time res did not start. The
job submitted to the DAILY (test1) res shows a start time that defers to
the one-time (test2) res:

# scontrol show res
ReservationName=test1 StartTime=2015-07-17T17:00:03
EndTime=2015-07-17T19:00:03 Duration=02:00:00
Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute
Flags=DAILY
Users=root Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=test2 StartTime=2015-07-17T17:00:00
EndTime=2015-07-17T19:00:00 Duration=02:00:00
Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute
Flags=
Users=root Accounts=(null) Licenses=(null) State=ACTIVE

# scontrol show jobs
JobId=1502 JobName=sbatch
UserId=root(0) GroupId=root(0)
Priority=1 Nice=0 Account=root QOS=normal WCKey=*
JobState=PENDING Reason=Reservation Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2015-07-17T17:35:40 EligibleTime=2015-07-17T17:35:40
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=compute AllocNode:Sid=master:5384
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=test2
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/var/log/slurm
StdErr=/var/log/slurm/slurm-1502.out
StdIn=/dev/null
StdOut=/var/log/slurm/slurm-1502.out

JobId=1503 JobName=sbatch
UserId=root(0) GroupId=root(0)
Priority=1 Nice=0 Account=root QOS=normal WCKey=*
JobState=PENDING Reason=Reservation Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2015-07-17T17:35:48 EligibleTime=2015-07-17T17:35:48
StartTime=2015-07-17T19:00:01 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=compute AllocNode:Sid=master:5384
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=node001
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=test1
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/var/log/slurm
StdErr=/var/log/slurm/slurm-1503.out
StdIn=/dev/null
StdOut=/var/log/slurm/slurm-1503.out

# sq
JOBID  PRIOR  QOS     PARTITION  NAME    USER  ST  TIME  TIME_LIMIT  START_TIME           NODE  CPUS  FEAT  NODELIST(REASON)
1503   1      normal  compute    sbatch  root  PD  0:00  1:00        2015-07-17T19:00:01  1     1     (nul  (Reservation)
1502   1      normal  compute    sbatch  root  PD  0:00  1:00        N/A                  1     1     (nul  (Reservation)
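
For anyone who wants to check the conflict mechanically, here is a minimal
Python sketch (purely illustrative, not part of Slurm). It takes the times
and node from the scontrol show res output above, treats a DAILY
reservation as repeating every 24 hours, and lists the windows where two
reservations hold the same node at once. The Reservation class, the
occurrences() and conflicts() helpers, and the 7-day look-ahead are all
assumptions made up for this example; this is not how slurmctld evaluates
reservations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reservation:
    """Hypothetical stand-in for one record of `scontrol show res`."""
    name: str
    start: datetime
    end: datetime
    nodes: frozenset
    daily: bool = False


def occurrences(res, days):
    """Yield (start, end) windows; a DAILY reservation repeats every 24 hours."""
    repeats = days if res.daily else 1
    for i in range(repeats):
        shift = timedelta(days=i)
        yield res.start + shift, res.end + shift


def conflicts(a, b, days=7):
    """Return the overlapping windows of a and b on shared nodes.

    `days` is an assumed look-ahead horizon for DAILY reservations.
    """
    if not (a.nodes & b.nodes):
        return []
    return [
        (max(sa, sb), min(ea, eb))
        for sa, ea in occurrences(a, days)
        for sb, eb in occurrences(b, days)
        if sa < eb and sb < ea  # half-open interval overlap test
    ]


# Values copied from the scontrol show res output above.
test1 = Reservation("test1", datetime(2015, 7, 17, 17, 0, 3),
                    datetime(2015, 7, 17, 19, 0, 3),
                    frozenset({"node001"}), daily=True)
test2 = Reservation("test2", datetime(2015, 7, 17, 17, 0, 0),
                    datetime(2015, 7, 17, 19, 0, 0),
                    frozenset({"node001"}))

hits = conflicts(test1, test2)
for start, end in hits:
    print(f"{test1.name} and {test2.name} overlap: "
          f"{start.isoformat()} -> {end.isoformat()}")
```

With the values above it reports a single shared window on node001, from
17:00:03 to 19:00:00.
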
Interestingly, the job that requested the one-time (test2) res ran only
after its reservation had ended, and the job requesting the daily (test1)
res did not run (however, we have ResvOverRun=0, and were at the end of
the res by then):

# sacct -j 1502
JobID       QOS     Partition  User  JobName  NNod  Allo  MaxVMSize  State      Start                Elapsed   ExitCod  NodeList
----------  ------  ---------  ----  -------  ----  ----  ---------  ---------  -------------------  --------  -------  --------
1502        normal  compute    root  sbatch   1     1                COMPLETED  2015-07-17T19:01:27  00:00:30  0:0      node001
1502.batch                           batch    1     1     207016K    COMPLETED  2015-07-17T19:01:27  00:00:30  0:0      node001

# sq
JOBID  PRIOR  QOS     PARTITION  NAME    USER  ST  TIME  TIME_LIMIT  START_TIME           NODE  CPUS  FEAT  NODELIST(REASON)
1503   1      normal  compute    sbatch  root  PD  0:00  1:00        N/A                  1     1     (nul  (Reservation)
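
As a sanity check on the timing, the short Python sketch below compares
the Start value sacct reports for job 1502 with the test2 window from
scontrol show res. The in_window() helper is hypothetical and exists only
for this example.

```python
from datetime import datetime


def in_window(ts, start, end):
    """True if ts falls inside the half-open window [start, end)."""
    return start <= ts < end


# test2 window, from scontrol show res
res_start = datetime(2015, 7, 17, 17, 0, 0)
res_end = datetime(2015, 7, 17, 19, 0, 0)

# Start of job 1502, from sacct
job_start = datetime(2015, 7, 17, 19, 1, 27)

started_inside = in_window(job_start, res_start, res_end)
late_by = job_start - res_end

print(f"started inside test2: {started_inside}")
print(f"started after test2 ended by: {late_by}")
```

So job 1502 started 1 minute 27 seconds after the reservation it
requested had already ended.
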
Best,
Lyn
On Thu, Jul 16, 2015 at 9:15 AM, Jared David Baker <[email protected]>
wrote:
> Hello Bill,
>
> I upgraded to 14.11.8 on Monday and have followed your test specs below. I
> got the same result: both reservations select the same node on our system.
>
> --
> $ scontrol show res
> ReservationName=test1 StartTime=07.16-10:01:34 EndTime=07.16-12:01:34
> Duration=02:00:00
> Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=DAILY
> Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
>
> ReservationName=test2 StartTime=07.17-10:05:00 EndTime=07.17-12:05:00
> Duration=02:00:00
> Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=
> Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
> --
>
> I would consider this a bug, since the OVERLAP flag was not
> applied during the creation of the reservations. Perhaps it is a
> 'feature' that a reservation with the DAILY flag can automatically
> consume a single reservation instance, although I wouldn't agree with that
> design. I even tried reversing your steps, creating a reservation for the
> future and then a daily reservation, in the hope that it would select
> different nodes, but it did not.
>
> --
> $ scontrol show res
> ReservationName=testA StartTime=07.17-10:30:00 EndTime=07.17-11:30:00
> Duration=01:00:00
> Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=
> Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
>
> ReservationName=testB StartTime=07.16-10:31:58 EndTime=07.16-11:31:58
> Duration=01:00:00
> Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=DAILY
> Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
> --
>
> Does this help your investigation?
>
> -Jared
>
>
>
> -----Original Message-----
> From: Bill Barth [mailto:[email protected]]
> Sent: Thursday, July 16, 2015 9:54 AM
> To: slurm-dev
> Subject: [slurm-dev] DAILY reservations end up overlapped with other
> reservations on the same partition
>
>
> This is my third post on the subject, but I'd like to see if anyone on the
> list who is running 14.11.3 or later can reproduce. We upgraded to 14.11.7
> on Tuesday, but the problem hasn't gone away. Below are the simple steps
> to reproduce:
>
> 1. Create a reservation for right now on a free node in a partition
>
> scontrol create reservation=test1 StartTime=now Duration=2:00:00
> Partition=osu NodeCnt=1 flags=DAILY Users=bbarth
>
> 2. Create a reservation for 24 hours later on one node in the same
> partition:
>
> scontrol create reservation=test2 StartTime=2015-07-16T13:30:00
> Duration=2:00:00 Partition=osu NodeCnt=1 Users=bbarth
>
> These reservations select the same node in my experience, because test2
> does not appear to take into account the fact that the node could be used
> by test1 the next day. Come today, the reservations now overlap. FWIW,
> this is exactly what scontrol show res showed right after creation
> yesterday, except that test1 was listed as ACTIVE.
>
> ReservationName=test1 StartTime=2015-07-16T13:18:39
> EndTime=2015-07-16T15:18:39 Duration=02:00:00
> Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu
> Flags=DAILY
> Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
> ReservationName=test2 StartTime=2015-07-16T13:30:00
> EndTime=2015-07-16T15:30:00 Duration=02:00:00
> Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu
> Flags=
> Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
> I would be much obliged if someone out there would test this on their
> recent SLURM system to see if they can reproduce the problem. I intend to
> test this on our SLURM test set of VMs with a fresh 14.11.7 installation
> as well, but I wanted to get this message out immediately first while
> we're setting up that test.
>
>
> Has anyone else seen this?
>
>
> Thanks,
> Bill.
>
>
> --
> Bill Barth, Ph.D., Director, HPC
> [email protected] | Phone: (512) 232-7069
> Office: ROC 1.435 | Fax: (512) 475-9445
>
>
>
>