Thanks, Lyn. That's very interesting. I'm glad that this is reproducible. I was driving myself crazy over here.
Thanks again,
Bill.
--
Bill Barth, Ph.D., Director, HPC
[email protected] | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 7/17/15, 2:24 PM, "Lyn Gerner" <[email protected]> wrote:

>Hi Bill,
>
>I was also able to reproduce this behavior on 14.11.5. The conflicting
>reservation time rolled around, and with -vvv on ctld, no errors are
>logged. However, a job submitted to the one-time res doesn't (didn't)
>start. The job submitted to the DAILY (test1) res shows a start time
>that defers to the one-time (test2) res:
>
># scontrol show res
>
>ReservationName=test1 StartTime=2015-07-17T17:00:03 EndTime=2015-07-17T19:00:03 Duration=02:00:00
>   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute Flags=DAILY
>   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
>
>ReservationName=test2 StartTime=2015-07-17T17:00:00 EndTime=2015-07-17T19:00:00 Duration=02:00:00
>   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute Flags=
>   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
>
># scontrol show jobs
>
>JobId=1502 JobName=sbatch
>   UserId=root(0) GroupId=root(0)
>   Priority=1 Nice=0 Account=root QOS=normal WCKey=*
>   JobState=PENDING Reason=Reservation Dependency=(null)
>   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
>   SubmitTime=2015-07-17T17:35:40 EligibleTime=2015-07-17T17:35:40
>   StartTime=Unknown EndTime=Unknown
>   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>   Partition=compute AllocNode:Sid=master:5384
>   ReqNodeList=(null) ExcNodeList=(null)
>   NodeList=(null)
>   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>   Features=(null) Gres=(null) Reservation=test2
>   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>   Command=(null)
>   WorkDir=/var/log/slurm
>   StdErr=/var/log/slurm/slurm-1502.out
>   StdIn=/dev/null
>   StdOut=/var/log/slurm/slurm-1502.out
>
>JobId=1503 JobName=sbatch
>   UserId=root(0) GroupId=root(0)
>   Priority=1 Nice=0 Account=root QOS=normal WCKey=*
>   JobState=PENDING Reason=Reservation Dependency=(null)
>   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
>   SubmitTime=2015-07-17T17:35:48 EligibleTime=2015-07-17T17:35:48
>   StartTime=2015-07-17T19:00:01 EndTime=Unknown
>   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>   Partition=compute AllocNode:Sid=master:5384
>   ReqNodeList=(null) ExcNodeList=(null)
>   NodeList=(null) SchedNodeList=node001
>   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>   Features=(null) Gres=(null) Reservation=test1
>   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>   Command=(null)
>   WorkDir=/var/log/slurm
>   StdErr=/var/log/slurm/slurm-1503.out
>   StdIn=/dev/null
>   StdOut=/var/log/slurm/slurm-1503.out
>
># sq
>
>JOBID PRIOR QOS    PARTITION NAME   USER ST TIME TIME_LIMIT START_TIME          NODE CPUS FEAT NODELIST(REASON)
>1503  1     normal compute   sbatch root PD 0:00 1:00       2015-07-17T19:00:01 1    1    (nul (Reservation)
>1502  1     normal compute   sbatch root PD 0:00 1:00       N/A                 1    1    (nul (Reservation)
>
>Interestingly, the job that requested the one-time (test2) res ran after
>its reservation had ended, and the job requesting the daily (test1) res
>did not run (however, we have ResvOverRun=0, and were at the end of the
>res):
>
># sacct -j 1502
>JobID      QOS      Partition  User     JobName  NNod Allo MaxVMSize  State     Start                Elapsed  ExitCod NodeList
>---------- -------- ---------- -------- -------- ---- ---- ---------- --------- -------------------- -------- ------- ----------
>1502       normal   compute    root     sbatch   1    1               COMPLETED 2015-07-17T19:01:27  00:00:30 0:0     node001
>1502.batch                              batch    1    1    207016K    COMPLETED 2015-07-17T19:01:27  00:00:30 0:0     node001
>
># sq
>JOBID PRIOR QOS    PARTITION NAME   USER ST TIME TIME_LIMIT START_TIME NODE CPUS FEAT NODELIST(REASON)
>1503  1     normal compute   sbatch root PD 0:00 1:00       N/A        1    1    (nul (Reservation)
>
>Best,
>Lyn
>
>On Thu, Jul 16, 2015 at 9:15 AM, Jared David Baker
><[email protected]> wrote:
>
>Hello Bill,
>
>I upgraded to 14.11.8 on Monday and have followed your test specs below.
>I did get the same result that it selects the same node on our system.
>
>--
>$ scontrol show res
>ReservationName=test1 StartTime=07.16-10:01:34 EndTime=07.16-12:01:34 Duration=02:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran Flags=DAILY
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
>
>ReservationName=test2 StartTime=07.17-10:05:00 EndTime=07.17-12:05:00 Duration=02:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran Flags=
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
>--
>
>I think that I would assume this as a bug since the OVERLAP flag was not
>applied during the creation of the reservations. Perhaps this is a
>'feature' that a reservation with the DAILY flag applied automatically
>can consume a single reservation instance, although I wouldn't agree
>with that ideology. I even tried to reverse your steps such that I
>create a reservation for the future, then create a daily reservation in
>hopes it would select different nodes, but it did not.
>
>--
>$ scontrol show res
>ReservationName=testA StartTime=07.17-10:30:00 EndTime=07.17-11:30:00 Duration=01:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran Flags=
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
>
>ReservationName=testB StartTime=07.16-10:31:58 EndTime=07.16-11:31:58 Duration=01:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran Flags=DAILY
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
>--
>
>Does this help your investigation?
>
>-Jared
>
>-----Original Message-----
>From: Bill Barth [mailto:[email protected]]
>Sent: Thursday, July 16, 2015 9:54 AM
>To: slurm-dev
>Subject: [slurm-dev] DAILY reservations end up overlapped with other
>reservations on the same partition
>
>This is my third post on the subject, but I'd like to see if anyone on the
>list who is running 14.11.3 or later can reproduce. We upgraded to 14.11.7
>on Tuesday, but the problem hasn't gone away. Below are the simple steps
>to reproduce:
>
>1. Create a reservation for right now on a free node in a partition:
>
>   scontrol create reservation=test1 StartTime=now Duration=2:00:00 Partition=osu NodeCnt=1 flags=DAILY Users=bbarth
>
>2. Create a reservation for 24 hours later on one node in the same
>partition:
>
>   scontrol create reservation=test2 StartTime=2015-07-16T13:30:00 Duration=2:00:00 Partition=osu NodeCnt=1 Users=bbarth
>
>These reservations select the same node in my experience, because test2
>does not appear to take into account the fact that the node could be used
>by test1 the next day. Come today, the reservations now overlap. FWIW,
>this is exactly what scontrol show res showed right after creation
>yesterday, except that test1 was listed as ACTIVE.
>
>ReservationName=test1 StartTime=2015-07-16T13:18:39 EndTime=2015-07-16T15:18:39 Duration=02:00:00
>   Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu Flags=DAILY
>   Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
>ReservationName=test2 StartTime=2015-07-16T13:30:00 EndTime=2015-07-16T15:30:00 Duration=02:00:00
>   Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu Flags=
>   Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
>I would be much obliged if someone out there would test this on their
>recent SLURM system to see if they can reproduce the problem. I intend to
>test this on our SLURM test set of VMs with a fresh 14.11.7 installation
>as well, but I wanted to get this message out immediately first while
>we're setting up that test.
>
>Has anyone else seen this?
>
>Thanks,
>Bill.
>
>--
>Bill Barth, Ph.D., Director, HPC
>[email protected] | Phone: (512) 232-7069
>Office: ROC 1.435 | Fax: (512) 475-9445
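
For anyone who wants to check the collision in the quoted report without a Slurm install, here is a minimal Python sketch. It is not Slurm's scheduler logic. It only models a DAILY reservation as a window that repeats every 24 hours and asks whether any repeat intersects a one-time window. The helper name first_daily_overlap is hypothetical, and the 2015-07-15 creation date for test1 is an assumption inferred from "24 hours later" and "yesterday" in the original post.

```python
from datetime import datetime, timedelta

DAY = timedelta(days=1)


def first_daily_overlap(daily_start, daily_end, once_start, once_end):
    """Return the first (start, end) occurrence of a window repeating every
    24 hours that overlaps the one-time window, or None if none does.

    Hypothetical helper for illustration; this is not Slurm's own logic.
    Windows are treated as half-open intervals [start, end).
    """
    # Smallest repeat index k >= 0 whose occurrence ends after once_start.
    k = max(0, (once_start - daily_end) // DAY + 1)
    occ_start = daily_start + k * DAY
    occ_end = daily_end + k * DAY
    # Earlier occurrences end too soon and later ones start even later,
    # so occurrence k is the only candidate for an overlap.
    if occ_start < once_end:
        return occ_start, occ_end
    return None


# test1 (DAILY): assumed to have been created on 2015-07-15 at 13:18:39.
test1 = (datetime(2015, 7, 15, 13, 18, 39), datetime(2015, 7, 15, 15, 18, 39))
# test2 (one-time): start and end as shown in the report.
test2 = (datetime(2015, 7, 16, 13, 30, 0), datetime(2015, 7, 16, 15, 30, 0))

hit = first_daily_overlap(*test1, *test2)
print(hit)
```

Under those assumptions the second occurrence of test1 intersects test2, which matches the two overlapping windows on c445-001 in the scontrol output above. A site could run a check of this kind over parsed scontrol show res output before accepting a new reservation, though that wiring is left out here.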
