Thanks, Lyn. That's very interesting. I'm glad that this is reproducible.
I was driving myself crazy over here.

Thanks again,
Bill.
-- 
Bill Barth, Ph.D., Director, HPC
[email protected]        |   Phone: (512) 232-7069
Office: ROC 1.435             |   Fax:   (512) 475-9445







On 7/17/15, 2:24 PM, "Lyn Gerner" <[email protected]> wrote:

>Hi Bill,
>
>
>I was also able to reproduce this behavior on 14.11.5.  The conflicting
>reservation time rolled around, and with -vvv on ctld, no errors are
>logged.  However, a job submitted to the one-time res doesn't (didn't)
>start.  The job submitted to the DAILY (test1) res shows a start time
>that defers to the one-time (test2) res:
>
>
># scontrol show res
>
>ReservationName=test1 StartTime=2015-07-17T17:00:03
>EndTime=2015-07-17T19:00:03 Duration=02:00:00
>   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null)
>PartitionName=compute Flags=DAILY
>   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
>
>
>ReservationName=test2 StartTime=2015-07-17T17:00:00
>EndTime=2015-07-17T19:00:00 Duration=02:00:00
>   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null)
>PartitionName=compute Flags=
>   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
>
>
>
># scontrol show jobs
>
>JobId=1502 JobName=sbatch
>   UserId=root(0) GroupId=root(0)
>   Priority=1 Nice=0 Account=root QOS=normal WCKey=*
>   JobState=PENDING Reason=Reservation Dependency=(null)
>   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
>   SubmitTime=2015-07-17T17:35:40 EligibleTime=2015-07-17T17:35:40
>   StartTime=Unknown EndTime=Unknown
>   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>   Partition=compute AllocNode:Sid=master:5384
>   ReqNodeList=(null) ExcNodeList=(null)
>   NodeList=(null)
>   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>   Features=(null) Gres=(null) Reservation=test2
>   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>   Command=(null)
>   WorkDir=/var/log/slurm
>   StdErr=/var/log/slurm/slurm-1502.out
>   StdIn=/dev/null
>   StdOut=/var/log/slurm/slurm-1502.out
>
>
>JobId=1503 JobName=sbatch
>   UserId=root(0) GroupId=root(0)
>   Priority=1 Nice=0 Account=root QOS=normal WCKey=*
>   JobState=PENDING Reason=Reservation Dependency=(null)
>   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
>   SubmitTime=2015-07-17T17:35:48 EligibleTime=2015-07-17T17:35:48
>   StartTime=2015-07-17T19:00:01 EndTime=Unknown
>   PreemptTime=None SuspendTime=None SecsPreSuspend=0
>   Partition=compute AllocNode:Sid=master:5384
>   ReqNodeList=(null) ExcNodeList=(null)
>   NodeList=(null) SchedNodeList=node001
>   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>   Features=(null) Gres=(null) Reservation=test1
>   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>   Command=(null)
>   WorkDir=/var/log/slurm
>   StdErr=/var/log/slurm/slurm-1503.out
>   StdIn=/dev/null
>   StdOut=/var/log/slurm/slurm-1503.out
>
>
># sq
>
>   JOBID PRIOR      QOS  PARTITION   NAME     USER ST       TIME TIME_LIMIT           START_TIME NODE CPUS FEAT NODELIST(REASON)
>    1503     1   normal    compute sbatch     root PD       0:00       1:00  2015-07-17T19:00:01    1    1 (nul (Reservation)
>    1502     1   normal    compute sbatch     root PD       0:00       1:00                  N/A    1    1 (nul (Reservation)
>
>
>Interestingly, the job that requested the one-time (test2) res ran after
>its reservation had ended, and the job requesting the daily (test1) res
>did not run (however, we have ResvOverRun=0, and were at the end of the
>res):
>
>
># sacct -j 1502
>     JobID      QOS  Partition     User    JobName NNod Allo  MaxVMSize     State                Start  Elapsed ExitCod   NodeList
>---------- -------- ---------- -------- ---------- ---- ---- ---------- --------- -------------------- -------- ------- ----------
>1502         normal    compute     root     sbatch    1    1            COMPLETED  2015-07-17T19:01:27 00:00:30     0:0    node001
>1502.batch                                   batch    1    1    207016K COMPLETED  2015-07-17T19:01:27 00:00:30     0:0    node001
>
>
>
>
># sq
>   JOBID PRIOR      QOS  PARTITION   NAME     USER ST       TIME TIME_LIMIT           START_TIME NODE CPUS FEAT NODELIST(REASON)
>    1503     1   normal    compute sbatch     root PD       0:00       1:00                  N/A    1    1 (nul (Reservation)
>
>
>Best,
>Lyn
>
>
>
>
>
>
>
>
>On Thu, Jul 16, 2015 at 9:15 AM, Jared David Baker
><[email protected]> wrote:
>
>Hello Bill,
>
>I upgraded to 14.11.8 on Monday and have followed your test specs below.
>I got the same result: both reservations select the same node on our system.
>
>--
>$ scontrol show res
>ReservationName=test1 StartTime=07.16-10:01:34 EndTime=07.16-12:01:34
>Duration=02:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
>Flags=DAILY
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
>
>ReservationName=test2 StartTime=07.17-10:05:00 EndTime=07.17-12:05:00
>Duration=02:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
>Flags=
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
>--
>
>I would consider this a bug, since the OVERLAP flag was not applied when
>the reservations were created. Perhaps this is a 'feature' whereby a
>reservation with the DAILY flag applied can automatically consume a
>single reservation instance, although I wouldn't agree with that design.
>I even tried reversing your steps, creating a reservation for the future
>and then creating a daily reservation, in the hope that it would select
>different nodes, but it did not.
>
>--
>$  scontrol show res
>ReservationName=testA StartTime=07.17-10:30:00 EndTime=07.17-11:30:00
>Duration=01:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
>Flags=
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
>
>ReservationName=testB StartTime=07.16-10:31:58 EndTime=07.16-11:31:58
>Duration=01:00:00
>   Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
>Flags=DAILY
>   Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
>--
>
>Does this help your investigation?
>
>-Jared
>
>
>
>-----Original Message-----
>From: Bill Barth [mailto:[email protected]]
>Sent: Thursday, July 16, 2015 9:54 AM
>To: slurm-dev
>Subject: [slurm-dev] DAILY reservations end up overlapped with other
>reservations on the same partition
>
>
>This is my third post on the subject, but I'd like to see if anyone on the
>list who is running 14.11.3 or later can reproduce. We upgraded to 14.11.7
>on Tuesday, but the problem hasn't gone away. Below are the simple steps
>to reproduce:
>
>1. Create a reservation for right now on a free node in a partition
>
>   scontrol create reservation=test1 StartTime=now Duration=2:00:00 \
>       Partition=osu NodeCnt=1 flags=DAILY Users=bbarth
>
>2. Create a reservation for 24 hours later on one node in the same
>partition:
>
>   scontrol create reservation=test2 StartTime=2015-07-16T13:30:00 \
>       Duration=2:00:00 Partition=osu NodeCnt=1 Users=bbarth
>
>These reservations select the same node in my experience, because test2
>does not appear to take into account the fact that the node could be used
>by test1 the next day. Come today, the reservations now overlap. FWIW,
>this is exactly what scontrol show res showed right after creation
>yesterday, except that test1 was listed as ACTIVE.
>
>ReservationName=test1 StartTime=2015-07-16T13:18:39
>EndTime=2015-07-16T15:18:39 Duration=02:00:00
>   Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu
>Flags=DAILY
>   Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
>ReservationName=test2 StartTime=2015-07-16T13:30:00
>EndTime=2015-07-16T15:30:00 Duration=02:00:00
>   Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu
>Flags=
>   Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
>I would be much obliged if someone out there would test this on their
>recent SLURM system to see if they can reproduce the problem. I intend to
>test this on our SLURM test set of VMs with a fresh 14.11.7 installation
>as well, but I wanted to get this message out immediately first while
>we're setting up that test.
>
>
>Has anyone else seen this?
>
>
>Thanks,
>Bill.
>
>
>--
>Bill Barth, Ph.D., Director, HPC
>[email protected]        |   Phone: (512) 232-7069
>Office: ROC 1.435             |   Fax:   (512) 475-9445
>
>
>
>
>
>
>
>
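
For anyone following the thread, the conflict described in the original
report can be checked by hand. Below is a minimal Python sketch, not
Slurm's actual node-selection logic: it only expands a DAILY reservation
into its daily occurrences and tests each one against a one-time window.
The function names (overlaps, daily_conflict), the 30-day horizon, and the
assumption that test1 was created at 2015-07-15T13:18:39 (inferred from the
07-16 listing in the quoted report) are illustrative and do not come from
Slurm.

```python
from datetime import datetime, timedelta


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open interval test: [a_start, a_end) intersects [b_start, b_end)."""
    return a_start < b_end and b_start < a_end


def daily_conflict(daily_start, duration, other_start, other_end, horizon_days=30):
    """Return the start of the first daily occurrence that overlaps the
    one-time window, or None if there is none within horizon_days.

    This is a hypothetical helper for illustration; it assumes both
    reservations sit on the same node, as in the report.
    """
    for day in range(horizon_days + 1):
        occ_start = daily_start + timedelta(days=day)
        occ_end = occ_start + duration
        if overlaps(occ_start, occ_end, other_start, other_end):
            return occ_start
        if occ_start >= other_end:
            # Later occurrences start even further past the window.
            break
    return None


if __name__ == "__main__":
    # test1: DAILY, two hours, assumed to have been created the day before.
    test1_start = datetime(2015, 7, 15, 13, 18, 39)
    # test2: one-time, two hours, from the quoted scontrol output.
    test2_start = datetime(2015, 7, 16, 13, 30, 0)
    test2_end = test2_start + timedelta(hours=2)
    print(daily_conflict(test1_start, timedelta(hours=2), test2_start, test2_end))
```

Under these assumptions the second occurrence of test1 (on 07-16) collides
with test2, which matches the overlapping listing in the report. A check of
this kind at reservation-creation time is what one would expect to steer
test2 onto a different node unless OVERLAP was requested.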
