Hi Bill,

I was also able to reproduce this behavior on 14.11.5.  When the conflicting
reservation time rolled around, slurmctld (running with -vvv) logged no
errors.  However, a job submitted to the one-time (test2) res did not
start, and the job submitted to the DAILY (test1) res shows a start time
that defers to the one-time (test2) res:


# scontrol show res
ReservationName=test1 StartTime=2015-07-17T17:00:03 EndTime=2015-07-17T19:00:03 Duration=02:00:00
   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute Flags=DAILY
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=test2 StartTime=2015-07-17T17:00:00 EndTime=2015-07-17T19:00:00 Duration=02:00:00
   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute Flags=
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
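
For anyone who wants to check their own output for the same condition, here
is a rough sketch (not part of Slurm; the function names are made up for
illustration) that parses `scontrol show res` text and reports pairs of
reservations whose time windows intersect on a shared node. It assumes plain
comma-separated node names, so hostlist expressions such as node[001-004]
are not expanded, and it only compares the windows currently displayed
rather than projecting DAILY recurrences forward.

```python
import re
from datetime import datetime
from itertools import combinations

# Sample input copied from the `scontrol show res` output in this message.
SAMPLE = """\
ReservationName=test1 StartTime=2015-07-17T17:00:03 EndTime=2015-07-17T19:00:03 Duration=02:00:00
   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute Flags=DAILY
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=test2 StartTime=2015-07-17T17:00:00 EndTime=2015-07-17T19:00:00 Duration=02:00:00
   Nodes=node001 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=compute Flags=
   Users=root Accounts=(null) Licenses=(null) State=ACTIVE
"""

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_reservations(text):
    """Return one dict of key=value fields per blank-line-separated record."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = dict(re.findall(r"(\w+)=(\S*)", block))
        if "ReservationName" in fields:
            records.append(fields)
    return records


def overlapping_pairs(records):
    """Return (name_a, name_b, shared_nodes) for windows that intersect."""
    hits = []
    for a, b in combinations(records, 2):
        a_start = datetime.strptime(a["StartTime"], TIME_FORMAT)
        a_end = datetime.strptime(a["EndTime"], TIME_FORMAT)
        b_start = datetime.strptime(b["StartTime"], TIME_FORMAT)
        b_end = datetime.strptime(b["EndTime"], TIME_FORMAT)
        # Assumption: Nodes is a plain comma-separated list of host names.
        shared = set(a["Nodes"].split(",")) & set(b["Nodes"].split(","))
        # Two half-open intervals intersect when each starts before the other ends.
        if shared and a_start < b_end and b_start < a_end:
            hits.append((a["ReservationName"], b["ReservationName"], sorted(shared)))
    return hits


if __name__ == "__main__":
    for name_a, name_b, nodes in overlapping_pairs(parse_reservations(SAMPLE)):
        print(f"{name_a} and {name_b} overlap on {','.join(nodes)}")
```

To run it against a live system, replace SAMPLE with the captured output of
`scontrol show res`.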


# scontrol show jobs
JobId=1502 JobName=sbatch
   UserId=root(0) GroupId=root(0)
   Priority=1 Nice=0 Account=root QOS=normal WCKey=*
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2015-07-17T17:35:40 EligibleTime=2015-07-17T17:35:40
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=compute AllocNode:Sid=master:5384
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=test2
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/var/log/slurm
   StdErr=/var/log/slurm/slurm-1502.out
   StdIn=/dev/null
   StdOut=/var/log/slurm/slurm-1502.out


JobId=1503 JobName=sbatch
   UserId=root(0) GroupId=root(0)
   Priority=1 Nice=0 Account=root QOS=normal WCKey=*
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2015-07-17T17:35:48 EligibleTime=2015-07-17T17:35:48
   StartTime=2015-07-17T19:00:01 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=compute AllocNode:Sid=master:5384
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=node001
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=test1
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/var/log/slurm
   StdErr=/var/log/slurm/slurm-1503.out
   StdIn=/dev/null
   StdOut=/var/log/slurm/slurm-1503.out


# sq
   JOBID PRIOR      QOS  PARTITION           NAME     USER ST       TIME TIME_LIMIT           START_TIME NODE CPUS FEAT NODELIST(REASON)
    1503     1   normal    compute         sbatch     root PD       0:00       1:00  2015-07-17T19:00:01    1    1 (nul (Reservation)
    1502     1   normal    compute         sbatch     root PD       0:00       1:00                  N/A    1    1 (nul (Reservation)


Interestingly, the job that requested the one-time (test2) res ran only
after its reservation had ended, and the job requesting the daily (test1)
res did not run (however, we have ResvOverRun=0, and by then we were at the
end of the res):


# sacct -j 1502
     JobID      QOS  Partition     User    JobName NNod Allo  MaxVMSize     State                Start  Elapsed ExitCod   NodeList
---------- -------- ---------- -------- ---------- ---- ---- ---------- --------- -------------------- -------- ------- ----------
1502         normal    compute     root     sbatch    1    1            COMPLETED  2015-07-17T19:01:27 00:00:30     0:0    node001
1502.batch                                   batch    1    1    207016K COMPLETED  2015-07-17T19:01:27 00:00:30     0:0    node001



# sq
   JOBID PRIOR      QOS  PARTITION           NAME     USER ST       TIME TIME_LIMIT           START_TIME NODE CPUS FEAT NODELIST(REASON)
    1503     1   normal    compute         sbatch     root PD       0:00       1:00                  N/A    1    1 (nul (Reservation)


Best,

Lyn


On Thu, Jul 16, 2015 at 9:15 AM, Jared David Baker <[email protected]>
wrote:

> Hello Bill,
>
> I upgraded to 14.11.8 on Monday and have followed your test specs below. I
> got the same result: both reservations select the same node on our system.
>
> --
> $ scontrol show res
> ReservationName=test1 StartTime=07.16-10:01:34 EndTime=07.16-12:01:34
> Duration=02:00:00
>    Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=DAILY
>    Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
>
> ReservationName=test2 StartTime=07.17-10:05:00 EndTime=07.17-12:05:00
> Duration=02:00:00
>    Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=
>    Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
> --
>
> I would consider this a bug, since the OVERLAP flag was not applied when
> the reservations were created. Perhaps it is a 'feature' that a
> reservation with the DAILY flag applied can automatically consume a single
> reservation instance, although I wouldn't agree with that design. I even
> tried reversing your steps, creating a reservation for the future and then
> a daily reservation, in the hope that it would select different nodes, but
> it did not.
>
> --
> $  scontrol show res
> ReservationName=testA StartTime=07.17-10:30:00 EndTime=07.17-11:30:00
> Duration=01:00:00
>    Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=
>    Users=jbaker2 Accounts=(null) Licenses=(null) State=INACTIVE
>
> ReservationName=testB StartTime=07.16-10:31:58 EndTime=07.16-11:31:58
> Duration=01:00:00
>    Nodes=mmm218 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=moran
> Flags=DAILY
>    Users=jbaker2 Accounts=(null) Licenses=(null) State=ACTIVE
> --
>
> Does this help your investigation?
>
> -Jared
>
>
>
> -----Original Message-----
> From: Bill Barth [mailto:[email protected]]
> Sent: Thursday, July 16, 2015 9:54 AM
> To: slurm-dev
> Subject: [slurm-dev] DAILY reservations end up overlapped with other
> reservations on the same partition
>
>
> This is my third post on the subject, but I'd like to see if anyone on the
> list who is running 14.11.3 or later can reproduce. We upgraded to 14.11.7
> on Tuesday, but the problem hasn't gone away. Below are the simple steps
> to reproduce:
>
> 1. Create a reservation for right now on a free node in a partition
>
>    scontrol create reservation=test1 StartTime=now Duration=2:00:00
> Partition=osu NodeCnt=1 flags=DAILY Users=bbarth
>
> 2. Create a reservation for 24 hours later on one node in the same
> partition:
>
>    scontrol create reservation=test2 StartTime=2015-07-16T13:30:00
> Duration=2:00:00 Partition=osu NodeCnt=1 Users=bbarth
>
> These reservations select the same node in my experience, because test2
> does not appear to take into account the fact that the node could be used
> by test1 the next day. Come today, the reservations now overlap. FWIW,
> this is exactly what scontrol show res showed right after creation
> yesterday, except that test1 was listed as ACTIVE.
>
> ReservationName=test1 StartTime=2015-07-16T13:18:39
> EndTime=2015-07-16T15:18:39 Duration=02:00:00
>    Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu
> Flags=DAILY
>    Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
> ReservationName=test2 StartTime=2015-07-16T13:30:00
> EndTime=2015-07-16T15:30:00 Duration=02:00:00
>    Nodes=c445-001 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=osu
> Flags=
>    Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE
>
> I would be much obliged if someone out there would test this on their
> recent SLURM system to see if they can reproduce the problem. I intend to
> test this on our SLURM test set of VMs with a fresh 14.11.7 installation
> as well, but I wanted to get this message out immediately first while
> we're setting up that test.
>
>
> Has anyone else seen this?
>
>
> Thanks,
> Bill.
>
>
> --
> Bill Barth, Ph.D., Director, HPC
> [email protected]        |   Phone: (512) 232-7069
> Office: ROC 1.435             |   Fax:   (512) 475-9445
>
>
>
>
