Mark,

This was easy to reproduce in SLURM version 2.3, and there is a simple two-line patch (appended below) that corrects the problem. In case you have difficulty applying the patch to SLURM version 2.2, you can reference the change in git here:

https://github.com/SchedMD/slurm/commit/176c3ce9e7d311a36185530cb113e1203a188506



Quoting Mark Nelson <[email protected]>:

Hi there,

We're running SLURM 2.2.7 and we've found some odd behaviour of salloc
that I wanted to ask about:
If a user asks for an allocation using salloc that ends up pending
because the machine is full and they then Ctrl-C's that salloc, the
resulting job ends up in a "COMPLETED" state with a negative RunTime
because the StartTime was set to be some time in the future (and the
EndTime is the time that the salloc was Ctrl-C'ed).

This isn't seen if the salloc is cancelled externally with scancel -
in this case the JobState ends up being CANCELLED with a RunTime of 0
and StartTime is set to be equal to EndTime (the time of the scancel).

So it seems that when salloc gets the SIGINT and exits it doesn't set
the StartTime to be equal to the EndTime (or set RunTime to 0). Should
it do this?

I haven't reproduced it with SLURM 2.3 yet but I'm pretty sure we saw
this behaviour in SLURM 2.1.x as well. I also haven't started digging
into the code yet but I figured I'd ask first just in case it's an
obvious fix for someone more familiar with the code.

Here's the output of scontrol show job for the couple of cases outlined above:

# the Ctrl-C'ed case:
~> salloc -N 128 --time=20:0
salloc: Pending job allocation 17654
salloc: job 17654 queued and waiting for resources
salloc: Job aborted due to signal
salloc: Job allocation 17654 has been revoked.

~> scontrol show job 17654
JobId=17654 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T12:08:14 EligibleTime=2011-07-26T12:08:14
   StartTime=2011-07-26T17:11:57 EndTime=Unknown
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default

~> scontrol show job 17654
JobId=17654 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=COMPLETED Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=-05:-03:-32 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T12:08:14 EligibleTime=2011-07-26T12:08:14
   StartTime=2011-07-26T17:11:57 EndTime=2011-07-26T12:08:25
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default


# the scancel case:
~> salloc -N 128 --time=20:0
salloc: Pending job allocation 17658
salloc: job 17658 queued and waiting for resources
salloc: Job allocation 17658 has been revoked.
salloc: Job has been cancelled
salloc: error: Failed to allocate resources: No error

~> scontrol show job 17658
JobId=17658 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T13:00:31 EligibleTime=2011-07-26T13:00:31
   StartTime=2011-07-26T17:23:47 EndTime=Unknown
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default

~> scancel 17658

~> scontrol show job 17658
JobId=17658 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=CANCELLED Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T13:00:31 EligibleTime=2011-07-26T13:00:31
   StartTime=2011-07-26T13:01:02 EndTime=2011-07-26T13:01:02
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default


Thanks!
Mark.

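For reference, the negative RunTime in the Ctrl-C'ed case is consistent with a plain EndTime minus StartTime subtraction whose result is then split into fields one at a time. The sketch below is only an illustration of that arithmetic, not the SLURM source: the function names are hypothetical, the times are taken as seconds since midnight, and the "%.2ld" formatting is an assumption chosen because it reproduces the string in the output above.

```c
#include <stdio.h>

/* Hypothetical helper: seconds since midnight for a wall-clock time. */
static long hms_to_secs(long h, long m, long s)
{
	return h * 3600 + m * 60 + s;
}

/* Elapsed seconds as a plain end - start difference, with no clamping. */
static long run_time_secs(long start_secs, long end_secs)
{
	return end_secs - start_secs;
}

/* Format seconds as HH:MM:SS, printing each field separately with a
 * two-digit minimum. C integer division truncates toward zero, so a
 * negative total makes every field negative (assumed formatting). */
static void format_run_time(long secs, char *buf, size_t len)
{
	long hours = secs / 3600;
	long minutes = (secs % 3600) / 60;
	long seconds = secs % 60;

	snprintf(buf, len, "%.2ld:%.2ld:%.2ld", hours, minutes, seconds);
}
```

With StartTime 17:11:57 and EndTime 12:08:25 from job 17654, run_time_secs() returns -18212 and format_run_time() produces "-05:-03:-32", matching the RunTime field shown above.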


diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
index 5307d03..5ecbd13 100644
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -3131,6 +3131,8 @@ extern int job_complete(uint32_t job_id, uid_t uid, bool requeue,
 
 	if (IS_JOB_RUNNING(job_ptr))
 		job_comp_flag = JOB_COMPLETING;
+	else if (IS_JOB_PENDING(job_ptr))
+		job_ptr->start_time = now;
 
 	if ((job_return_code == NO_VAL) &&
 	    (IS_JOB_RUNNING(job_ptr) || IS_JOB_PENDING(job_ptr))) {

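The effect of the patch can be sketched with a toy model. Everything below is a hypothetical stand-in (the struct, the enum, and the function names are not slurmctld code), and it assumes that completion records the end time as "now"; it only shows why resetting the start time of a still-pending job at completion yields a zero run time instead of a negative one.

```c
/* Hypothetical, simplified stand-ins for illustration only. */
enum toy_state { TOY_PENDING, TOY_RUNNING, TOY_COMPLETE };

struct toy_job {
	enum toy_state state;
	long start_time;	/* for a pending job: expected future start */
	long end_time;
};

/* Complete a job at time "now". With with_patch set, a job that is
 * still pending has its start time reset to "now" first, mirroring
 * the two added lines in the diff above. */
static void toy_job_complete(struct toy_job *job, long now, int with_patch)
{
	if (with_patch && job->state == TOY_PENDING)
		job->start_time = now;
	job->state = TOY_COMPLETE;
	job->end_time = now;
}

/* Run time as end minus start, with no clamping. */
static long toy_run_time(const struct toy_job *job)
{
	return job->end_time - job->start_time;
}
```

For a pending job whose expected start lies after "now", toy_run_time() is negative without the reset and 0 with it. A job that was already running keeps its real start time in both cases.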