Hi there,

We're running SLURM 2.2.7 and we've found some odd behaviour of salloc that I wanted to ask about:

If a user asks for an allocation using salloc that ends up pending because the machine is full, and they then Ctrl-C that salloc, the resulting job ends up in a "COMPLETED" state with a negative RunTime, because the StartTime was set to some time in the future (and the EndTime is the time that the salloc was Ctrl-C'ed).
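
For what it's worth, the negative RunTime looks consistent with a plain EndTime minus StartTime subtraction. Here is a minimal Python sketch of that arithmetic, using the timestamps reported for job 17654. The helper name run_time_fields and the per-field formatting are assumptions for illustration only, not SLURM's actual code:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"  # timestamp layout printed by scontrol show job


def run_time_fields(start: str, end: str) -> str:
    """Format (end - start) as HH:MM:SS, carrying the sign on every field.

    Hypothetical helper: it only illustrates how a naive subtraction with
    StartTime in the future produces a value such as -05:-03:-32.
    """
    delta = int((datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds())
    negative = delta < 0
    hours, rem = divmod(abs(delta), 3600)
    minutes, seconds = divmod(rem, 60)
    prefix = "-" if negative else ""
    return ":".join(f"{prefix}{value:02d}" for value in (hours, minutes, seconds))


# Ctrl-C'ed salloc: StartTime is still the future estimate, EndTime is "now".
print(run_time_fields("2011-07-26T17:11:57", "2011-07-26T12:08:25"))

# scancel case: StartTime was reset to EndTime, so the difference is zero.
print(run_time_fields("2011-07-26T13:01:02", "2011-07-26T13:01:02"))
```

If that is what is happening, resetting StartTime to EndTime on the SIGINT path (as the scancel path appears to do) would bring RunTime back to 0.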

This isn't seen if the salloc is cancelled externally with scancel. In that case the JobState ends up being CANCELLED with a RunTime of 0, and StartTime is set equal to EndTime (the time of the scancel).

So it seems that when salloc gets the SIGINT and exits, it doesn't set the StartTime to be equal to the EndTime (or set RunTime to 0). Should it do this?

I haven't reproduced it with SLURM 2.3 yet, but I'm pretty sure we saw this behaviour in SLURM 2.1.x as well. I also haven't started digging into the code yet, but I figured I'd ask first in case it's an obvious fix for someone more familiar with the code.

Here's the output of scontrol show job for the two cases outlined above:

# the Ctrl-C'ed case:

~> salloc -N 128 --time=20:0
salloc: Pending job allocation 17654
salloc: job 17654 queued and waiting for resources
salloc: Job aborted due to signal
salloc: Job allocation 17654 has been revoked.

~> scontrol show job 17654
JobId=17654 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T12:08:14 EligibleTime=2011-07-26T12:08:14
   StartTime=2011-07-26T17:11:57 EndTime=Unknown
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default

~> scontrol show job 17654
JobId=17654 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=COMPLETED Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=-05:-03:-32 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T12:08:14 EligibleTime=2011-07-26T12:08:14
   StartTime=2011-07-26T17:11:57 EndTime=2011-07-26T12:08:25
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default

# the scancel case:

~> salloc -N 128 --time=20:0
salloc: Pending job allocation 17658
salloc: job 17658 queued and waiting for resources
salloc: Job allocation 17658 has been revoked.
salloc: Job has been cancelled
salloc: error: Failed to allocate resources: No error

~> scontrol show job 17658
JobId=17658 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T13:00:31 EligibleTime=2011-07-26T13:00:31
   StartTime=2011-07-26T17:23:47 EndTime=Unknown
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default

~> scancel 17658
~> scontrol show job 17658
JobId=17658 Name=bash
   UserId=markn(589) GroupId=ibm(502)
   Priority=2 Account=ibm QOS=normal
   JobState=CANCELLED Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2011-07-26T13:00:31 EligibleTime=2011-07-26T13:00:31
   StartTime=2011-07-26T13:01:02 EndTime=2011-07-26T13:01:02
   SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=tambo:7313
   ReqBP_List=(null) ExcBP_List=(null)
   BP_List=(null)
   NumNodes=128-128 NumCPUs=512-512 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/vlsci/IBM/markn/gromacs/workshop/d.dppc
   Block_ID=unassigned
   Connection=Small Reboot=no Rotate=yes Geometry=0x0x0
   CnloadImage=default
   MloaderImage=default
   IoloadImage=default

Thanks!
Mark.
