A number of nodes were allocated with the --no-kill option. One node
failed and was rebooted. Now the user cannot allocate that single node
(the job stays PENDING with Reason=Resources), although sinfo shows the
node as "idle".

In this example, the original allocation was nodes j-[097-128]. Node j-125
failed (memory exhaustion) and was rebooted.

We are still running v. 14.11.7; if this is a known and fixed issue, that
would be good motivation for an upgrade.


From slurm.conf:

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
[...]
PartitionName=18core Nodes=j-[097-128] Default=NO Shared=EXCLUSIVE MaxTime=INFINITE State=UP


Original job:

JobId=4552 JobName=bash
   UserId=fg474admin(1101) GroupId=users(100)
   Priority=4294900519 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=4-21:08:41 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-08T15:15:33 EligibleTime=2017-03-08T15:15:33
   ResizeTime=2017-03-13T12:34:36
   StartTime=2017-03-08T15:16:03 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=18core AllocNode:Sid=j-login1:23750
   ReqNodeList=j-[097-128] ExcNodeList=(null)
   NodeList=j-[097-124,126-128]
   BatchHost=j-097
   NumNodes=31 NumCPUs=1116 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/N/u/fg474admin
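
As a cross-check on the output above, here is a minimal sketch in plain Python (not a Slurm tool) that expands the ReqNodeList and NodeList expressions and reports which requested node is no longer part of the allocation. The helper name expand_hostlist is hypothetical and only handles a single prefix followed by one bracketed list of ranges; on a real system, "scontrol show hostnames" does this expansion properly.

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm-style hostlist such as 'j-[097-124,126-128]'.

    Hypothetical helper: supports only one prefix plus one bracketed
    list of comma-separated ranges, which is all this example needs.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if m is None:
        return [expr]  # plain hostname with no bracket range
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo          # a single number such as '125'
        width = len(lo)        # preserve zero padding, e.g. '097'
        hosts.extend(f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1))
    return hosts

# Values copied from the scontrol output of job 4552.
requested = expand_hostlist("j-[097-128]")          # ReqNodeList
allocated = expand_hostlist("j-[097-124,126-128]")  # NodeList

# Nodes that were requested but are no longer in the allocation.
missing = sorted(set(requested) - set(allocated))
print(len(requested), len(allocated), missing)
```

The allocated count agrees with NumNodes=31 in the job record, and the only node dropped from the allocation is the one that failed.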


Node j-125 failed (memory exhaustion) and was rebooted.


New job requesting that node:

JobId=5099 JobName=bash
   UserId=fg474admin(1101) GroupId=users(100)
   Priority=4294899944 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-13T10:46:25 EligibleTime=2017-03-13T10:46:25
   StartTime=2018-03-08T15:16:03 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=18core AllocNode:Sid=j-login1:141767
   ReqNodeList=j-125 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/N/u/fg474admin


sinfo shows:

PARTITION  AVAIL  TIMELIMIT  NODES   GRES  STATE NODELIST
18core        up   infinite      1 (null)   idle j-125
