A number of nodes were allocated with the --no-kill option. One node failed and was rebooted. Now the user cannot allocate that single node (the job stays in state PENDING with Reason=Resources), although sinfo shows the node as "idle".
In this example, the original allocation was nodes j-[097-128]. Node j-125 failed (memory exhaustion) and was rebooted. We are still running v. 14.11.7; if this is a known/fixed issue, that would be good motivation for an upgrade.

From slurm.conf:

   # SCHEDULING
   FastSchedule=1
   SchedulerType=sched/backfill
   SchedulerPort=7321
   SelectType=select/cons_res
   SelectTypeParameters=CR_CPU
   [...]
   PartitionName=18core Nodes=j-[097-128] Default=NO Shared=EXCLUSIVE MaxTime=INFINITE State=UP

Original job:

JobId=4552 JobName=bash
   UserId=fg474admin(1101) GroupId=users(100)
   Priority=4294900519 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=4-21:08:41 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-08T15:15:33 EligibleTime=2017-03-08T15:15:33
   ResizeTime=2017-03-13T12:34:36
   StartTime=2017-03-08T15:16:03 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=18core AllocNode:Sid=j-login1:23750
   ReqNodeList=j-[097-128] ExcNodeList=(null)
   NodeList=j-[097-124,126-128]
   BatchHost=j-097
   NumNodes=31 NumCPUs=1116 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/N/u/fg474admin

Node j-125 then failed (memory exhaustion) and was rebooted.

Now the job requesting that node:

JobId=5099 JobName=bash
   UserId=fg474admin(1101) GroupId=users(100)
   Priority=4294899944 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-13T10:46:25 EligibleTime=2017-03-13T10:46:25
   StartTime=2018-03-08T15:16:03 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=18core AllocNode:Sid=j-login1:141767
   ReqNodeList=j-125 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/N/u/fg474admin

sinfo shows:

PARTITION AVAIL TIMELIMIT NODES GRES   STATE NODELIST
18core    up    infinite  1     (null) idle  j-125
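
For anyone who wants to double-check the node lists above, here is a minimal Python sketch that expands the ReqNodeList and NodeList values reported for job 4552 and prints which requested nodes are no longer part of the allocation. The helper name expand_hostlist is my own and only handles the simple "prefix[ranges]" form seen in this post; on a real cluster, "scontrol show hostnames" does the expansion properly.

```python
import re


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'j-[097-124,126-128]'.

    Hypothetical helper for illustration: supports one bracket group
    at the end of the name, or a plain hostname. Not a full parser.
    """
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]", expr)
    if m is None:
        return [expr]  # plain hostname, nothing to expand
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo          # single value such as '125'
        width = len(lo)        # keep zero padding, e.g. '097'
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


# Values copied from the scontrol output for job 4552.
requested = expand_hostlist("j-[097-128]")
allocated = expand_hostlist("j-[097-124,126-128]")
missing = sorted(set(requested) - set(allocated))

print(len(requested), len(allocated), missing)
```

With the values from this post, it reports 32 requested nodes, 31 allocated nodes (matching NumNodes=31), and j-125 as the only node dropped from the running allocation.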