[slurm-users] not allocating the node for job execution even resources are available.

navin srivastava Tue, 31 Mar 2020 04:07:54 -0700

Hi,

I have an issue with resource allocation.

In our environment the partitions are defined as follows:

PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Priority=8000
PartitionName=large_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Priority=100

Also, the node has only a few CPUs allocated and plenty of CPUs still available:

NodeName=Node17 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=4 CPUErr=0 CPUTot=36 CPULoad=4.09
   AvailableFeatures=K2200
   ActiveFeatures=K2200
   Gres=gpu:2
   NodeAddr=Node1717 NodeHostName=Node17 Version=17.11
   OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901)
   RealMemory=1 AllocMem=0 FreeMem=225552 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=small_jobs,large_jobs
   BootTime=2020-03-21T18:56:48 SlurmdStartTime=2020-03-31T09:07:03
   CfgTRES=cpu=36,mem=1M,billing=36
   AllocTRES=cpu=4
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
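
To make the "plenty of CPUs available" claim concrete, here is a minimal Python sketch that reads the CPUAlloc and CPUTot fields from the node record above and reports the difference. The helper names `parse_record` and `free_cpus` are made up for illustration and are not part of Slurm; the sketch also assumes simple whitespace-separated Key=Value tokens, so fields whose values contain spaces (such as OS=) are left out of the excerpt.

```python
# Excerpt of the `scontrol show node` output quoted above.
NODE_RECORD = """
NodeName=Node17 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=4 CPUErr=0 CPUTot=36 CPULoad=4.09
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=small_jobs,large_jobs
"""


def parse_record(text):
    """Split whitespace-separated Key=Value tokens into a dict.

    Hypothetical helper: tokens without '=' are skipped, and values
    that contain spaces are not handled.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def free_cpus(fields):
    """CPUs configured on the node minus CPUs currently allocated."""
    return int(fields["CPUTot"]) - int(fields["CPUAlloc"])


node = parse_record(NODE_RECORD)
print(node["NodeName"], node["State"], "free CPUs:", free_cpus(node))
```

With the values shown above (CPUTot=36, CPUAlloc=4), this reports 32 unallocated CPUs on Node17.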

There are no other jobs in the small_jobs partition, but several jobs are
pending in large_jobs. The resources are available, yet those jobs are not
starting.

The output for one of the pending jobs is:

scontrol show job 1250258
   JobId=1250258 JobName=import_workflow
   UserId=m209767(100468) GroupId=oled(4289) MCS_label=N/A
   Priority=363157 Nice=0 Account=oledgrp QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-03-28T22:00:13 EligibleTime=2020-03-28T22:00:13
   StartTime=2070-03-19T11:59:09 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-03-31T12:58:48
   Partition=large_jobs AllocNode:Sid=deda1x1466:62260
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
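
The fields that matter for this question can be pulled out of the job record with a short Python sketch. The helper names `field` and `summarize` are hypothetical, and the record is an excerpt of the output above. Note that the job asks for a single CPU and is held with Reason=Priority, which the squeue documentation describes as one or more higher-priority jobs existing for the partition, rather than with a resource-related reason.

```python
import re

# Excerpt of the `scontrol show job 1250258` output quoted above.
JOB_RECORD = """
   JobId=1250258 JobName=import_workflow
   Priority=363157 Nice=0 Account=oledgrp QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   StartTime=2070-03-19T11:59:09 EndTime=Unknown Deadline=N/A
   Partition=large_jobs AllocNode:Sid=deda1x1466:62260
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
"""


def field(text, name):
    """Return the value of the first `name=value` token, or None.

    Hypothetical helper: the lookbehind makes sure `name` starts a
    token, so "Priority" does not match inside "Reason=Priority".
    """
    match = re.search(r"(?<!\S)" + re.escape(name) + r"=(\S+)", text)
    return match.group(1) if match else None


def summarize(text):
    """Collect the fields relevant to why the job is still pending."""
    return {
        "job_id": field(text, "JobId"),
        "partition": field(text, "Partition"),
        "state": field(text, "JobState"),
        "reason": field(text, "Reason"),
        "priority": int(field(text, "Priority")),
        "cpus": int(field(text, "NumCPUs")),
    }


summary = summarize(JOB_RECORD)
for key, value in summary.items():
    print(f"{key}: {value}")
```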

These are the scheduling settings in my slurm.conf file:


SchedulerType=sched/builtin
#SchedulerParameters=enable_user_top
SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_Core


Any idea why the job is not starting even though CPU cores are available?

I would also like to know: if jobs are running on a particular node and I
restart the slurmd service, in what scenario would those jobs get killed?
Generally, a restart should not kill the jobs.

Regards
Navin.
