In addition to the problem above: OverSubscribe is NO, as described in the documentation, yet in this scenario, even when resources are available, the node is not accepting jobs from the other partition. I even set the same priority for both partitions, but that didn't help. Any suggestions here?
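For reference, the equal-priority attempt looks roughly like this, a minimal sketch based on the partition lines quoted further down (the Priority value of 100 is only an example, not the final setting):

PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Priority=100
PartitionName=large_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Priority=100

followed by "scontrol reconfigure" so that slurmctld picks up the change.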
Slurm Workload Manager - Sharing Consumable Resources:

Two OverSubscribe=NO partitions assigned the same set of nodes:
Jobs from either partition will be assigned to all available consumable resources. No consumable resource will be shared. One node could have 2 jobs running on it, and each job could be from a different partition.

On Tue, Mar 31, 2020 at 4:34 PM navin srivastava <navin.alt...@gmail.com> wrote:

> Hi,
>
> I have an issue with resource allocation.
>
> In the environment I have partitions like below:
>
> PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Priority=8000
> PartitionName=large_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Priority=100
>
> Also, the node has only a few CPUs allocated and plenty of CPU resources still available:
>
> NodeName=Node17 Arch=x86_64 CoresPerSocket=18
>    CPUAlloc=4 CPUErr=0 CPUTot=36 CPULoad=4.09
>    AvailableFeatures=K2200
>    ActiveFeatures=K2200
>    Gres=gpu:2
>    NodeAddr=Node1717 NodeHostName=Node17 Version=17.11
>    OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901)
>    RealMemory=1 AllocMem=0 FreeMem=225552 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=small_jobs,large_jobs
>    BootTime=2020-03-21T18:56:48 SlurmdStartTime=2020-03-31T09:07:03
>    CfgTRES=cpu=36,mem=1M,billing=36
>    AllocTRES=cpu=4
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> There is no other job in the small_jobs partition, but several jobs are pending in large_jobs; the resources are available, yet the jobs are not going through.
>
> The output for one of the pending jobs is:
>
> scontrol show job 1250258
> JobId=1250258 JobName=import_workflow
>    UserId=m209767(100468) GroupId=oled(4289) MCS_label=N/A
>    Priority=363157 Nice=0 Account=oledgrp QOS=normal
>    JobState=PENDING Reason=Priority Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>    SubmitTime=2020-03-28T22:00:13 EligibleTime=2020-03-28T22:00:13
>    StartTime=2070-03-19T11:59:09 EndTime=Unknown Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2020-03-31T12:58:48
>    Partition=large_jobs AllocNode:Sid=deda1x1466:62260
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=1,node=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>
> These are the scheduling settings from my slurm.conf:
>
> SchedulerType=sched/builtin
> #SchedulerParameters=enable_user_top
> SelectType=select/cons_res
> #SelectTypeParameters=CR_Core_Memory
> SelectTypeParameters=CR_Core
>
> Any idea why the job is not going into execution when CPU cores are available?
>
> Also, I would like to know: if jobs are running on a particular node and I restart the slurmd service, in what scenario will my jobs get killed? Generally it should not kill the job.
>
> Regards,
> Navin.
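For reference, a minimal set of checks to confirm that cores are really free and to see each pending job's priority and reason (the node, partition, and job ID here are simply the ones from the output quoted above; adjust as needed):

scontrol show node Node17 | grep -E 'CPUAlloc|CPUTot|State'
scontrol show partition large_jobs
squeue -p large_jobs -t PD -o "%.10i %.9P %.10Q %.8u %.20r"
squeue --start -j 1250258

The squeue commands list each pending job's priority and pending reason, and the scheduler's estimated start time for the stuck job.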