Re: [slurm-users] not allocating the node for job execution even resources are available.

2020-04-01 Thread navin srivastava
In addition to the above problem: OverSubscribe is NO, so according to the
documentation quoted below, in this scenario it is not accepting jobs from the
other partition even when resources are available. I even set the same
priority for both partitions, but it didn't help. Any suggestions here?

From the Slurm Workload Manager documentation, "Sharing Consumable Resources":
Two OverSubscribe=NO partitions assigned the same set of nodes: jobs from
either partition will be assigned to all available consumable resources. No
consumable resource will be shared. One node could have 2 jobs running on
it, and each job could be from a different partition.
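
For reference, here is a rough sketch of that documented case with our node
set plugged in (illustrative only; my actual partitions use the
Shared=/Priority= form quoted below):

# Hypothetical sketch: two OverSubscribe=NO partitions sharing the same nodes
PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP OverSubscribe=NO Priority=8000
PartitionName=large_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP OverSubscribe=NO Priority=100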

On Tue, Mar 31, 2020 at 4:34 PM navin srivastava wrote:

> Hi,
>
> I have an issue with resource allocation.
>
> The environment has partitions defined as below:
>
> PartitionName=small_jobs Nodes=Node[17,20]  Default=NO MaxTime=INFINITE
> State=UP Shared=YES Priority=8000
> PartitionName=large_jobs Nodes=Node[17,20]  Default=NO MaxTime=INFINITE
> State=UP Shared=YES Priority=100
>
> Also, the node has only a few CPUs allocated and plenty of CPU resources available:
>
> NodeName=Node17 Arch=x86_64 CoresPerSocket=18
>CPUAlloc=4 CPUErr=0 CPUTot=36 CPULoad=4.09
>AvailableFeatures=K2200
>ActiveFeatures=K2200
>Gres=gpu:2
>NodeAddr=Node1717 NodeHostName=Node17 Version=17.11
>OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018
> (3090901)
>RealMemory=1 AllocMem=0 FreeMem=225552 Sockets=2 Boards=1
>State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>Partitions=small_jobs,large_jobs
>BootTime=2020-03-21T18:56:48 SlurmdStartTime=2020-03-31T09:07:03
>CfgTRES=cpu=36,mem=1M,billing=36
>AllocTRES=cpu=4
>CapWatts=n/a
>CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> There is no other job in the small_jobs partition, but several jobs are
> pending in large_jobs even though resources are available; the jobs are not
> going through.
>
> The output for one of the pending jobs is:
>
> scontrol show job 1250258
>JobId=1250258 JobName=import_workflow
>UserId=m209767(100468) GroupId=oled(4289) MCS_label=N/A
>Priority=363157 Nice=0 Account=oledgrp QOS=normal
>JobState=PENDING Reason=Priority Dependency=(null)
>Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>SubmitTime=2020-03-28T22:00:13 EligibleTime=2020-03-28T22:00:13
>StartTime=2070-03-19T11:59:09 EndTime=Unknown Deadline=N/A
>PreemptTime=None SuspendTime=None SecsPreSuspend=0
>LastSchedEval=2020-03-31T12:58:48
>Partition=large_jobs AllocNode:Sid=deda1x1466:62260
>ReqNodeList=(null) ExcNodeList=(null)
>NodeList=(null)
>NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>TRES=cpu=1,node=1
>Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>Features=(null) DelayBoot=00:00:00
>Gres=(null) Reservation=(null)
>OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>
> This is the scheduling section of my slurm.conf:
>
>
> SchedulerType=sched/builtin
> #SchedulerParameters=enable_user_top
> SelectType=select/cons_res
> #SelectTypeParameters=CR_Core_Memory
> SelectTypeParameters=CR_Core
>
>
> Any idea why the job is not going for execution even though CPU cores are available?
>
> I would also like to know: if jobs are running on a particular node and I
> restart the slurmd service, in what scenario would my jobs get killed?
> Generally, it should not kill the jobs.
>
> Regards
> Navin.


[slurm-users] not allocating the node for job execution even resources are available.

2020-03-31 Thread navin srivastava
Hi,

I have an issue with resource allocation.

The environment has partitions defined as below:

PartitionName=small_jobs Nodes=Node[17,20]  Default=NO MaxTime=INFINITE
State=UP Shared=YES Priority=8000
PartitionName=large_jobs Nodes=Node[17,20]  Default=NO MaxTime=INFINITE
State=UP Shared=YES Priority=100

Also, the node has only a few CPUs allocated and plenty of CPU resources available:

NodeName=Node17 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=4 CPUErr=0 CPUTot=36 CPULoad=4.09
   AvailableFeatures=K2200
   ActiveFeatures=K2200
   Gres=gpu:2
   NodeAddr=Node1717 NodeHostName=Node17 Version=17.11
   OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018
(3090901)
   RealMemory=1 AllocMem=0 FreeMem=225552 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=small_jobs,large_jobs
   BootTime=2020-03-21T18:56:48 SlurmdStartTime=2020-03-31T09:07:03
   CfgTRES=cpu=36,mem=1M,billing=36
   AllocTRES=cpu=4
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
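
If it helps, the allocated/idle/other/total CPU counts per node can be
listed with something like the following (illustrative invocation; adjust
the node list as needed):

sinfo -N -n Node[17,20] -o "%N %P %C"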

There is no other job in the small_jobs partition, but several jobs are
pending in large_jobs even though resources are available; the jobs are not
going through.

The output for one of the pending jobs is:

scontrol show job 1250258
   JobId=1250258 JobName=import_workflow
   UserId=m209767(100468) GroupId=oled(4289) MCS_label=N/A
   Priority=363157 Nice=0 Account=oledgrp QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-03-28T22:00:13 EligibleTime=2020-03-28T22:00:13
   StartTime=2070-03-19T11:59:09 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-03-31T12:58:48
   Partition=large_jobs AllocNode:Sid=deda1x1466:62260
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
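
The pending jobs and their reasons are listed with something along these
lines (format string is only an example):

squeue -t PD -p large_jobs -o "%.12i %.10P %.8u %.2t %r"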

This is the scheduling section of my slurm.conf:


SchedulerType=sched/builtin
#SchedulerParameters=enable_user_top
SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_Core
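
The running controller reports the same values; I checked with something
like the following (output trimmed):

scontrol show config | grep -Ei 'SchedulerType|SelectType'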


Any idea why the job is not going for execution even though CPU cores are available?

I would also like to know: if jobs are running on a particular node and I
restart the slurmd service, in what scenario would my jobs get killed?
Generally, it should not kill the jobs.
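
What I plan to do is simply restart the daemon on the node and confirm the
jobs are still listed there, roughly like this (assuming slurmd runs as a
systemd unit named slurmd):

systemctl restart slurmd
squeue -w Node17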

Regards
Navin.