Thanks Daniel for the detailed description.

Regards,
Navin
On Sun, May 3, 2020, 13:35 Daniel Letai <d...@letai.org.il> wrote:

> On 29/04/2020 12:00:13, navin srivastava wrote:
>
> Thanks Daniel.
>
> All jobs went into the run state, so I am unable to provide the details,
> but I will definitely reach out later if we see a similar issue.
>
> I am more interested in understanding FIFO with Fair Tree. It would be
> good if anybody could provide some insight on this combination, and also
> how the behaviour will change if we enable backfilling here.
>
> What is the role of the Fair Tree here?
>
> Fair Tree is the algorithm used to calculate the interim priority, before
> applying the weights, but I think after the half-life decay.
>
>
> To make it simple - FIFO without fairshare would assign priority based
> only on submission time. With fairshare, that naive priority is adjusted
> based on prior usage by the applicable entities (users/departments -
> accounts).
>
>
> Backfill will let you utilize your resources better, since it allows
> "inserting" low-priority jobs before higher-priority jobs, provided all
> jobs have defined wall times and any inserted job doesn't affect in any
> way the start time of a higher-priority job. This allows utilization of
> "holes" while the scheduler waits for resources to free up in order to
> start some large job.
>
>
> Suppose the system is at 60% utilization of cores and the next FIFO job
> requires 42% - it will wait until 2% are free so it can begin, meanwhile
> not allowing any job to start, even one that would take only 30% of the
> resources (which are currently free) and would finish before the 2% are
> free anyway.
>
> Backfill would allow such a job to start, as long as its wall time
> ensures it would finish before the 42% job would have started.
>
>
> Fair Tree in either case (FIFO or backfill) calculates the priority for
> each job the same way - if an account has used more resources recently
> (the half-life decay factor), its jobs get a lower priority, even if they
> were submitted earlier than a job from an account that didn't use any
> resources recently.
>
>
> As can be expected, backfill has to loop over all jobs in the queue in
> order to see if any job can fit out of order. In very busy/active
> systems, that can lead to poor response times unless tuned correctly in
> slurm.conf - look at SchedulerParameters, all params starting with bf_,
> and in particular bf_max_job_test=, bf_max_time= and bf_continue (but
> bf_window= can also have some impact if set too high).
>
> See the man page at
> https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters
>
>
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=2
> PriorityUsageResetPeriod=DAILY
> PriorityWeightFairshare=500000
> PriorityFlags=FAIR_TREE
>
> Regards
> Navin.
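[A minimal slurm.conf sketch of the combination Daniel describes above -
backfill plus multifactor priority with Fair Tree. The specific bf_ values
are illustrative assumptions only, not recommendations; the right numbers
depend on queue depth and job mix:

    SchedulerType=sched/backfill
    # bf_max_job_test: cap on jobs examined per backfill cycle (assumed value)
    # bf_max_time: cap, in seconds, on time spent per backfill cycle (assumed value)
    # bf_continue: let an interrupted backfill cycle resume rather than restart
    # bf_window: how far into the future (minutes) to plan; costly if set too high
    SchedulerParameters=bf_continue,bf_max_job_test=500,bf_max_time=30,bf_window=1440
    PriorityType=priority/multifactor
    PriorityFlags=FAIR_TREE
]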
> On Mon, Apr 27, 2020 at 9:37 PM Daniel Letai <d...@letai.org.il> wrote:
>
>> Are you sure there are enough resources available? The node is in mixed
>> state, so it's configured for both partitions - it's possible that
>> earlier, lower-priority jobs are already running and thus blocking the
>> later jobs, especially since it's FIFO.
>>
>> It would really help if you pasted the results of:
>>
>> squeue
>>
>> sinfo
>>
>> As well as the exact sbatch line, so we can see how many resources per
>> node are requested.
>>
>>
>> On 26/04/2020 12:00:06, navin srivastava wrote:
>>
>> Thanks Brian,
>>
>> As suggested, I went through the document, and what I understood is that
>> Fair Tree feeds the fairshare mechanism, and jobs should be scheduled
>> based on that.
>>
>> So it means job scheduling will be based on FIFO, but priority will be
>> decided by fairshare. I am not sure whether the two conflict here. I see
>> that the normal jobs' priority is lower than the GPUsmall priority, and
>> resources are available in the GPUsmall partition, so those jobs should
>> go. No job is pending due to GPU resources - the jobs do not request GPU
>> resources at all.
>>
>> Is there any article where I can see how fairshare works and which
>> settings would conflict with it? The documentation never says that if
>> fairshare is applied then FIFO should be disabled.
>>
>> Regards
>> Navin.
>>
>>
>> On Sat, Apr 25, 2020 at 12:47 AM Brian W. Johanson <bjoha...@psc.edu>
>> wrote:
>>
>>> If you haven't looked at the man page for slurm.conf, it will answer
>>> most if not all of your questions:
>>> https://slurm.schedmd.com/slurm.conf.html - but I would depend on the
>>> man page distributed with the version you have installed, as options
>>> do change.
>>>
>>> There is a ton of information that is tedious to get through, but
>>> reading through it multiple times opens many doors.
>>>
>>> DefaultTime is listed in there as a Partition option.
>>> If you are scheduling gres/gpu resources, it's quite possible there
>>> are cores available with no corresponding gpus available.
>>>
>>> -b
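[A minimal sketch of Brian's DefaultTime suggestion as a partition line in
slurm.conf. The time values and node list are assumptions for illustration,
not taken from the attached configuration:

    # Jobs submitted without --time inherit DefaultTime, giving the
    # backfill scheduler a wall time it can plan around.
    PartitionName=GPUsmall Nodes=node[18-19] DefaultTime=04:00:00 MaxTime=7-00:00:00 State=UP
]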
>>> On 4/24/20 2:49 PM, navin srivastava wrote:
>>>
>>> Thanks Brian.
>>>
>>> I need to check the job order.
>>>
>>> Is there any way to define a default time limit for a job if the user
>>> does not specify one?
>>>
>>> Also, what is the meaning of fairtree among the priority settings in
>>> the slurm.conf file?
>>>
>>> The sets of nodes in the partitions are different, and FIFO does not
>>> care about any partitioning. Is it strict ordering, meaning the job
>>> that came first will go, and until it runs no others are allowed?
>>>
>>> Also, priority is high for the GPUsmall partition and low for normal
>>> jobs; the nodes of the normal partition are full, but GPUsmall cores
>>> are available.
>>>
>>> Regards
>>> Navin
>>>
>>> On Fri, Apr 24, 2020, 23:49 Brian W. Johanson <bjoha...@psc.edu> wrote:
>>>
>>>> Without seeing the jobs in your queue, I would expect the next job in
>>>> FIFO order to be too large to fit in the currently idle resources.
>>>>
>>>> Configure it to use the backfill scheduler:
>>>> SchedulerType=sched/backfill
>>>>
>>>> SchedulerType
>>>>     Identifies the type of scheduler to be used. Note the slurmctld
>>>> daemon must be restarted for a change in scheduler type to become
>>>> effective (reconfiguring a running daemon has no effect for this
>>>> parameter). The scontrol command can be used to manually change job
>>>> priorities if desired. Acceptable values include:
>>>>
>>>>     sched/backfill
>>>>         For a backfill scheduling module to augment the default FIFO
>>>> scheduling. Backfill scheduling will initiate lower-priority jobs if
>>>> doing so does not delay the expected initiation time of any higher
>>>> priority job. Effectiveness of backfill scheduling is dependent upon
>>>> users specifying job time limits, otherwise all jobs will have the
>>>> same time limit and backfilling is impossible. Note documentation for
>>>> the SchedulerParameters option above. This is the default
>>>> configuration.
>>>>
>>>>     sched/builtin
>>>>         This is the FIFO scheduler which initiates jobs in priority
>>>> order. If any job in the partition can not be scheduled, no lower
>>>> priority job in that partition will be scheduled. An exception is
>>>> made for jobs that can not run due to partition constraints (e.g. the
>>>> time limit) or down/drained nodes. In that case, lower priority jobs
>>>> can be initiated and not impact the higher priority job.
>>>>
>>>> Your partitions are set with MaxTime=INFINITE; if your users are not
>>>> specifying a reasonable time limit on their jobs, this won't help
>>>> either.
>>>>
>>>> -b
>>>>
>>>> On 4/24/20 1:52 PM, navin srivastava wrote:
>>>>
>>>> In addition to the above, when I look at sprio for jobs in both
>>>> partitions, it shows the following.
>>>>
>>>> For the normal queue, all jobs show the same priority:
>>>>
>>>>   JOBID    PARTITION  PRIORITY  FAIRSHARE
>>>>   1291352  normal     15789     15789
>>>>
>>>> For GPUsmall, all jobs show the same priority:
>>>>
>>>>   JOBID    PARTITION  PRIORITY  FAIRSHARE
>>>>   1291339  GPUsmall   21052     21053
>>>>
>>>> On Fri, Apr 24, 2020 at 11:14 PM navin srivastava <
>>>> navin.alt...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> We are facing an issue in our environment. Resources are free, but
>>>>> jobs go into the queued (PD) state and do not run.
>>>>>
>>>>> I have attached the slurm.conf file here.
>>>>>
>>>>> Scenario:
>>>>>
>>>>> There are pending jobs in only 2 partitions:
>>>>> 344 jobs are in PD state in the normal partition; the nodes
>>>>> belonging to the normal partition are full, and no more jobs can run
>>>>> there.
>>>>>
>>>>> 1300 jobs in the GPUsmall partition are queued; enough CPUs are
>>>>> available to execute them, but the jobs are not being scheduled on
>>>>> the free nodes.
>>>>>
>>>>> There are no pending jobs in any other partition.
>>>>> E.g., node status for node18:
>>>>>
>>>>> NodeName=node18 Arch=x86_64 CoresPerSocket=18
>>>>>    CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
>>>>>    AvailableFeatures=K2200
>>>>>    ActiveFeatures=K2200
>>>>>    Gres=gpu:2
>>>>>    NodeAddr=node18 NodeHostName=node18 Version=17.11
>>>>>    OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC
>>>>> 2018 (0b375e4)
>>>>>    RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
>>>>>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>>>>> MCS_label=N/A
>>>>>    Partitions=GPUsmall,pm_shared
>>>>>    BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
>>>>>    CfgTRES=cpu=36,mem=1M,billing=36
>>>>>    AllocTRES=cpu=6
>>>>>    CapWatts=n/a
>>>>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>>
>>>>> node19:
>>>>>
>>>>> NodeName=node19 Arch=x86_64 CoresPerSocket=18
>>>>>    CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
>>>>>    AvailableFeatures=K2200
>>>>>    ActiveFeatures=K2200
>>>>>    Gres=gpu:2
>>>>>    NodeAddr=node19 NodeHostName=node19 Version=17.11
>>>>>    OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC
>>>>> 2018 (3090901)
>>>>>    RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
>>>>>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>>>>> MCS_label=N/A
>>>>>    Partitions=GPUsmall,pm_shared
>>>>>    BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
>>>>>    CfgTRES=cpu=36,mem=1M,billing=36
>>>>>    AllocTRES=cpu=16
>>>>>    CapWatts=n/a
>>>>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>>
>>>>> Could you please help me understand what the reason could be?
>>>>>
>>
>> --
>> Regards,
>> Daniel Letai
>> +972 (0)505 870 456
>
> --
> Regards,
> Daniel Letai
> +972 (0)505 870 456
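[A generic diagnostic sketch for this kind of "resources free but jobs
pending" situation, using only stock command options - nothing here is
site-specific:

    # Pending jobs with the scheduler's reason code (Priority, Resources, ...)
    squeue -t PD -o "%.10i %.12P %.8u %.2t %.12r"

    # Per-factor priority breakdown, as in the sprio output above
    sprio -l

    # Per-node state, to spot idle cores alongside allocated ones
    sinfo -N -l
]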