Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Rutger Vos
One job total!

On Wed, Apr 29, 2020 at 10:37 PM Killian Murphy wrote:

> Hi Rutger.
>
> Are you trying to have one job *per user* running in your partition? Or
> just one job total?
>
> Killian


Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Baer, Troy
I don’t think there’s a way to do that in Slurm using just the node 
declaration, other than the previously mentioned way of configuring it to show 
up as having only 1 core.  However, you could put the node in a partition that 
has OverSubscribe=EXCLUSIVE set, and have that partition be the only way to get 
to it:

# in slurm.conf
NodeName=singlejobnode […settings…]
PartitionName=onejobatatime  Nodes=singlejobnode OverSubscribe=EXCLUSIVE
# singlejobnode isn’t in any other partitions.

--Troy
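
For illustration, users would then reach the node only through that partition (job
script name borrowed from earlier in the thread). One caveat: on the 15.08 release
the original poster is running, this partition option may need to be spelled
Shared=EXCLUSIVE rather than OverSubscribe=EXCLUSIVE.

# every job in this partition gets the node to itself; later jobs wait in line
sbatch -p onejobatatime myjob.sh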

Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Doug Meyer
Change node definition in slurm.conf for that one node to 1 CPU.

Doug Meyer
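
As a sketch of that suggestion (node and partition names are hypothetical, not from
this thread), the node entry in slurm.conf would advertise a single CPU, so the
scheduler can only place one single-CPU job on it at a time:

# slurm.conf: the node advertises one CPU, so at most one job fits on it
NodeName=thenode CPUs=1 State=UNKNOWN
PartitionName=single Nodes=thenode Default=YES State=UP

Note the caveat that a job requesting more than one CPU could never start on a node
defined this way.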


Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Killian Murphy
Hi Rutger.

Are you trying to have one job *per user* running in your partition? Or
just one job total?

Killian



-- 
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753



Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Rutger Vos
Hi Michael,

thanks very much for your swift reply. So here we would have to convince
the users they'd have to specify this when submitting, right? I.e. 'sbatch
--exclusive myjob.sh', if I understand correctly. Would there be a way to
simply enforce this, i.e. at the slurm.conf level or something?

Thanks again!

Rutger

On Wed, Apr 29, 2020 at 10:06 PM Renfro, Michael  wrote:

> That’s a *really* old version, but
> https://slurm.schedmd.com/archive/slurm-15.08.13/sbatch.html indicates
> there’s an exclusive flag you can set.


Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Renfro, Michael
That’s a *really* old version, but 
https://slurm.schedmd.com/archive/slurm-15.08.13/sbatch.html indicates there’s 
an exclusive flag you can set.
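
For illustration, that flag is given at submission time (job script name hypothetical):

# request the whole node; anything submitted afterwards waits in line
sbatch --exclusive myjob.sh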


[slurm-users] TensorRT script runs with srun but not from a sbatch file

2020-04-29 Thread Robert Kudyba
I'm using this TensorRT tutorial with MPS on Slurm 20.02 on Bright Cluster 8.2.

Here are the contents of my mpsmovietest sbatch file:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3 \
    cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 \
    tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb \
    keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2

When run under Slurm I get the errors below, so perhaps there is a pathing
issue that does not show up when I run srun alone:
Could not find movielens_ratings.txt in data directories:
data/samples/movielens/
data/movielens/
 FAILED

I’m trying to use srun to test this but it always fails as it appears to be
trying all nodes. We only have 3 compute nodes. As I’m writing this node002
 and node003 are in use by other users so I just want to use node001.

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2 |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |0 |
| N/A   67CP0   241W / 250W |  32167MiB / 32510MiB |100%   E. Process |
+---+--+--+

+-+
| Processes:   GPU Memory |
|  GPU   PID   Type   Process name Usage  |
|=|
|0428996  C   python3.6  32151MiB |
+-+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
 RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check
your CUDA installation:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10]  FAILED
TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my syntax wrong with srun? MPS is running:

$ ps -auwx|grep mps
root 108581  0.0  0.0  12780   812 ?Ssl  Mar23   0:54
/cm/local/apps/cuda-

When node002 is available the program runs correctly, albeit with an error
about the log file failing to write:

srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest
 --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-+
| NVIDIA-SMI 440.33.01Driver Version: 440.33.01CUDA Version: 10.2 |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute M. |
|===+==+==|
|   0  Tesla V100-PCIE...  On   | :3B:00.0 Off |0 |
| N/A   28CP025W / 250W | 41MiB / 32510MiB |  0%   E. Process |
+---+--+--+

+-+
| Processes:   GPU Memory |
|  GPU   PID   Type   Process name Usage  |
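
On the srun syntax: srun treats everything after the executable as arguments to the
program itself, so the resource flags above are most likely never reaching srun. A
hedged sketch of the intended invocation, with the options moved in front of the
script (paths as in the post; the -Z/--no-allocate flag is omitted because it
normally requires privileged access):

srun --nodes=1 --nodelist=node001 --gres=gpu:1 --job-name=MPSMovieTest \
     --output=mpstest.out /home/mydir/mpsmovietest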

Re: [slurm-users] not allocating jobs even resources are free

2020-04-29 Thread Brian W. Johanson

Navin,
Check out 'sprio'; this will show you how job priority changes with the
weight changes you are making.

-b
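
For example (no sample output shown, since it varies by site and Slurm version):

# per-job breakdown of the multifactor priority (age, fairshare, job size, ...)
sprio -l
# the configured weight of each priority factor
sprio -w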


[slurm-users] one job at a time - how to set?

2020-04-29 Thread Rutger Vos
Hi,

for a smallish machine that has been having degraded performance we want to
implement a policy where only one job (submitted with sbatch) is allowed to
run and any others submitted after it are supposed to wait in line.

I assumed this was straightforward but I can't seem to figure it out. Can I
set that up in slurm.conf or in some other way? Thank you very much for
your help. BTW we are running slurm 15.08.7 if that is at all relevant.

Best wishes,

Dr. Rutger A. Vos
Researcher / Bioinformatician
+31717519600 - +31627085806
rutger@naturalis.nl - www.naturalis.nl
Darwinweg 2, 2333 CR Leiden
Postbus 9517, 2300 RA Leiden


[slurm-users] Feature request: SBATCH_NTASKS as input environment variable

2020-04-29 Thread Jaume Zaragoza

Hi all,

There are some sbatch parameters that can be passed as input environment
variables, like SBATCH_PARTITION or SBATCH_TIMELIMIT. But why can't the
number of tasks be passed as well (SBATCH_NTASKS or SLURM_NTASKS)? I've
read in the man page that srun already reads SLURM_NTASKS as input.



Thanks,
Jaume
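
For context, a hedged sketch of the current behaviour (partition name, time limit
and job script are hypothetical): the partition and time limit can already be
injected through the environment, while the task count has to be given explicitly.

# recognised by sbatch today
export SBATCH_PARTITION=short
export SBATCH_TIMELIMIT=30
sbatch job.sh

# no SBATCH_NTASKS equivalent as of this writing; the count must be passed explicitly
sbatch -n 8 job.sh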



Re: [slurm-users] not allocating jobs even resources are free

2020-04-29 Thread navin srivastava
Thanks Daniel.

All jobs went into run state so I am unable to provide the details, but I will
definitely reach out later if we see a similar issue.

I am more interested in understanding FIFO combined with Fair Tree. It would be
good if anybody could provide some insight on this combination, and also on how
the behaviour will change if we enable backfilling here.

What is the role of the Fair Tree here?

PriorityType=priority/multifactor
PriorityDecayHalfLife=2
PriorityUsageResetPeriod=DAILY
PriorityWeightFairshare=50
PriorityFlags=FAIR_TREE

Regards
Navin.
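
For reference, a hedged sketch of what enabling backfill next to the existing
fair-tree settings could look like (the partition line, node list and DefaultTime
value are assumptions, not taken from this thread). Backfill only helps if jobs
carry, or default to, a time limit:

SchedulerType=sched/backfill
PriorityType=priority/multifactor
PriorityDecayHalfLife=2
PriorityUsageResetPeriod=DAILY
PriorityWeightFairshare=50
PriorityFlags=FAIR_TREE
# hypothetical partition line giving jobs a default walltime so backfill can plan
PartitionName=gpusmall Nodes=node[001-004] DefaultTime=04:00:00 State=UP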



On Mon, Apr 27, 2020 at 9:37 PM Daniel Letai  wrote:

> Are you sure there are enough resources available? The node is in mixed
> state, so it's configured for both partitions - it's possible that earlier
> lower priority jobs are already running thus blocking the later jobs,
> especially since it's fifo.
>
>
> It would really help if you pasted the results of:
>
> squeue
>
> sinfo
>
>
> As well as the exact sbatch line, so we can see how many resources per
> node are requested.
>
>
> On 26/04/2020 12:00:06, navin srivastava wrote:
>
> Thanks Brian,
>
> As suggested i gone through document and what i understood  that the fair
> tree leads to the Fairshare mechanism and based on that the job should be
> scheduling.
>
> so it mean job scheduling will be based on FIFO but priority will be
> decided on the Fairshare. i am not sure if both conflicts here.if i see the
> normal jobs priority is lower than the GPUsmall priority. so resources are
> available with gpusmall partition then it should go. there is no job pend
> due to gpu resources. the gpu resources itself not asked with the job.
>
> is there any article where i can see how the fairshare works and which are
> setting should not be conflict with this.
> According to document it never says that if fair-share is applied then
> FIFO should be disabled.
>
> Regards
> Navin.
>
>
>
>
>
> On Sat, Apr 25, 2020 at 12:47 AM Brian W. Johanson 
> wrote:
>
>>
>> If you haven't looked at the man page for slurm.conf, it will answer most
>> if not all your questions.
>> https://slurm.schedmd.com/slurm.conf.html but I would depend on the the
>> manual version that was distributed with the version you have installed as
>> options do change.
>>
>> There is a ton of information that is tedious to get through but reading
>> through it multiple times opens many doors.
>>
>> DefaultTime is listed in there as a Partition option.
>> If you are scheduling gres/gpu resources, it's quite possible there are
>> cores available with no corresponding gpus avail.
>>
>> -b
>>
>> On 4/24/20 2:49 PM, navin srivastava wrote:
>>
>> Thanks Brian.
>>
>> I need  to check the jobs order.
>>
>> Is there  any way to define the default timeline of the job if user  not
>> specifying time limit.
>>
>> Also what does the meaning of fairtree  in priorities in slurm.Conf file.
>>
>> The set of nodes are different in partitions.FIFO  does  not care for
>> any  partitiong.
>> Is it like strict odering means the job came 1st will go and until  it
>> runs it will  not allow others.
>>
>> Also priorities is high for gpusmall partition and low for normal jobs
>> and the nodes of the normal partition is full but gpusmall cores are
>> available.
>>
>> Regards
>> Navin
>>
>> On Fri, Apr 24, 2020, 23:49 Brian W. Johanson  wrote:
>>
>>> Without seeing the jobs in your queue, I would expect the next job in
>>> FIFO order to be too large to fit in the current idle resources.
>>>
>>> Configure it to use the backfill scheduler: SchedulerType=sched/backfill
>>>
>>>   SchedulerType
>>>   Identifies  the type of scheduler to be used.  Note the
>>> slurmctld daemon must be restarted for a change in scheduler type to become
>>> effective (reconfiguring a running daemon has no effect for this
>>> parameter).  The scontrol command can be used to manually change job
>>> priorities if desired.  Acceptable values include:
>>>
>>>   sched/backfill
>>>  For a backfill scheduling module to augment the
>>> default FIFO scheduling.  Backfill scheduling will initiate lower-priority
>>> jobs if doing so does not delay the expected initiation time of any
>>> higher  priority  job.   Effectiveness  of  backfill scheduling is
>>> dependent upon users specifying job time limits, otherwise all jobs will
>>> have the same time limit and backfilling is impossible.  Note documentation
>>> for the SchedulerParameters option above.  This is the default
>>> configuration.
>>>
>>>   sched/builtin
>>>  This  is  the  FIFO scheduler which initiates jobs
>>> in priority order.  If any job in the partition can not be scheduled, no
>>> lower priority job in that partition will be scheduled.  An exception is
>>> made for jobs that can not run due to partition constraints (e.g. the time
>>> limit) or down/drained nodes.  In that case, lower priority jobs can be
>>> initiated and not impact the higher priority job.