[slurm-users] SLURM Install

2020-07-13 Thread Bas van der Vlies
This is maybe a little off topic, but I saw some posts on this list about Slurm
installations. At SURF(sara) we have developed a services framework for
CFEngine 3 that installs, configures, and maintains software with the aid of
templates (mustache) and JSON files. Everything can be configured with external
data files. We have written services for
 * slurm (packages or tarball installation) 
 * munge
 * nhc
 * enroot (NVIDIA container software)
 * jupyterhub (tarball installation)
 * many others ...

More info available at:
 * https://github.com/basvandervlies/cf_surfsara_lib
 * https://github.com/basvandervlies/cf_surfsara_lib/blob/master/doc/services.md
--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG  
Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervl...@surf.nl | www.surf.nl |








Re: [slurm-users] CPU allocation for the GPU jobs.

2020-07-13 Thread navin srivastava
Thanks Renfro. My scheduling policy is below.

SchedulerType=sched/builtin
SelectType=select/cons_res
SelectTypeParameters=CR_Core
AccountingStorageEnforce=associations
AccountingStorageHost=192.168.150.223
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=hpc
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=5
SlurmdDebug=5
Waittime=0
Epilog=/etc/slurm/slurm.epilog.clean
GresTypes=gpu
MaxJobCount=500
SchedulerParameters=enable_user_top,default_queue_depth=100

# JOB PRIORITY
PriorityType=priority/multifactor
PriorityDecayHalfLife=2
PriorityUsageResetPeriod=DAILY
PriorityWeightFairshare=50
PriorityFlags=FAIR_TREE

Let me try changing it to the backfill scheduler and see if it helps.
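
For reference, a minimal sketch of that change in slurm.conf (the bf_* entries
are optional tuning knobs with illustrative values, not required settings):

# switch from the builtin to the backfill scheduler
SchedulerType=sched/backfill
# existing parameters can stay; bf_window/bf_resolution shown only as examples
SchedulerParameters=enable_user_top,default_queue_depth=100,bf_window=4320,bf_resolution=300
# backfill only helps if jobs (or partitions via DefaultTime) carry realistic wall-time limits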


Regards
Navin.





On Mon, Jul 13, 2020 at 5:16 PM Renfro, Michael  wrote:

> “The *SchedulerType* configuration parameter specifies the scheduler
> plugin to use. Options are sched/backfill, which performs backfill
> scheduling, and sched/builtin, which attempts to schedule jobs in a strict
> priority order within each partition/queue.”
>
> https://slurm.schedmd.com/sched_config.html
>
> If you’re using the builtin scheduler, lower priority jobs have no way to
> run ahead of higher priority jobs. If you’re using the backfill scheduler,
> your jobs will need specific wall times specified, since the idea with
> backfill is to run lower priority jobs ahead of time if and only if they
> can complete without delaying the estimated start time of higher priority
> jobs.
>
> On Jul 13, 2020, at 4:18 AM, navin srivastava 
> wrote:
>
> Hi Team,
>
> We have separate partitions for the GPU nodes and the CPU-only nodes.
>
> Scenario: the jobs submitted in our environment request either 4 CPUs + 1 GPU
> or 4 CPUs only, in the nodeGPUsmall and nodeGPUbig partitions. When all the
> GPUs are exhausted, the remaining GPU jobs sit in the queue waiting for GPU
> resources to become available. Jobs submitted with only CPUs do not go
> through either, even though plenty of CPU resources are available: the
> CPU-only jobs also stay pending behind these GPU-based jobs (the priority of
> the GPU jobs is higher than that of the CPU ones).
>
> Is there any option so that, when all GPU resources are exhausted, the
> CPU-only jobs are still allowed to run? Is there a way to deal with this, or
> some custom solution we could consider? There is no issue with the CPU-only
> partitions.
>
> Below is my Slurm configuration file
>
>
> NodeName=node[1-12] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10
> RealMemory=128833 State=UNKNOWN
> NodeName=node[13-16] NodeAddr=node[13-16] Sockets=2 CoresPerSocket=10
> RealMemory=515954 Feature=HIGHMEM State=UNKNOWN
> NodeName=node[28-32]  NodeAddr=node[28-32] Sockets=2 CoresPerSocket=28
> RealMemory=257389
> NodeName=node[32-33]  NodeAddr=node[32-33] Sockets=2 CoresPerSocket=24
> RealMemory=773418
> NodeName=node[17-27]  NodeAddr=node[17-27] Sockets=2 CoresPerSocket=18
> RealMemory=257687 Feature=K2200 Gres=gpu:2
> NodeName=node[34]  NodeAddr=node34 Sockets=2 CoresPerSocket=24
> RealMemory=773410 Feature=RTX Gres=gpu:8
>
>
> PartitionName=node Nodes=node[1-10,14-16,28-33,35]  Default=YES
> MaxTime=INFINITE State=UP Shared=YES
> PartitionName=nodeGPUsmall Nodes=node[17-27]  Default=NO MaxTime=INFINITE
> State=UP Shared=YES
> PartitionName=nodeGPUbig Nodes=node[34]  Default=NO MaxTime=INFINITE
> State=UP Shared=YES
>
> Regards
> Navin.
>
>
>


Re: [slurm-users] CPU allocation for the GPU jobs.

2020-07-13 Thread Renfro, Michael
“The SchedulerType configuration parameter specifies the scheduler plugin to 
use. Options are sched/backfill, which performs backfill scheduling, and 
sched/builtin, which attempts to schedule jobs in a strict priority order 
within each partition/queue.”

https://slurm.schedmd.com/sched_config.html

If you’re using the builtin scheduler, lower priority jobs have no way to run 
ahead of higher priority jobs. If you’re using the backfill scheduler, your 
jobs will need specific wall times specified, since the idea with backfill is 
to run lower priority jobs ahead of time if and only if they can complete 
without delaying the estimated start time of higher priority jobs.
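
For example (a generic sketch, script name and numbers are illustrative), a
CPU-only job that gives the scheduler a realistic wall time can be backfilled
onto idle cores while higher priority GPU jobs wait for GPUs:

sbatch -p nodeGPUsmall --ntasks=1 --cpus-per-task=4 --time=02:00:00 cpu_job.sh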

On Jul 13, 2020, at 4:18 AM, navin srivastava  wrote:

Hi Team,

We have separate partitions for the GPU nodes and the CPU-only nodes.

Scenario: the jobs submitted in our environment request either 4 CPUs + 1 GPU
or 4 CPUs only, in the nodeGPUsmall and nodeGPUbig partitions. When all the
GPUs are exhausted, the remaining GPU jobs sit in the queue waiting for GPU
resources to become available. Jobs submitted with only CPUs do not go through
either, even though plenty of CPU resources are available: the CPU-only jobs
also stay pending behind these GPU-based jobs (the priority of the GPU jobs is
higher than that of the CPU ones).

Is there any option so that, when all GPU resources are exhausted, the CPU-only
jobs are still allowed to run? Is there a way to deal with this, or some custom
solution we could consider? There is no issue with the CPU-only partitions.

Below is my Slurm configuration file


NodeName=node[1-12] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 
RealMemory=128833 State=UNKNOWN
NodeName=node[13-16] NodeAddr=node[13-16] Sockets=2 CoresPerSocket=10 
RealMemory=515954 Feature=HIGHMEM State=UNKNOWN
NodeName=node[28-32]  NodeAddr=node[28-32] Sockets=2 CoresPerSocket=28 
RealMemory=257389
NodeName=node[32-33]  NodeAddr=node[32-33] Sockets=2 CoresPerSocket=24 
RealMemory=773418
NodeName=node[17-27]  NodeAddr=node[17-27] Sockets=2 CoresPerSocket=18 
RealMemory=257687 Feature=K2200 Gres=gpu:2
NodeName=node[34]  NodeAddr=node34 Sockets=2 CoresPerSocket=24 
RealMemory=773410 Feature=RTX Gres=gpu:8


PartitionName=node Nodes=node[1-10,14-16,28-33,35]  Default=YES 
MaxTime=INFINITE State=UP Shared=YES
PartitionName=nodeGPUsmall Nodes=node[17-27]  Default=NO MaxTime=INFINITE 
State=UP Shared=YES
PartitionName=nodeGPUbig Nodes=node[34]  Default=NO MaxTime=INFINITE State=UP 
Shared=YES

Regards
Navin.




[slurm-users] Is preempt_reorder_count compatible with preempt_strict_order ?

2020-07-13 Thread Marc Odunlami
Hello, 
I would like to know whether "preempt_reorder_count=#" is really used when
"preempt_strict_order" is also set.
In my understanding:

* "preempt_reorder_count=#" sets the number of iterations used to reorder the
preemption candidate list based on job size, which helps minimize the number
of preempted jobs.
* "preempt_strict_order" does not reorder the preemption candidate list; it
only selects from that list the jobs to be preempted based on their priority
(low-priority jobs will be preempted).

If I am right, this means that "preempt_reorder_count" should not be used (or
is implicitly deactivated) when "preempt_strict_order" is set.
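
For context, both options live in SchedulerParameters in slurm.conf; a
combination like the one in question could look like this (the PreemptType and
PreemptMode lines are only an illustrative preemption setup):

PreemptType=preempt/qos
PreemptMode=REQUEUE
SchedulerParameters=preempt_strict_order,preempt_reorder_count=3
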
Thank you for your help. 
Cheers. 



Marc Odunlami 
Head of the Research Digital Shared Services Centre (CSP)

http://www.univ-pau.fr/
IPREM (UMR 5254) 
Pôle Numérique 
Technopole Hélioparc 
2, avenue du Président Pierre Angot 
F-64053 Pau Cedex 9 
Tel: (+33)5 59 40 75 22
https://iprem.univ-pau.fr/fr/_plugins/mypage/mypage/content/modunlam.html



[slurm-users] CPU allocation for the GPU jobs.

2020-07-13 Thread navin srivastava
Hi Team,

We have separate partitions for the GPU nodes and the CPU-only nodes.

Scenario: the jobs submitted in our environment request either 4 CPUs + 1 GPU
or 4 CPUs only, in the nodeGPUsmall and nodeGPUbig partitions. When all the
GPUs are exhausted, the remaining GPU jobs sit in the queue waiting for GPU
resources to become available. Jobs submitted with only CPUs do not go through
either, even though plenty of CPU resources are available: the CPU-only jobs
also stay pending behind these GPU-based jobs (the priority of the GPU jobs is
higher than that of the CPU ones).
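
For reference, the two job shapes look roughly like this (script names are
illustrative):

# 4 CPUs + 1 GPU
sbatch -p nodeGPUsmall --ntasks=1 --cpus-per-task=4 --gres=gpu:1 gpu_job.sh
# CPU-only job in the same partition
sbatch -p nodeGPUsmall --ntasks=1 --cpus-per-task=4 cpu_job.sh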

Is there any option so that, when all GPU resources are exhausted, the CPU-only
jobs are still allowed to run? Is there a way to deal with this, or some custom
solution we could consider? There is no issue with the CPU-only partitions.

Below is my Slurm configuration file


NodeName=node[1-12] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10
RealMemory=128833 State=UNKNOWN
NodeName=node[13-16] NodeAddr=node[13-16] Sockets=2 CoresPerSocket=10
RealMemory=515954 Feature=HIGHMEM State=UNKNOWN
NodeName=node[28-32]  NodeAddr=node[28-32] Sockets=2 CoresPerSocket=28
RealMemory=257389
NodeName=node[32-33]  NodeAddr=node[32-33] Sockets=2 CoresPerSocket=24
RealMemory=773418
NodeName=node[17-27]  NodeAddr=node[17-27] Sockets=2 CoresPerSocket=18
RealMemory=257687 Feature=K2200 Gres=gpu:2
NodeName=node[34]  NodeAddr=node34 Sockets=2 CoresPerSocket=24
RealMemory=773410 Feature=RTX Gres=gpu:8


PartitionName=node Nodes=node[1-10,14-16,28-33,35]  Default=YES
MaxTime=INFINITE State=UP Shared=YES
PartitionName=nodeGPUsmall Nodes=node[17-27]  Default=NO MaxTime=INFINITE
State=UP Shared=YES
PartitionName=nodeGPUbig Nodes=node[34]  Default=NO MaxTime=INFINITE
State=UP Shared=YES

Regards
Navin.


Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-13 Thread Ole Holm Nielsen

Hi Janna,

If you're running an old Slurm version, there may be bugs already resolved 
in the later versions.  You can search for bugs with ReqNodeNotAvail in 
the title:

https://bugs.schedmd.com/buglist.cgi?quicksearch=ReqNodeNotAvail

For example, this one might be relevant:
https://bugs.schedmd.com/show_bug.cgi?id=9257

Upgrading to Slurm 20.02 is highly recommended.

/Ole

On 7/12/20 3:36 PM, Ole Holm Nielsen wrote:

In case your ARP cache is the problem, there is some advice on the Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks 



I think there are other causes for ReqNodeNotAvail, for example the node 
being allocated to other jobs.  Running "scontrol show node" and "scontrol 
show job" should reveal more details.
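
For example (placeholders in angle brackets):

scontrol show node <nodename>
scontrol show job <jobid>
squeue -j <jobid> -o "%.10i %.9P %.20j %.8T %R"   # %R shows the pending reason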


/Ole


On 11-07-2020 06:00, mercan wrote:

Hi Janna;

It sounds like an ARP cache table problem to me. If your Slurm head node can 
reach ~1000 or more network devices (all connected network cards, switches, 
etc., even if they are reachable through different ports of the server), you 
need to increase some network settings on the head node and on any servers 
that can reach the same number of network devices:


http://docs.adaptivecomputing.com/torque/5-0-3/Content/topics/torque/12-appendices/otherConsiderations.htm 



There is also some advice for big clusters in the Slurm documentation:

https://slurm.schedmd.com/big_sys.html

Regards,

Ahmet M.


On 11.07.2020 at 01:34, Janna Ore Nugent wrote:


Hi All,

I’ve got an intermittent situation with GPU nodes that sinfo says are 
available and idle, but squeue reports as “ReqNodeNotAvail”.  We’ve cycled 
the nodes to restart services, but it hasn’t helped.  Any suggestions for 
resolving this or digging into it more deeply?
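
A few standard Slurm commands that may help narrow this down (placeholders in
angle brackets):

squeue --start -j <jobid>          # expected start time and pending reason
scontrol show job <jobid> | grep -i reason
scontrol show node <nodename>      # check State, Reason and Gres/AllocTRES
sinfo -N -o "%N %T %E"             # node state plus any drain/down reason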