[slurm-dev] Re: Meaning of error messages from sbatch

2017-01-16 Thread Barbara Krašovec
Hello!


On 01/13/2017 08:18 AM, Nigella Sanders wrote:
> Meaning of error messages from sbatch
>
>
> Hi all,
>
>
> When launching my simple task-farming jobs (described here),
> sbatch usually reports errors like these:
>
> ibwarn: [100226] _do_madrpc: recv failed: Interrupted system call
>  ibwarn: [100226] mad_rpc: _do_madrpc failed; dport (Lid 150)
>  slurmstepd: ofed: No error
>  ibwarn: [12463] _do_madrpc: recv failed: Interrupted system call
>  ibwarn: [12463] mad_rpc: _do_madrpc failed; dport (Lid 114)
>  slurmstepd: classportinfo query: No error
>
> What do they mean?
> I guess they are related to InfiniBand, but I don't know how to manage
> them or what to look into to fix them.
> Most of the time they don't affect the results and jobs complete OK.
>
> Any clue on this would be much appreciated,
>
> Regards,
> Nigella.
I wouldn't worry too much if the interrupted syscalls do not turn into
errors (for now you only see warnings). I would try to run some
diagnostics with ibdiagnet and try ibping.
If the warnings are present on every node, maybe you should consider a
firmware upgrade.
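
For reference, the basic checks look roughly like this (a sketch; the LID
is taken from the warnings above, so adjust it for your fabric):

# On the node you want to test, start an ibping responder:
ibping -S

# From another node, ping that port by its LID (e.g. Lid 150 from the logs):
ibping -c 10 150

# Sweep the whole fabric for bad links and error counters:
ibdiagnet

# Check the HCA firmware version before deciding on an upgrade:
ibstat | grep -i 'Firmware'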

Cheers,
Barbara



[slurm-dev] Re: A little bit help from my slurm-friends

2017-01-16 Thread Loris Bennett

Hello David,

David WALTER  writes:

> Hello everyone,
>
> I need some advice or some good practices, as I'm a new Slurm
> administrator... in fact, a new cluster manager!
>
> Everything is OK, jobs are running well, etc. But now I would like to
> configure priority on jobs to improve the efficiency of my cluster. I see
> I have to activate the Multifactor Priority plugin to get rid of Slurm's
> default FIFO behavior.
>
> So there are six factors, and the fair-share one interests me, but do you
> have any advice? I'm managing a small cluster (I think), 40 nodes, with 4
> different generations (and different hardware), and I would like to
> optimize it. For now I have set 4 partitions, 1 per generation, which may
> not be the best solution?

An alternative would be to have just one partition and to distinguish
the machines via 'features' defined in slurm.conf.  It depends a bit
on how different the machines are and how interested the users are in
those differences.

> Do you think I can just use the "job size" and "partition" and maybe the
> "age" factors? Maybe you need more information?

I would have thought that in general you would want to use 'fairshare'
as well, but that obviously depends on what you are trying to achieve.
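
For what it's worth, a minimal multifactor set-up in slurm.conf might look
something like this (the weights are placeholders and need tuning for your
site):

# Replace the default FIFO ordering with multifactor priority
PriorityType=priority/multifactor

# Hypothetical starting weights; only their relative sizes matter
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000

# How quickly past usage is forgotten for fairshare (here: 7 days)
PriorityDecayHalfLife=7-0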

> In any case thanks for your help
>
> David

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] A little bit help from my slurm-friends

2017-01-16 Thread David WALTER
Hello everyone,

I need some advice or some good practices, as I'm a new Slurm
administrator... in fact, a new cluster manager!

Everything is OK, jobs are running well, etc. But now I would like to
configure priority on jobs to improve the efficiency of my cluster. I see I
have to activate the Multifactor Priority plugin to get rid of Slurm's
default FIFO behavior.

So there are six factors, and the fair-share one interests me, but do you
have any advice? I'm managing a small cluster (I think), 40 nodes, with 4
different generations (and different hardware), and I would like to
optimize it. For now I have set 4 partitions, 1 per generation, which may
not be the best solution?

Do you think I can just use the "job size" and "partition" and maybe the
"age" factors? Maybe you need more information?

In any case, thanks for your help.

David



[slurm-dev] RE: A little bit help from my slurm-friends

2017-01-16 Thread David WALTER

Dear Loris,

Thanks for your response !

I'm going to look into these features in slurm.conf.  I have only configured
the CPUs and sockets per node. Do you have any example or link explaining
how it works and what I can use?

My goal is to respond to people's needs and launch their jobs as fast as
possible, without losing time when one partition is idle while the others
are fully loaded. That's why I thought the fair-share factor was the best
solution.

Thanks

Greetings

--
David WALTER
The computer guy
david.wal...@ens.fr
01/44/32/27/94

INSERM U960
Laboratoire de Neurosciences Cognitives
Ecole Normale Supérieure
29, rue d'Ulm
75005 Paris

-----Original Message-----
From: Loris Bennett [mailto:loris.benn...@fu-berlin.de]
Sent: Monday, 16 January 2017 13:12
To: David WALTER
Cc: slurm-dev
Subject: Re: [slurm-dev] A little bit help from my slurm-friends

Hello David,

David WALTER  writes:

> Hello everyone,
>
> I need some advice or some good practices, as I'm a new Slurm
> administrator... in fact, a new cluster manager!
>
> Everything is OK, jobs are running well, etc. But now I would like to
> configure priority on jobs to improve the efficiency of my cluster. I
> see I have to activate the Multifactor Priority plugin to get rid of
> Slurm's default FIFO behavior.
>
> So there are six factors, and the fair-share one interests me, but do
> you have any advice? I'm managing a small cluster (I think), 40 nodes,
> with 4 different generations (and different hardware), and I would like
> to optimize it. For now I have set 4 partitions, 1 per generation, which
> may not be the best solution?

An alternative would be to have just one partition and to distinguish the
machines via 'features' defined in slurm.conf.  It depends a bit on how
different the machines are and how interested the users are in those
differences.

> Do you think I can just use the "job size" and "partition" and maybe the
> "age" factors? Maybe you need more information?

I would have thought that in general you would want to use 'fairshare' as
well, but that obviously depends on what you are trying to achieve.

> In any case thanks for your help
>
> David

Regards

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] RE: A little bit help from my slurm-friends

2017-01-16 Thread Loris Bennett

David WALTER  writes:

> Dear Loris,
>
> Thanks for your response !
>
> I'm going to look into these features in slurm.conf.  I have only
> configured the CPUs and sockets per node. Do you have any example or
> link explaining how it works and what I can use?

It's not very complicated.  A feature is just a label, so if you had
some nodes with Intel processors and some with AMD, you could attach
the corresponding features, e.g.:

NodeName=node[001,002] Procs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=42000 State=unknown Feature=intel
NodeName=node[003,004] Procs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=42000 State=unknown Feature=amd

Users then just request the required CPU type in their batch scripts as
a constraint, e.g.:

#SBATCH --constraint="intel"

> My goal is to respond to people's needs and launch their jobs as fast
> as possible, without losing time when one partition is idle while the
> others are fully loaded.

The easiest way to avoid the problem you describe is to just have one
partition.  If you have multiple partitions, the users have to
understand what the differences are so that they can choose sensibly.

> That's why I thought the fair-share factor was the best solution.

Fairshare won't really help you with the problem that one partition
might be full while another is empty.  It will just affect the ordering
of jobs in the full partition, although the weight of the partition term
in the priority expression can affect the relative attractiveness of the
partitions.

In general, however, I would suggest you start with a simple set-up.
You can always add to it later to address specific issues as they arise.
For instance, you could start with one partition and two QOS: one for
normal jobs and one for test jobs.  The latter could have a higher
priority, but only a short maximum run-time and possibly a low maximum
number of jobs per user.
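
As a sketch, setting up such a QOS with sacctmgr might look like this (the
name and limits are hypothetical, and this assumes accounting is enabled
and AccountingStorageEnforce includes 'qos'):

# A high-priority QOS for short test jobs ('normal' exists by default)
sacctmgr add qos test
sacctmgr modify qos test set Priority=100 MaxWall=00:30:00 MaxJobsPerUser=2

# Users would then submit test jobs with, e.g.:
#   sbatch --qos=test job.sh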

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Slurm for render farm

2017-01-16 Thread John Hearns
Is anyone out there using Slurm in conjunction with Renderpal?
http://www.renderpal.com/

Forestalling the obvious replies... yes, I know that a render farm manager
and a scheduler do basically the same thing. In a rational universe I would
be using one or t'other. Perhaps in the next life...

The concept at the moment is to run the Renderpal server, which is a
Windows application; it can detect the Linux render clients via a
'heartbeat' mechanism. I would spawn the Linux render clients as needed
via Slurm.

Thinking out loud, I could use Slurm to run render clients on all compute
nodes in the cluster, then use job preemption to kill those jobs when other
compute jobs need the nodes.  I guess that very much risks 'live' Renderpal
jobs being killed off.
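
In slurm.conf terms, something like this is what I have in mind (a sketch;
the partition names and node list are made up):

# Jobs in a higher PriorityTier partition preempt jobs in lower ones
PreemptType=preempt/partition_prio
PreemptMode=CANCEL

# Low-tier partition for render clients, high-tier for real compute jobs
PartitionName=render  Nodes=node[001-040] PriorityTier=1 Default=YES
PartitionName=compute Nodes=node[001-040] PriorityTier=10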

Any experiences in this area gratefully received.

John H




[slurm-dev] build slurm on musl?

2017-01-16 Thread Rowan, Jim

Hi,

We're trying to bring up Slurm (compute nodes, not the master) on a
platform that uses musl for libc. Musl doesn't support lazy binding of
symbols in dynamic objects, a feature that seems to be a cornerstone of
the plugin implementation.

Has anyone done work on getting around this issue? Any ideas to share on
what approach to take? Our cluster is very simple; we don't need any of
the fancy things that plugins provide, but it appears that even linear
scheduling is a plugin.




Jim Rowan
j...@codeaurora.org
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by 
the Linux Foundation