[slurm-dev] Re: LDAP required?

2017-04-10 Thread Uwe Sauter

For someone with no experience in LDAP deployment, yes, LDAP is a big issue. 
And depending on the cluster size, there are
different possibilities.

From a different point of view: tools like Salt/Ansible/… will almost
always require some kind of local storage (a local
installation of the OS), while with LDAP you could have a diskless cluster booting
from NFS. But this kind of setup wouldn't need LDAP
either: you could just manage the files on the NFS server and all nodes would get
the change immediately.


Am 11.04.2017 um 07:49 schrieb Marcin Stolarek:
> but... is LDAP such a big issue?
> 
> 2017-04-10 22:03 GMT+02:00 Jeff White  >:
> 
> Using Salt/Ansible/Chef/Puppet/Engine is another way to get it done.  
> Define your users in states/playbooks/whatever and don't
> bother with painful LDAP or ancient NIS solutions.
> 
> -- 
> Jeff White
> HPC Systems Engineer
> Information Technology Services - WSU
> 
> On 04/10/2017 09:39 AM, Alexey Safonov wrote:
>> If you don't want to share passwd and setting up LDAP is too complex a task,
>> you can set up NIS. It will take 30 minutes of your time.
>>
>> Alex
>>
>> On 11 Apr 2017 at 0:35, "Raymond Wan" wrote:
>>
>>
>> Dear all,
>>
>> I'm trying to set up a small cluster of computers (i.e., less than 5
>> nodes).  I don't expect the number of nodes to ever get larger than
>> this.
>>
>> For SLURM to work, I understand from web pages such as
>> https://slurm.schedmd.com/accounting.html
>> 
>> 
>> that UIDs need to be shared
>> across nodes.  Based on this web page, it seems sharing /etc/passwd
>> between nodes appears sufficient.  The word LDAP is mentioned at the
>> end of the paragraph as an alternative.
>>
>> I guess what I would like to know is whether it is acceptable to
>> completely avoid LDAP and use the approach mentioned there?  The
>> reason I'm asking is that I seem to be having a very nasty time
>> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
>> [perhaps it was my fault for thinking it would be easy...].
>>
>> If I can set up a small cluster without LDAP, that would be great.
>> But beyond this web page, I am wondering if there are suggestions for
>> "best practices".  For example, in practice, do most administrators
>> use LDAP?  If so and if it'll pay off in the end, then I can consider
>> continuing with setting it up...
>>
>> Thanks a lot!
>>
>> Ray
>>
> 
> 


[slurm-dev] Re: LDAP required?

2017-04-10 Thread Marcin Stolarek
but... is LDAP such a big issue?

2017-04-10 22:03 GMT+02:00 Jeff White :

> Using Salt/Ansible/Chef/Puppet/Engine is another way to get it done.
> Define your users in states/playbooks/whatever and don't bother with
> painful LDAP or ancient NIS solutions.
>
> --
> Jeff White
> HPC Systems Engineer
> Information Technology Services - WSU
>
> On 04/10/2017 09:39 AM, Alexey Safonov wrote:
>
> If you don't want to share passwd and setting up LDAP is too complex a task,
> you can set up NIS. It will take 30 minutes of your time.
>
> Alex
>
> On 11 Apr 2017 at 0:35, "Raymond Wan" wrote:
>
>>
>> Dear all,
>>
>> I'm trying to set up a small cluster of computers (i.e., less than 5
>> nodes).  I don't expect the number of nodes to ever get larger than
>> this.
>>
>> For SLURM to work, I understand from web pages such as
>> https://slurm.schedmd.com/accounting.html
>> 
>> that UIDs need to be shared
>> across nodes.  Based on this web page, it seems sharing /etc/passwd
>> between nodes appears sufficient.  The word LDAP is mentioned at the
>> end of the paragraph as an alternative.
>>
>> I guess what I would like to know is whether it is acceptable to
>> completely avoid LDAP and use the approach mentioned there?  The
>> reason I'm asking is that I seem to be having a very nasty time
>> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
>> [perhaps it was my fault for thinking it would be easy...].
>>
>> If I can set up a small cluster without LDAP, that would be great.
>> But beyond this web page, I am wondering if there are suggestions for
>> "best practices".  For example, in practice, do most administrators
>> use LDAP?  If so and if it'll pay off in the end, then I can consider
>> continuing with setting it up...
>>
>> Thanks a lot!
>>
>> Ray
>>
>
>


[slurm-dev] Re: set next job ID in scheduler

2017-04-10 Thread Edward Walter

Perfect!  Worked like a charm.

Thank you.

-Ed

From: Nicholas McCollum 
Sent: Monday, April 10, 2017 3:48 PM
To: slurm-dev
Subject: [slurm-dev] Re: set next job ID in scheduler

Set FirstJobId in your slurm.conf

FirstJobId=12345


--
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority
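After editing slurm.conf, the controller has to pick up the new value. A minimal sketch (assuming a systemd-managed slurmctld; `DRY_RUN=1` just prints the commands instead of running them):

```shell
# apply_first_jobid: restart the Slurm controller after setting
# FirstJobId in slurm.conf, then dump the effective configuration so
# the FirstJobId line can be checked. Set DRY_RUN=1 to print the
# commands instead of executing them.
apply_first_jobid() {
    run="${DRY_RUN:+echo}"
    $run systemctl restart slurmctld
    $run scontrol show config   # grep this output for FirstJobId
}
```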

On Mon, 2017-04-10 at 12:45 -0700, Edward Walter wrote:
> Hi All,
>
> We recently experienced a RAID failure on one of our clusters
> running
> slurm.  We have everything back up and running now.  We would like
> to
> set the next job that the scheduler uses so that there are no
> collisions
> in the job numbers between the pre-crash and post-crash job IDs.
>
> Can anyone point me at the appropriate scontrol (or other) command
> to
> set the next job ID?
>
> Thank you in advance.
>
> -Ed
>


[slurm-dev] Re: Deleting jobs in Completing state on hung nodes

2017-04-10 Thread Paul Edmon


Sometimes restarting slurm on the node and the master can purge the jobs 
as well.


-Paul Edmon-


On 04/10/2017 03:59 PM, Douglas Meyer wrote:

Set the node to drain if other jobs are running.  Then down, and then resume.  Down will
kill and clear any jobs.

scontrol update nodename= state=drain reason=job_sux

scontrol update nodename= state=down reason=job_sux
scontrol update nodename= state=resume

If it happens again either reboot or stop and restart slurm.  Make sure you 
verify it has stopped.

Doug

-Original Message-
From: Gene Soudlenkov [mailto:g.soudlen...@auckland.ac.nz]
Sent: Monday, April 10, 2017 12:56 PM
To: slurm-dev 
Subject: [slurm-dev] Re: Deleting jobs in Completing state on hung nodes


It happens sometimes - in our case the epilogue code got stuck. Either check the
processes and kill whichever ones belong to the user, or simply reboot the nodes.

Cheers,
Gene

--
New Zealand eScience Infrastructure
Centre for eResearch
The University of Auckland
e: g.soudlen...@auckland.ac.nz
p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453
w: www.nesi.org.nz

On 11/04/17 07:52, Tus wrote:

I have 2 nodes that have hardware issues and died with jobs running on
them. I am not able to fix the nodes at the moment but want to delete
the jobs that are stuck in completing state from slurm. I have set the
nodes to DRAIN and tried scancel which did not work.

How do I remove these jobs?




[slurm-dev] Re: LDAP required?

2017-04-10 Thread Jeff White
Using Salt/Ansible/Chef/Puppet/Engine is another way to get it done.  
Define your users in states/playbooks/whatever and don't bother with 
painful LDAP or ancient NIS solutions.


--
Jeff White
HPC Systems Engineer
Information Technology Services - WSU
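The "define your users in states/playbooks" approach can be sketched as a minimal Ansible play (hypothetical inventory group, user name and UID; the `ansible.builtin.user` module itself is standard):

```yaml
# Hypothetical playbook: create the same user with a fixed UID on all
# cluster nodes, so UIDs match everywhere without LDAP or NIS.
- hosts: cluster_nodes          # assumed inventory group
  become: true
  tasks:
    - name: Ensure user alice exists with a pinned UID
      ansible.builtin.user:
        name: alice
        uid: 1001               # identical UID on every node
        shell: /bin/bash
```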

On 04/10/2017 09:39 AM, Alexey Safonov wrote:

Re: [slurm-dev] LDAP required?
If you don't want to share passwd and setting up LDAP is too complex a task,
you can set up NIS. It will take 30 minutes of your time.


Alex

On 11 Apr 2017 at 0:35, "Raymond Wan" wrote:



Dear all,

I'm trying to set up a small cluster of computers (i.e., less than 5
nodes).  I don't expect the number of nodes to ever get larger than
this.

For SLURM to work, I understand from web pages such as
https://slurm.schedmd.com/accounting.html


that UIDs need to be shared
across nodes.  Based on this web page, it seems sharing /etc/passwd
between nodes appears sufficient.  The word LDAP is mentioned at the
end of the paragraph as an alternative.

I guess what I would like to know is whether it is acceptable to
completely avoid LDAP and use the approach mentioned there? The
reason I'm asking is that I seem to be having a very nasty time
setting up LDAP.  It doesn't seem as "easy" as I thought it would be
[perhaps it was my fault for thinking it would be easy...].

If I can set up a small cluster without LDAP, that would be great.
But beyond this web page, I am wondering if there are suggestions for
"best practices".  For example, in practice, do most administrators
use LDAP?  If so and if it'll pay off in the end, then I can consider
continuing with setting it up...

Thanks a lot!

Ray





[slurm-dev] Re: Deleting jobs in Completing state on hung nodes

2017-04-10 Thread Douglas Meyer
Set the node to drain if other jobs are running.  Then down, and then resume.  Down will
kill and clear any jobs.

scontrol update nodename= state=drain reason=job_sux

scontrol update nodename= state=down reason=job_sux
scontrol update nodename= state=resume

If it happens again either reboot or stop and restart slurm.  Make sure you 
verify it has stopped.

Doug
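The drain/down/resume sequence can be wrapped in a small helper (a sketch only; `node01` and the reason string are placeholders, and `DRY_RUN=1` prints the scontrol commands instead of running them):

```shell
# clear_node: cycle a node through drain -> down -> resume to purge
# jobs stuck in the Completing state, as described above.
# Usage: clear_node <nodename> [reason]
clear_node() {
    node="$1"
    reason="${2:-stuck_completing}"
    run="${DRY_RUN:+echo}"      # DRY_RUN=1 -> just echo the commands
    $run scontrol update nodename="$node" state=drain  reason="$reason"
    $run scontrol update nodename="$node" state=down   reason="$reason"
    $run scontrol update nodename="$node" state=resume
}
# example (dry run): DRY_RUN=1 clear_node node01
```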

-Original Message-
From: Gene Soudlenkov [mailto:g.soudlen...@auckland.ac.nz] 
Sent: Monday, April 10, 2017 12:56 PM
To: slurm-dev 
Subject: [slurm-dev] Re: Deleting jobs in Completing state on hung nodes


It happens sometimes - in our case the epilogue code got stuck. Either check the
processes and kill whichever ones belong to the user, or simply reboot the nodes.

Cheers,
Gene

--
New Zealand eScience Infrastructure
Centre for eResearch
The University of Auckland
e: g.soudlen...@auckland.ac.nz
p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453
w: www.nesi.org.nz

On 11/04/17 07:52, Tus wrote:
>
> I have 2 nodes that have hardware issues and died with jobs running on 
> them. I am not able to fix the nodes at the moment but want to delete 
> the jobs that are stuck in completing state from slurm. I have set the 
> nodes to DRAIN and tried scancel which did not work.
>
> How do I remove these jobs?
>
>


[slurm-dev] Re: Deleting jobs in Completing state on hung nodes

2017-04-10 Thread Gene Soudlenkov


It happens sometimes - in our case the epilogue code got stuck. Either check
the processes and kill whichever ones belong to the user, or simply
reboot the nodes.


Cheers,
Gene

--
New Zealand eScience Infrastructure
Centre for eResearch
The University of Auckland
e: g.soudlen...@auckland.ac.nz
p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453
w: www.nesi.org.nz

On 11/04/17 07:52, Tus wrote:


I have 2 nodes that have hardware issues and died with jobs running on
them. I am not able to fix the nodes at the moment but want to delete the
jobs that are stuck in completing state from slurm. I have set the nodes
to DRAIN and tried scancel which did not work.

How do I remove these jobs?




[slurm-dev] Deleting jobs in Completing state on hung nodes

2017-04-10 Thread Tus

I have 2 nodes that have hardware issues and died with jobs running on
them. I am not able to fix the nodes at the moment but want to delete the
jobs that are stuck in completing state from slurm. I have set the nodes
to DRAIN and tried scancel which did not work.

How do I remove these jobs?




[slurm-dev] Re: set next job ID in scheduler

2017-04-10 Thread Nicholas McCollum
Set FirstJobId in your slurm.conf

FirstJobId=12345


-- 
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Mon, 2017-04-10 at 12:45 -0700, Edward Walter wrote:
> Hi All,
> 
> We recently experienced a RAID failure on one of our clusters
> running 
> slurm.  We have everything back up and running now.  We would like
> to 
> set the next job that the scheduler uses so that there are no
> collisions 
> in the job numbers between the pre-crash and post-crash job IDs.
> 
> Can anyone point me at the appropriate scontrol (or other) command
> to 
> set the next job ID?
> 
> Thank you in advance.
> 
> -Ed
> 

[slurm-dev] set next job ID in scheduler

2017-04-10 Thread Edward Walter


Hi All,

We recently experienced a RAID failure on one of our clusters running 
slurm.  We have everything back up and running now.  We would like to 
set the next job that the scheduler uses so that there are no collisions 
in the job numbers between the pre-crash and post-crash job IDs.


Can anyone point me at the appropriate scontrol (or other) command to 
set the next job ID?


Thank you in advance.

-Ed

--

Ed Walter
Technical Manager - Unix Engineering
SCS Computing Facilities
Carnegie Mellon University


[slurm-dev] Re: Re:Best Way to Schedule Jobs based on predetermined Lists

2017-04-10 Thread Thomas M. Payerle




On 2017-04-05 16:00, maviko.wag...@fau.de wrote:

Hello Dani and Thomas,

[ ...]

However, I think I did not specify clearly enough what my cluster looks
like and what I'm trying to achieve.
Compared to a regular HPC cluster, my testing cluster consists of as
little as 5 nodes (each having the same "grand-scale" features, so no
IB nodes etc., differing only in hardware details like CPU, RAM,
etc., and including some MCUs akin to a Raspberry Pi).
The purpose of this cluster is to investigate how smart distribution
of workloads based on predetermined performance and energy data can
benefit HPC clusters that consist of heterogeneous systems differing
greatly in energy consumption and performance.
It's just a small research project.


If your only intent is to do this on your 5-node test cluster, you
probably do not need Slurm.  If you are looking to have something that expands
to real clusters, then you really should be using something like features
and the scheduler.  The scheduler already takes into account which nodes
provide which resources, which resources on the nodes are currently available
for use, and handles the complex task of scheduling the jobs (and I found that
task to be much more complex than I initially and naively thought it would be
when I first started thinking about it).

My IB example was just an example.  You could just as easily assign
Slurm "features" based on the feature set of the CPU, etc.  E.g., if only
some nodes have CPUs that support AVX, label those nodes as "avx", and jobs
requiring AVX can restrict themselves to such nodes.

If you start specifying specific nodes in requests to the scheduler, you
are going to find yourself working against the scheduler, and that is not
likely to have a good outcome.  You are better off telling Slurm which
nodes have which features (essentially a one-time configuration of the cluster)
and then having your code translate the requirements into a list of "features"
requested for the job under Slurm.
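Concretely, that one-time labelling plus per-job request looks something like this (a sketch with hypothetical node names and features; `Features=` in slurm.conf and `--constraint` at submission are the standard Slurm mechanisms):

```
# slurm.conf (excerpt): label nodes with their capabilities once
NodeName=node[01-02] Features=avx,bigmem ...
NodeName=node[03-05] Features=lowpower ...

# job submission: restrict the job to AVX-capable nodes
sbatch --constraint=avx job.sh
```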

The only part that I see as a potential major problem is that, as I tried to
explain previously, the "features" requested by a job are REQUIREMENTS to
Slurm, not SUGGESTIONS.  E.g., if a job can run either with or without AVX
support, but runs better with AVX support, requiring the "avx" feature will
force the job to wait for a node supporting AVX, even if all the AVX nodes
are in use and there are plenty of non-AVX nodes which are idle.

I am not aware of anything in Slurm which handles features as SUGGESTIONS, and
doing so I believe would greatly complicate an already complex algorithm.
I believe anything done would need to modify the actual C code for the
scheduler.  It probably is not _too_ bad to arrange that when a job
"suggesting" avx starts to run, it picks any available avx nodes first.  But that
is likely to have only limited success.  The closest thing currently in the
Slurm code base is the code for attempting to keep the nodes of a job on the
same leaf switch (e.g. https://slurm.schedmd.com/topology.html), but I suspect
that would be quite complicated to handle across a number of "features".



[slurm-dev] Re: LDAP required?

2017-04-10 Thread Grigory Shamov

Hi Raymond,

In the old days, say 10 years back, NIS and LDAP were avoided altogether on
HPC clusters, with some mechanism for synchronizing /etc/passwd and similar
files across the nodes instead. OSCAR, ROCKS, etc. all did it this way. We
do it using some rsync-based machinery.

It might be possible to set up LDAP, but it might need something like name
service caching (nscd), and possibly increasing the limits.conf
limits.
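An rsync-based machinery of that kind can be sketched as follows (purely illustrative; hostnames are placeholders, it assumes root SSH from the head node, and a real setup would also guard against pushing a half-written file):

```shell
# push_accounts: naive sketch of rsync-based account syncing, pushing
# the account files from the head node to each compute node given as
# an argument. Set DRY_RUN=1 to print the commands instead of running.
push_accounts() {
    run="${DRY_RUN:+echo}"
    for host in "$@"; do
        for f in /etc/passwd /etc/group /etc/shadow; do
            $run rsync -a "$f" "root@$host:$f"
        done
    done
}
# example (dry run): DRY_RUN=1 push_accounts node01 node02
```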

-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625





On 2017-04-10, 11:36 AM, "Raymond Wan"  wrote:

>
>Dear all,
>
>I'm trying to set up a small cluster of computers (i.e., less than 5
>nodes).  I don't expect the number of nodes to ever get larger than
>this.
>
>For SLURM to work, I understand from web pages such as
>https://slurm.schedmd.com/accounting.html that UIDs need to be shared
>across nodes.  Based on this web page, it seems sharing /etc/passwd
>between nodes appears sufficient.  The word LDAP is mentioned at the
>end of the paragraph as an alternative.
>
>I guess what I would like to know is whether it is acceptable to
>completely avoid LDAP and use the approach mentioned there?  The
>reason I'm asking is that I seem to be having a very nasty time
>setting up LDAP.  It doesn't seem as "easy" as I thought it would be
>[perhaps it was my fault for thinking it would be easy...].
>
>If I can set up a small cluster without LDAP, that would be great.
>But beyond this web page, I am wondering if there are suggestions for
>"best practices".  For example, in practice, do most administrators
>use LDAP?  If so and if it'll pay off in the end, then I can consider
>continuing with setting it up...
>
>Thanks a lot!
>
>Ray


[slurm-dev] Re: LDAP required?

2017-04-10 Thread Alexey Safonov
If you don't want to share passwd and setting up LDAP is too complex a task,
you can set up NIS. It will take 30 minutes of your time.

Alex
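At a very high level, that NIS setup looks something like this (a hedged sketch only; package names and the ypinit path vary by distribution, and /etc/defaultdomain or equivalent is needed to make the domain persistent):

```
# On the master (Debian-style; adjust packages/paths for your distro):
apt-get install nis                  # server and tools
domainname mycluster                 # hypothetical NIS domain name
/usr/lib/yp/ypinit -m                # build the initial maps

# On each client:
apt-get install nis
domainname mycluster
# then add "nis" to the passwd/group/shadow lines in /etc/nsswitch.conf
```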

On 11 Apr 2017 at 0:35, "Raymond Wan" wrote:

>
> Dear all,
>
> I'm trying to set up a small cluster of computers (i.e., less than 5
> nodes).  I don't expect the number of nodes to ever get larger than
> this.
>
> For SLURM to work, I understand from web pages such as
> https://slurm.schedmd.com/accounting.html that UIDs need to be shared
> across nodes.  Based on this web page, it seems sharing /etc/passwd
> between nodes appears sufficient.  The word LDAP is mentioned at the
> end of the paragraph as an alternative.
>
> I guess what I would like to know is whether it is acceptable to
> completely avoid LDAP and use the approach mentioned there?  The
> reason I'm asking is that I seem to be having a very nasty time
> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
> [perhaps it was my fault for thinking it would be easy...].
>
> If I can set up a small cluster without LDAP, that would be great.
> But beyond this web page, I am wondering if there are suggestions for
> "best practices".  For example, in practice, do most administrators
> use LDAP?  If so and if it'll pay off in the end, then I can consider
> continuing with setting it up...
>
> Thanks a lot!
>
> Ray
>


[slurm-dev] Re: Questions about Openmpi, PMI-1 and slurm.conf

2017-04-10 Thread Doug Meyer
Many thanks to all.

Doug

On Sat, Apr 8, 2017 at 9:18 PM, r...@open-mpi.org  wrote:

>
>
> > On Apr 8, 2017, at 8:44 PM, Doug Meyer  wrote:
> >
> > Running 15.x and have run into a next step that is probably us tripping
> over our feet.
> >
> > Engineers were happy as clams with SGE but it was time to move on.  We
> have adopted slurm and are moving users forward.  So far, much joy.  As we
> work more with MPI apps we are getting foggy.
> >
> > slurm support automatically included in openmpi builds.  Excellent.
> >
> > Must add build flag for PMI in OpenMPI build.  Got it.
>
> Not necessarily. If you are running 15.x of SLURM, then you can use the
> PMIx support from OpenMPI v2.x. You’ll find jobs start significantly faster
> that way. Or if you decide (see below) to launch via mpirun, then you also
> don’t need to set the flag.
>
> >
> > Mpidefault and Mpiflags??
> >
> > Should Mpidefault be "none"?  If all our MPI work is OpenMPI with PMI-1
> are we better off setting it to openmpi?
>
> I’m not sure the openmpi plugin actually does much of anything. We
> certainly don’t rely on it doing anything.
>
> >
> > Understand we can use srun to change the MPI.
> >
> > From reading it sounds like Mpidefault can be ignored if we are running
> OpenMPI 1.5 or newer and PMI.  Is that correct?
>
> Yes, as I said above. Though I would strongly suggest you start with OMPI
> v2.x
>
> >
> > Finally, from reading it seems slurm is very mature and supplants the
> need for mpirun unless we are using an salloc or running from an external
> script. Using srun or sbatch, forget about mpirun.  Is this correct?
>
> Not completely - while it is true that you can do a lot with srun, there
> are a number of features that mpirun supports and srun does not. So it
> really is a matter of looking at the options each provides, and deciding
> which meets your needs. You won’t find any performance difference
> regardless of which launch method you use, and with OMPI v2.x, both start
> in the same amount of time.
>
> Ralph
>
> >
> > Thank you,
> > Doug Meyer
> >
> >
>
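The launch options discussed above can be illustrated as follows (a sketch; the binary name and task count are placeholders):

```
# Launch directly with srun, telling Slurm which PMI flavor to use:
srun --mpi=pmi2 -n 8 ./mpi_app      # PMI-2 style
srun --mpi=pmix -n 8 ./mpi_app      # PMIx, with OpenMPI 2.x + Slurm PMIx support

# Or get an allocation and let OpenMPI's mpirun drive the launch:
salloc -n 8 mpirun ./mpi_app
```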