[slurm-dev] Re: question about federation

2017-11-01 Thread Ole Holm Nielsen


I'm pretty sure that a single, central slurmdbd service is required for 
multiple, federated clusters.  I think that's what ties multiple 
clusters together into a single "federation".
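As a minimal sketch of what that typically looks like (the cluster names, hostname and
sacctmgr commands below are illustrative, not taken from this thread), both controllers
point at the same slurmdbd in slurm.conf, and the federation is then created in the
accounting database with sacctmgr:

   # slurm.conf on each cluster (only ClusterName differs):
   ClusterName=cluster1                # "cluster2" on the other controller
   AccountingStorageType=accounting_storage/slurmdbd
   AccountingStorageHost=dbserver.example.com

   # Once both clusters are registered in the slurmdbd database:
   sacctmgr add federation myfed clusters=cluster1,cluster2
   sacctmgr show federation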


You mention a problem with squeue, but you don't list the error 
messages.  Are you sure that all nodes have identical slurm.conf, and 
that daemons have been restarted after changes?  You may want to consult 
my Slurm Wiki at https://wiki.fysik.dtu.dk/niflheim/SLURM for 
configuration details.


Caveat: I just heard the talk at the SLUG conference, but I have no 
intention of working with federated clusters myself.  So I can't help 
you.  Commercial support from SchedMD is recommended, see 
https://www.schedmd.com/services.php


/Ole

On 10/31/2017 06:36 PM, zhangtao102...@126.com wrote:

Thank you very much, Ole
I have read this PDF document, but I'm not sure about the configuration.
I guess the two slurmctld should be configured to use the same slurmdbd.
Is that right, or what is the correct way?
Thanks, regards


zhangtao102...@126.com

    *From:* Ole Holm Nielsen <mailto:ole.h.niel...@fysik.dtu.dk>
*Date:* 2017-10-31 19:08
*To:* slurm-dev <mailto:slurm-dev@schedmd.com>
*Subject:* [slurm-dev] Re: question about federation
On 10/31/2017 09:34 AM, zhangtao102...@126.com wrote:
 > I have noticed that Slurm v17.11 will support federated clusters, but I
 > can't find detailed documentation about it.
 > Now I have 2 questions about federated clusters:
 > (1) When configuring a federated cluster, should I configure the two
 > slurmctld daemons to communicate with the same slurmdbd (or make each
 > cluster's slurmctld/slurmdbd work with the same MySQL database)?
Federation support was described at the Slurm User Group Meeting last
month. PDFs of the presentations are online at
http://slurm.schedmd.com/publications.html
See the talk: Technical: Federated Cluster Support, Brian Christiansen
and Danny Auble, SchedMD.
Maybe this will help you?
/Ole


[slurm-dev] Re: question about federation

2017-10-31 Thread Ole Holm Nielsen


On 10/31/2017 09:34 AM, zhangtao102...@126.com wrote:
> I have noticed that Slurm v17.11 will support federated clusters, but I
> can't find detailed documentation about it.
> Now I have 2 questions about federated clusters:
> (1) When configuring a federated cluster, should I configure the two
> slurmctld daemons to communicate with the same slurmdbd (or make each
> cluster's slurmctld/slurmdbd work with the same MySQL database)?


Federation support was described at the Slurm User Group Meeting last 
month. PDFs of the presentations are online at

http://slurm.schedmd.com/publications.html
See the talk: Technical: Federated Cluster Support, Brian Christiansen 
and Danny Auble, SchedMD.


Maybe this will help you?

/Ole


[slurm-dev] Re: SLURM 17.02.8 not optimally scheduling jobs/utilizing resources

2017-10-25 Thread Ole Holm Nielsen


On 10/25/2017 01:52 PM, Holger Naundorf wrote:

I'd really appreciate any help the SLURM wizards can provide! We suspect
it's something to do with how we've set up QoS, or maybe we need to
tweak the scheduler configuration in 17.02.8; however, there's no single
clear path forward. Just let me know if there's any further information
I can provide to help troubleshoot or give fodder for suggestions.



While I am in no way a SLURM wizard, one thing I would try is
increasing 'bf_max_job_test' to something much bigger (on the order of the
usual length of your queue). With the current setting (as far as I
understand it), as soon as your 50 top-priority queued jobs are all waiting
for 'legitimate' reasons (i.e. their designated nodes/QOS are full),
everything below them will no longer be considered for backfill.


I agree that the backfill scheduler requires configuration beyond the 
default settings!  This surprised me as well.  I wrote some notes in my 
Wiki which could be used as a starting point: 
https://wiki.fysik.dtu.dk/niflheim/Slurm_scheduler#backfill-scheduler
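For reference, a hedged example of the kind of slurm.conf settings involved (the
values below are only illustrative, not a recommendation from this thread):

   SchedulerType=sched/backfill
   SchedulerParameters=bf_max_job_test=1000,bf_window=4320,bf_interval=30,bf_continue

SchedulerParameters changes take effect after "scontrol reconfigure" or a restart
of slurmctld.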


/Ole


[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen


Hi Jin,

I think I always do your steps 3 and 4 in the opposite order: restart
slurmctld first, then slurmd on the nodes (see the sketch below):


> 3. Restart the slurmd on all nodes
> 4. Restart the slurmctld
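In practice that could look like the following sketch (assuming systemd units and
ClusterShell; the node list is just an example):

   # On the controller host first:
   systemctl restart slurmctld
   # Then on all compute nodes:
   clush -w 'compute[001-099]' 'systemctl restart slurmd'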

Since you run a very old Slurm 15.08, perhaps you should upgrade 15.08 
-> 16.05 -> 17.02.  Soon there will be a 17.11.  FYI: I wrote some notes 
about upgrading: 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm


/Ole



On 10/23/2017 02:55 PM, JinSung Kang wrote:

Hi

Thanks everyone for your response. I have also tested my setup to remove 
nodes from the cluster, and the same thing happens.


*To answer some of the previous questions.*
"Node compute004 appears to have a different slurm.conf than the 
slurmctld" error comes up when I replace slurm.conf in all the devices, 
but it goes away when I restart slurmctld.


slurm version that I'm running is slurm 15.08.7

I've included the slurm.conf rather than slurmdbd.conf.

Cheers,

Jin


On Mon, Oct 23, 2017 at 8:25 AM Ole Holm Nielsen 
<ole.h.niel...@fysik.dtu.dk <mailto:ole.h.niel...@fysik.dtu.dk>> wrote:



Hi Jin,

Your slurmctld.log says "Node compute004 appears to have a different
slurm.conf than the slurmctld" etc.  This will happen if you didn't
correctly copy slurm.conf to the nodes.  Please correct this
potential error.

Also, please specify which version of Slurm you're running.

/Ole

On 10/22/2017 08:44 PM, JinSung Kang wrote:
 > I am having trouble with adding new nodes into slurm cluster without
 > killing the jobs that are currently running.
 >
 > Right now I
 >
 > 1. Update the slurm.conf and add a new node to it
 > 2. Copy new slurm.conf to all the nodes,
 > 3. Restart the slurmd on all nodes
 > 4. Restart the slurmctld
 >
 > But when I restart slurmctld all the jobs that were currently running
 > are requeued, with (Begin Time) as the reason for not running. The newly
 > added node works perfectly fine.
 >
 > I've included the slurm.conf. I've also included slurmctld.log output
 > when I'm trying to add the new node.



[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen


Hi Jin,

Your slurmctld.log says "Node compute004 appears to have a different
slurm.conf than the slurmctld" etc.  This will happen if you didn't
correctly copy slurm.conf to the nodes.  Please correct this potential error.


Also, please specify which version of Slurm you're running.

/Ole

On 10/22/2017 08:44 PM, JinSung Kang wrote:
I am having trouble with adding new nodes into slurm cluster without 
killing the jobs that are currently running.


Right now I

1. Update the slurm.conf and add a new node to it
2. Copy new slurm.conf to all the nodes,
3. Restart the slurmd on all nodes
4. Restart the slurmctld

But when I restart slurmctld all the jobs that were currently running
are requeued, with (Begin Time) as the reason for not running. The newly
added node works perfectly fine.


I've included the slurm.conf. I've also included slurmctld.log output 
when I'm trying to add the new node.


[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen


I have added nodes to an existing partition several times using the same 
procedure which you describe, and no bad side effects have been noticed. 
 This is a very normal kind of operation in a cluster, where hardware 
may be added or retired from time to time, while the cluster of course 
continues its normal production.  We must be able to do this, especially 
when transferring existing nodes into a new Slurm cluster.


Douglas Jacobsen explained very well why problems may arise.  It seems
to me that this completely rigid nodelist bit mask used in network
communication is a Slurm design problem, and that it ought to be fixed.


Question: How can we pinpoint the problem more precisely in a bug report 
to SchedMD (for support-customers only :-).


/Ole


On 10/22/2017 08:44 PM, JinSung Kang wrote:
I am having trouble with adding new nodes into slurm cluster without 
killing the jobs that are currently running.


Right now I

1. Update the slurm.conf and add a new node to it
2. Copy new slurm.conf to all the nodes,
3. Restart the slurmd on all nodes
4. Restart the slurmctld

But when I restart slurmctld all the jobs that were currently running
are requeued, with (Begin Time) as the reason for not running. The newly
added node works perfectly fine.


I've included the slurm.conf. I've also included slurmctld.log output 
when I'm trying to add the new node.


[slurm-dev] SC17 Slurm BOF session on Nov. 16

2017-10-05 Thread Ole Holm Nielsen


FYI:

For Slurm users participating in the Supercomputing SC17 conference in 
Denver, Colorado, USA:


SchedMD will present a Birds of a Feather (BOF) session:

Time: Thursday, November 16th 12:15pm - 1:15pm
Location: 201-203

http://sc17.supercomputing.org/presentation/?id=bof105&sess=sess312


[slurm-dev] Re: Setting up Environment Modules package

2017-10-05 Thread Ole Holm Nielsen


On 10/04/2017 06:11 PM, Mike Cammilleri wrote:

I'm in search of a best practice for setting up Environment Modules for our 
Slurm 16.05.6 installation (we have not had the time to upgrade to 17.02 yet). 
We're a small group and had no explicit need for this in the beginning, but as 
we are growing larger with more users we clearly need something like this.

I see there are a couple ways to implement Environment Modules and I'm 
wondering which would be the cleanest, most sensible way. I'll list my ideas 
below:

1. Install Environment Modules package and relevant modulefiles on the slurm 
head/submit/login node, perhaps in the default /usr/local/ location. The 
modulefiles would define paths to various software packages that exist 
in a location visible/readable to the compute nodes (NFS or similar). The user 
then loads the modules manually at the command line on the submit/login node 
and not in the slurm submit script - but specify #SBATCH --export=ALL and 
import the environment before submitting the sbatch job.

2. Install Environment Modules packages in a location visible to the entire 
cluster (NFS or similar), including the compute nodes, and the user then 
includes their 'module load' commands in their actual slurm submit scripts 
since the command would be available on the compute nodes - loading software 
(either local or from network locations depending on what they're loading) 
visible to the nodes

3. Another variation would be to use a configuration manager like bcfg2 to make 
sure Environment Modules and necessary modulefiles and all configurations are 
present on all compute/submit nodes. Seems like that's potential for a mess 
though.

Is there a preferred approach? I see in the archives some folks have strange 
behavior when a user uses --export=ALL, so it would seem to me that the cleaner 
approach is to have the 'module load' command available on all compute nodes 
and have users do this in their submit scripts. If this is the case, I'll need 
to configure Environment Modules and relevant modulefiles to live in special 
places when I build Environment Modules (./configure --prefix=/mounted-fs 
--modulefilesdir=/mounted-fs, etc.).

We've been testing with modules-tcl-1.923


I strongly recommend uninstalling the Linux distro "environment-modules" 
package because this old Tcl-based software hasn't been maintained for 
5+ years.  I recommend a very readable paper on various module systems:

http://dl.acm.org/citation.cfm?id=2691141

We use the modern and actively maintained Lmod modules developed at TACC 
(https://www.tacc.utexas.edu/research-development/tacc-projects/lmod) 
together with the EasyBuild module building system (a strong HPC 
community effort, https://github.com/easybuilders/easybuild).


I believe that the TACC supercomputer systems provide Slurm as a 
loadable module, but I don't know any details.  We just install Slurm as 
RPMs on CentOS 7.


We're extremely happy with Lmod and EasyBuild because of the simplicity 
with which 1300+ modules are made available.  I've written a Wiki about 
how we have installed this: 
https://wiki.fysik.dtu.dk/niflheim/EasyBuild_modules.  We put all of our 
modules on a shared NFS file system for all nodes.
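For the users, the end result looks roughly like this (the module name is just an
EasyBuild-style example):

   $ module avail                 # lists everything found on $MODULEPATH (the NFS share)
   $ module load GCC/6.4.0-2.28
   $ module list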


/Ole


[slurm-dev] Re: Setting up Environment Modules package

2017-10-05 Thread Ole Holm Nielsen


On 10/05/2017 08:38 AM, Blomqvist Janne wrote:

what we do is, roughly, a combination of your options #2 and #3. To start with, 
however, I'd like to point out that we're using Lmod instead of the old Tcl 
environment-modules. I'd really recommend you to do the same.

So basically, we have our modules available on NFS, both the module files 
themselves and the software that modules makes available. Then we use 
configuration management (ansible, in our case) to ensure that Lmod is 
installed on all nodes, and that we have a suitable configuration file in 
/etc/profile.d that adds our NFS location to $MODULEPATH so that Lmod can find 
it.

We also use Easybuild to build (most) software and module files, you might want 
to look into that as well.


We use the same approach.
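Concretely, the /etc/profile.d hook Janne mentions can be as small as this sketch
(the file name and NFS path are examples only):

   # /etc/profile.d/zz-site-modules.sh
   # Make the NFS-hosted module tree visible to Lmod on every node
   export MODULEPATH=/nfs/easybuild/modules/all:$MODULEPATH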


And yes, we tell our users to load the appropriate modules in the slurm batch 
scripts rather than relying on slurm to transfer the environment correctly.

As to whether this is preferred, well, it works, but provisioning with 
kickstart + config management gets tedious at scale (say, hundreds of nodes or 
more). If we were to rebuild everything from scratch, I think we'd take a long 
hard look at image-based deployment, e.g. openhpc/warewulf.


We use Kickstart including some post-install scripts to automatically 
install compute nodes with CentOS.  At 800 nodes currently, it's not at 
all tedious to perform installation and config management, IMHO.


In the distant past, we used the image-based approach with SystemImager, 
but I think this was no simpler than the Kickstart-based approach.


/Ole


[slurm-dev] Re: Upgrading Slurm

2017-10-03 Thread Ole Holm Nielsen


On 10/03/2017 03:29 PM, Elisabetta Falivene wrote:
I've been asked to upgrade our slurm installation. I have a slurm 2.3.4 
on a Debian 7.0 wheezy cluster (1 master + 8 nodes). I've not installed 
it so I'm a bit confused about how to do this and how to proceed without 
destroying anything.


I was thinking to upgrade at least to Jessie (Debian 8) but what about 
Slurm? I've read carefully the upgrading section 
(https://slurm.schedmd.com/quickstart_admin.html) of the doc, reading 
that the upgrade must be done incrementally and not jumping from 2.3.4 
to 17, for example.


Yes, you may jump at most two versions per upgrade.
Quoting https://slurm.schedmd.com/quickstart_admin.html#upgrade

Slurm daemons will support RPCs and state files from the two previous minor releases (e.g. a version 16.05.x SlurmDBD will support slurmctld daemons and commands with a version of 16.05.x, 15.08.x or 14.11.x). 



Still, it is not clear to me precisely how to do this. How would you proceed
if asked to upgrade a cluster you know nothing about? What
would you check? What version of OS and Slurm would you choose? What
would you back up? And how would you proceed?


Any info is gold! Thank you


My 2 cents of information:

My Slurm Wiki explains how to upgrade Slurm on CentOS 7:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Probably the general method is the same for Debian.
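As a hedged outline (the database name and paths below are common defaults, adjust to
your site), each incremental upgrade step usually looks like:

   # Back up first:
   mysqldump slurm_acct_db > /root/slurm_acct_db.sql          # if you run slurmdbd/MySQL
   tar czf /root/slurm_statedir.tar.gz /var/spool/slurmctld   # your StateSaveLocation

   # Then, per step (e.g. 2.3 -> 2.5):
   # 1. stop slurmdbd, upgrade it, start it and let it convert the database
   # 2. stop slurmctld, upgrade it, start it
   # 3. upgrade and restart slurmd on the compute nodes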

/Ole


[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-21 Thread Ole Holm Nielsen


My Wiki page summarizes what's known about the pam_slurm_adopt setup. 
May I remind you of my previous answer:



There is now a better understanding of how to use slurm-pam_slurm with Slurm 
17.02.2 or later for limiting SSH access to nodes, see:
  https://bugs.schedmd.com/show_bug.cgi?id=4098

https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#pam-module-restrictions

http://tech.ryancox.net/2015/04/caller-id-handling-ssh-launched-processes-in-slurm.html

Older discussions recommended UsePAM=1 in slurm.conf, but that's a bad idea. 

/Ole

On 09/19/2017 10:39 PM, Jacob Chappell wrote:
Thanks everyone who has replied. I am trying to get pam_slurm_adopt.so 
implemented. Does it work with batch jobs? I keep getting errors, even 
though I have a job running on the node I'm trying to login to:


jacob@condo:~$ sbatch nvidia-docker-test.sh
Submitted batch job 41
jacob@condo:~$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     41     short nvidia-d    jacob  R       0:19      1 gnode001
jacob@condo:~$ ssh gnode001
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection to gnode001 closed by remote host.
Connection to gnode001 closed.

Note that I'm running Slurm 17.02.7 on Ubuntu 16.04.3 LTS.

Jacob Chappell

On Tue, Sep 19, 2017 at 3:42 PM, Trafford, Tyler wrote:



 > On 9/19/17, 11:49 AM, "Trafford, Tyler" wrote:
 >
 >>Have you looked at "pam_slurm_adopt.so"?
 >>
 >>We are using that successfully.  It "adopts" the cgroup of the
user's job.
 >
 > We also use pam_slurm_adopt.so, and I'm mostly happy with it. One
caution
 > is that the doco describes adding a line to /etc/pam.d/sshd to call
 > pam_access prior to calling pam_slurm_adopt, and adding some lines to
 > /etc/security/access.conf; this allows specifying admin users
that bypass
 > pam_slurm_adopt. However, when editing /etc/security/access.conf,
be aware
 > that (on our systems, at least, running CentOS 6.6) pam_access is
also
 > called by /etc/pam.d/crond and /etc/pam.d/atd. Without a line in
 > /etc/security/access.conf explicitly allowing local root processes to
 > start, cron and at jobs were denied by pam_access, which we
didn't notice
 > for a while, since automated node status emails were being driven
by cron
 > jobs.

We didn't run into that side-effect because I made sure to use an
dedicated access file, eg:

# cat /etc/pam.d/slurm-account
account     sufficient    pam_access.so accessfile=/etc/security/slurm.access
account     sufficient    pam_slurm_adopt.so

(I then replace the "account password-auth" with slurm-account in
our /etc/pam.d/sshd.)

-Tyler


[slurm-dev] Re: systemd slurm not starting on boot

2017-09-19 Thread Ole Holm Nielsen


Slurm 15.08.7 is really old; the current version is 17.02.7.
Still, if you read my Wiki about Slurm configuration, perhaps the 
missing item will be discovered: 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration


/Ole

On 09/19/2017 05:17 PM, Kyle Mills wrote:

Hi Ole,

I'm using Ubuntu 16.04 on each head/compute node, and have installed 
slurm-wlm from the apt repositories.  It is slurm 15.08.7.


On Tue, Sep 19, 2017 at 11:07 AM, Ole Holm Nielsen 
<ole.h.niel...@fysik.dtu.dk <mailto:ole.h.niel...@fysik.dtu.dk>> wrote:



If your OS is CentOS/RHEL 7, you may want to consult my Wiki page
about setting up Slurm: https://wiki.fysik.dtu.dk/niflheim/SLURM
<https://wiki.fysik.dtu.dk/niflheim/SLURM>.
If you do things correctly, there should be no problems :-)

/Ole



On 09/19/2017 05:02 PM, Kyle Mills wrote:

Hello,

I'm trying to get SLURM set up on a small cluster comprised of a
head node and 4 compute nodes.  On the head node, I have run

```
sudo systemctl enable slurmctld
```

but after a reboot SLURM is not running and
`sudo systemctl status slurmctld` returns:

```
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service;
enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2017-09-19
10:38:00 EDT; 9min ago
    Process: 1363 ExecStart=/usr/sbin/slurmctld
$SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1395 (code=exited, status=1/FAILURE)

Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 4 nodes
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information
about 0 jobs
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 0
reservations
Sep 19 10:38:00 arcesius slurmctld[1395]: read_slurm_conf:
backup_controller not specified.
Sep 19 10:38:00 arcesius slurmctld[1395]: Running as primary
controller
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information
about 0 sicp jobs
Sep 19 10:38:00 arcesius slurmctld[1395]: error: Error binding
slurm stream socket: Cannot assign requested address
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Main
process exited, code=exited, status=1/FAILURE
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Unit
entered failed state.
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Failed
with result 'exit-code'.
```

If I then run `sudo systemctl start slurmctld`, it starts up
without any errors and my compute nodes can communicate with the
server.  Launching `slurmctld -Dvv` works, and doesn't print
anything that I deem concerning.

Why would it work manually, but not automatically on boot?  If
you need any more information, please let me know; I'm not sure
what is necessary to diagnose this problem.
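One frequent cause of exactly this symptom (an assumption on my part, it is not
diagnosed in this thread) is slurmctld starting before the address it binds to is
up. A systemd drop-in that delays the unit until the network is online would look
like:

   # /etc/systemd/system/slurmctld.service.d/wait-for-network.conf
   [Unit]
   Wants=network-online.target
   After=network-online.target

Run "systemctl daemon-reload" afterwards, and make sure the distribution's
network-online wait service (e.g. NetworkManager-wait-online) is enabled.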


[slurm-dev] Re: systemd slurm not starting on boot

2017-09-19 Thread Ole Holm Nielsen


If your OS is CentOS/RHEL 7, you may want to consult my Wiki page about 
setting up Slurm: https://wiki.fysik.dtu.dk/niflheim/SLURM.

If you do things correctly, there should be no problems :-)

/Ole


On 09/19/2017 05:02 PM, Kyle Mills wrote:

Hello,

I'm trying to get SLURM set up on a small cluster comprised of a head 
node and 4 compute nodes.  On the head node, I have run


```
sudo systemctl enable slurmctld
```

but after a reboot SLURM is not running and `sudo systemctl status 
slurmctld` returns:


```
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; 
vendor preset: enabled)
    Active: failed (Result: exit-code) since Tue 2017-09-19 10:38:00 
EDT; 9min ago
   Process: 1363 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS 
(code=exited, status=0/SUCCESS)

  Main PID: 1395 (code=exited, status=1/FAILURE)

Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 4 nodes
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information about 0 jobs
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 0 reservations
Sep 19 10:38:00 arcesius slurmctld[1395]: read_slurm_conf: 
backup_controller not specified.

Sep 19 10:38:00 arcesius slurmctld[1395]: Running as primary controller
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information about 0 
sicp jobs
Sep 19 10:38:00 arcesius slurmctld[1395]: error: Error binding slurm 
stream socket: Cannot assign requested address
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Main process 
exited, code=exited, status=1/FAILURE
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Unit entered 
failed state.
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Failed with 
result 'exit-code'.

```

If I then run `sudo systemctl start slurmctld`, it starts up without any 
errors and my compute nodes can communicate with the server.  Launching 
`slurmctld -Dvv` works, and doesn't print anything that I deem 
concerning.


Why would it work manually, but not automatically on boot?  If you need 
any more information, please let me know; I'm not sure what is necessary 
to diagnose this problem.


[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-19 Thread Ole Holm Nielsen


On 09/19/2017 03:25 PM, Jacob Chappell wrote:
I found an old mailing list discussion about this. I'm curious if any 
progress has been made since and if there is a solution now?


There is now a better understanding of how to use slurm-pam_slurm with 
Slurm 17.02.2 or later for limiting SSH access to nodes, see:

  https://bugs.schedmd.com/show_bug.cgi?id=4098

https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#pam-module-restrictions

http://tech.ryancox.net/2015/04/caller-id-handling-ssh-launched-processes-in-slurm.html

Older discussions recommended UsePAM=1 in slurm.conf, but that's a bad idea.
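For reference, the pam_slurm_adopt setup described in those links boils down to one
extra account line in the SSH PAM stack (a sketch; place it according to your
distribution's PAM layout):

   # /etc/pam.d/sshd (excerpt)
   account    required     pam_slurm_adopt.so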

Is there a way to limit the SSH sessions of users to the cgroup defined 
by their jobs? I'm using pam_slurm.so to limit SSH access to only those 
users with running jobs. However, if a user reserves say 2 GPUs on a 4 
GPU system, the cgroups only give their job access to 2 GPUs. But, they 
can login and have access to all 4 GPUs. I want to prevent that.


The slurm-pam_slurm module deals with SSH logins only.  I don't know if there
exists a way to confine logged-in users to their job cgroup.


/Ole


[slurm-dev] Announce: Node status tool "pestat" for Slurm updated

2017-09-11 Thread Ole Holm Nielsen
I'm announcing an updated version of the node status tool "pestat" for 
Slurm.


The job list for each node may now optionally include the (expected) job 
EndTime using the -E option.  This information is very useful when you 
are waiting for a draining node to be cleared of jobs.  For example, 
it's nice to know when the node may be shut down for repairs.  The 
attached screenshot shows an example node status.


Download the tool (a bash script) and other files from GitHub:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

Usage: pestat [-p partition(s)] [-u username] [-g groupname]
[-q qoslist] [-s statelist] [-n/-w hostlist] [-j joblist]
[-f | -F | -m free_mem | -M free_mem ] [-1] [-E] [-C/-c] [-V] [-h]
where:
-p partition: Select only partion 
-u username: Print only user 
-g groupname: Print only users in UNIX group 
-q qoslist: Print only QOS in the qoslist 
-s statelist: Print only nodes with state in 
-n/-w hostlist: Print only nodes in hostlist
-j joblist: Print only nodes in job 
-f: Print only nodes that are flagged by * (unexpected load etc.)
-F: Like -f, but only nodes flagged in RED are printed.
-m free_mem: Print only nodes with free memory LESS than free_mem MB
	-M free_mem: Print only nodes with free memory GREATER than free_mem MB 
(under-utilized)
	-1: Only 1 line per node (unique nodes in multiple partitions are 
printed once only)

-E: Job EndTime is printed after each jobid/user
-C: Color output is forced ON
-c: Color output is forced OFF
-h: Print this help information
-V: Version information

Global configuration file for pestat: /etc/pestat.conf
Per-user configuration file for pestat: /root/.pestat.conf

/Ole


[slurm-dev] Re: An issue about slurm on CentOS 7.3

2017-08-28 Thread Ole Holm Nielsen


On 08/25/2017 06:19 PM, Nicholas McCollum wrote:

I like your documentation but I would add a few things:

I highly recommend not having the slurmctld start automatically upon
reboot.  If for some reason the slurm spool directory isn't available
(on a shared folder) it will cause all the jobs to die across the
cluster.  I always like to triple check to make sure that the directory
is available before starting the slurmctld.

I also find it helpful, especially in instances like this, to run the
daemon in foreground mode.

# slurmctld -D
# slurmd -D

This will print out any errors directly on the terminal and you can see
right away while the daemon has crashed or failed to start.


Thanks for your nice comments.  I added a section about manual daemon 
startup to cover the scenario you describe:

https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#manual-startup-of-services

It's difficult to foresee every kind of problem which may occur, but 
it's good to have common scenarios in the documentation.


Our Slurm master server only has local storage, but I suppose that you 
need shared remote storage for Slurm HA controllers?


/Ole


[slurm-dev] Re: An issue about slurm on CentOS 7.3

2017-08-25 Thread Ole Holm Nielsen


On 08/25/2017 01:37 PM, Huijun HJ1 Ni wrote:
> I installed slurm on my cluster whose OS is CentOS 7.3.


  After I completed the configuration, I found that it would be 
hung while executing ‘systemctl start slurm’ on compute nodes(but is ok 
on control node where slurmctld runs).


  But if I used the command ‘systemctl start slurmd’ on compute 
nodes, that were ok.


  So is that a defect in slurm, or is there a problem in my 
configuration? Can you help me?


  Attachment is my configurations.


Please see my HowTo Wiki about Slurm on CentOS/RHEL 7:
https://wiki.fysik.dtu.dk/niflheim/SLURM

Documentation about starting services:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration

/Ole


[slurm-dev] Re: how to configure 2 servers

2017-08-17 Thread Ole Holm Nielsen


On 08/17/2017 01:58 PM, Shlomit Afgin wrote:
I follow 
https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/ to 
install Slurm.


In the instructions, it uses the command '/usr/sbin/create-munge-key -r' 
to build the munge.key on the server.


Then need copy the key file to each one of the nodes.

I would like to have 2 servers; how can I create the munge.key, on the 
nodes, to be good for both servers?


You might want to take a look at my Slurm HowTo wiki for CentOS7: 
https://wiki.fysik.dtu.dk/niflheim/SLURM
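The general munge rule (independent of the wiki) is that every host in the cluster,
servers and compute nodes alike, must share one identical /etc/munge/munge.key.
A sketch, with an example hostname:

   /usr/sbin/create-munge-key                       # run once, on one host only
   scp /etc/munge/munge.key server2:/etc/munge/     # copy the very same file everywhere
   ssh server2 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
   ssh server2 'systemctl restart munge'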


/Ole


[slurm-dev] Re: Slurmd v15 to v17 stopped working (slurmd: fatal: Unable to determine this slurmd's NodeName) on ControlMachine

2017-08-14 Thread Ole Holm Nielsen


Hi Olivier,

You might also want to consult my HowTo wiki for Slurm on CentOS 7:
https://wiki.fysik.dtu.dk/niflheim/SLURM
Lots of little details are discussed in this wiki.

/Ole

On 08/10/2017 03:04 PM, LAHAYE Olivier wrote:


How stupid of me, you're perfectly right!
How on earth was I unable to see that before I upgraded? I really need 
holidays. Sorry for the inconvenience.

Maybe the error message could be enhanced like this:
"This is the slurm controller host; slurmd doesn't need to run on the controller 
host unless you list it as a compute node as well (not recommended)."
--
Olivier LAHAYE
CEA DRT/LIST/DIR


De : Jacek Budzowski [j.budzow...@cyfronet.pl]
Envoyé : jeudi 10 août 2017 14:56
À : slurm-dev
Objet : [slurm-dev] Re: Slurmd v15 to v17 stopped working (slurmd: fatal: 
Unable to determine this slurmd's NodeName) on ControlMachine

Hi,

I think you shouldn't run slurmd on your ControlMachine node (but run
slurmctld and slurmdbd), as in your configuration I don't see that
slurm_master has its NodeName line.
So you should either add slurm_master to your slurm.conf in NodeName
line or not start slurmd on the slurm_master.

Cheers,
Jacek

W dniu 10.08.2017 o 14:36, LAHAYE Olivier pisze:

Hi,

I've upgraded slurm 15.08.3 (built from rpmbuild -tb ) to 17.02.6 on 
centos-7-x86_64.

Since I've done that, slurmd refuses to start on the ControlMachine and on the 
BackupController. (It starts fine on compute nodes.)

The error is: slurmd: fatal: Unable to determine this slurmd's NodeName

If I try to specify the nodename it fails with a different error message:

[root@slurm_master] # slurmd -D -N $(hostname -s)
slurmd: Node configuration differs from hardware: CPUs=0:32(hw) Boards=0:1(hw) 
SocketsPerBoard=0:2(hw) CoresPerSocket=0:8(hw) ThreadsPerCore=0:2(hw)
slurmd: Message aggregation disabled
slurmd: error: find_node_record: lookup failure for slurm_master
slurmd: fatal: ROUTE -- slurm_master not found in node_record_table
[root@slurm_master]# hostname -s
slurm_master

Trying to debug seems to show that the hostname is not in the node hash table.

slurmdbd and slurmctld start fine.
I've googled around, but I only find problems related to compute nodes, not 
Controller or Backup.

Any ideas?


[slurm-dev] Re: ANNOUNCE: A collection of Slurm tools

2017-07-21 Thread Ole Holm Nielsen


Hi Noelia,

I've tested slurmreportmonth on Slurm 16.05 and 17.02 systems.  Do you 
run an older version?


You could run the script to trace the commands executed:
  sh -x slurmreportmonth
The call of sreport I get this way is:
  /usr/bin/sreport -t hourper --tres=cpu,gpu cluster 
AccountUtilizationByUser Start=060117 End=070117 tree


/Ole

On 07/21/2017 01:21 PM, Luque, N.B. wrote:

Thanks for these tools!
I have one question: I downloaded the slurmreportmonth file, and when I run it 
without any options I get an empty file in /tmp/Slurm_report.June_2017

Am I doing something wrong?
When I do :

# sreport cluster accountutilisationbyuser 
format=Accounts,Cluster,CPUCount,Login,Proper,Used start=06/01/17 
end=06/31/17


Then I get the numbers of CPU minutes used.
I also tried to do :

# ./slurmreportmonth -s 0501 -e 0531
Start date 0501
End date 0531
sreport: unrecognized option '--tres=cpu,gpu'
Try "sreport --help" for more information
Report generated to file /tmp/Slurm_report.0501_0531

But again I got an empty file.

Thanks for your help!
Best regards,
Noelia


Dr. Noelia B. Luque
=
Vrije Universiteit, Theoretical Chemistry
De Boelelaan 1083
1081 HV Amsterdam, The Netherlands
T +31 20 598 7620
(at the office Tuesday and Friday From 9:30am to 1:30pm)

On 20 Jul 2017, at 15:57, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk 
<mailto:ole.h.niel...@fysik.dtu.dk>> wrote:



As a small contribution to the Slurm community, I've moved my 
collection of Slurm tools to GitHub at 
https://github.com/OleHolmNielsen/Slurm_tools.  These are tools which 
I feel makes the daily cluster monitoring and management a little easier.


The following Slurm tools are available:

* pestat Prints a Slurm cluster nodes status with 1 line per node and 
job info.


* slurmreportmonth Generate monthly accounting statistics from Slurm 
using the sreport command.


* showuserjobs Print the current node status and batch jobs status 
broken down into userids.


* slurmibtopology Infiniband topology tool for Slurm.

* Slurm triggers scripts.

* Scripts for managing nodes.

* Scripts for managing jobs.

The tools "pestat" and "slurmibtopology" have previously been 
announced to this list, but future updates will be on GitHub only.


I would also like to mention our Slurm deployment HowTo guide at 
https://wiki.fysik.dtu.dk/niflheim/SLURM


/Ole




[slurm-dev] Re: ANNOUNCE: A collection of Slurm tools

2017-07-21 Thread Ole Holm Nielsen


On 07/21/2017 12:00 PM, Loris Bennett wrote:

Thanks for sharing your tools.  Here are some brief comments


I've updated the following tools on 
https://github.com/OleHolmNielsen/Slurm_tools, see the changes below.



- psjob/psnode
   - The USERLIST variable makes the commands a bit brittle, since ps
 will fail if you pass an unknown username.


I've made  a deselect-list consisting only of existing users so that 
older "ps" commands won't fail.



- showuserjobs
   - Doesn't handle usernames longer than 8-chars (we have longer names)


The maximum username length may now be changed in the script:
* export maxlength=11


   - The grouping doesn't seem quite correct.  As shown in the example
 below, not all the users of the group appear under the group total
 for the appropriate group:


Usage of the sort command has been improved.  Please report any 
incorrect sorting.


/Ole


[slurm-dev] Re: ANNOUNCE: A collection of Slurm tools

2017-07-21 Thread Ole Holm Nielsen


Hi Loris,

Thanks so much for your relevant comments!

On 07/21/2017 12:00 PM, Loris Bennett wrote:


Hi Ole,

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:


As a small contribution to the Slurm community, I've moved my collection of
Slurm tools to GitHub at https://github.com/OleHolmNielsen/Slurm_tools.  These
are tools which I feel makes the daily cluster monitoring and management a
little easier.

The following Slurm tools are available:

* pestat Prints a Slurm cluster nodes status with 1 line per node and job info.

* slurmreportmonth Generate monthly accounting statistics from Slurm using the
sreport command.

* showuserjobs Print the current node status and batch jobs status broken down
into userids.

* slurmibtopology Infiniband topology tool for Slurm.

* Slurm triggers scripts.

* Scripts for managing nodes.

* Scripts for managing jobs.

The tools "pestat" and "slurmibtopology" have previously been announced to this
list, but future updates will be on GitHub only.

I would also like to mention our Slurm deployment HowTo guide at
https://wiki.fysik.dtu.dk/niflheim/SLURM

/Ole


Thanks for sharing your tools.  Here are some brief comments

- psjob/psnode
   - The USERLIST variable makes the commands a bit brittle, since ps
 will fail if you pass an unknown username.


Good point!


- showuserjobs
   - Doesn't handle usernames longer than 8-chars (we have longer names)


Good point!


   - The grouping doesn't seem quite correct.  As shown in the example
 below, not all the users of the group appear under the group total
 for the appropriate group:


I tried to make the "sort" command do the final sorting, but I couldn't 
make it put the GROUP_TOTAL first. Maybe I have to move the sorting into 
the awk code...


   
 Username     Jobs  CPUs   Jobs  CPUs  Group     Further info
 ===========  ====  ====   ====  ====  ========  ========================
 GRAND_TOTAL   168  1089     55   451  ALL       running+idle=1540 CPUs 29 users
 GROUP_TOTAL    56   349     10   119  group01   running+idle=468 CPUs 8 users
 user01         27   324      4    52  group02   One, User
 GROUP_TOTAL    27   324      4    52  group02   running+idle=376 CPUs 1 users
 user02         29   174      1     6  group01   Two, User
 GROUP_TOTAL     5   148     18   208  group03   running+idle=356 CPUs 4 users
 user03          3   120     16   176  group03   Three, User
 user04         11    96      3    48  group01   Four, User
 ...
In general, maybe it would be good to have a common config file, where things such as
paths to binaries, USERLIST and username lengths are defined.


Yes, but what's the best way for this?  I'd like the scripts to be 
self-contained so people can pick what they need without doing 
additional setups for users and sysadmins.


/Ole


[slurm-dev] ANNOUNCE: A collection of Slurm tools

2017-07-20 Thread Ole Holm Nielsen


As a small contribution to the Slurm community, I've moved my collection 
of Slurm tools to GitHub at 
https://github.com/OleHolmNielsen/Slurm_tools.  These are tools which I 
feel makes the daily cluster monitoring and management a little easier.


The following Slurm tools are available:

* pestat Prints a Slurm cluster nodes status with 1 line per node and 
job info.


* slurmreportmonth Generate monthly accounting statistics from Slurm 
using the sreport command.


* showuserjobs Print the current node status and batch jobs status 
broken down into userids.


* slurmibtopology Infiniband topology tool for Slurm.

* Slurm triggers scripts.

* Scripts for managing nodes.

* Scripts for managing jobs.

The tools "pestat" and "slurmibtopology" have previously been announced 
to this list, but future updates will be on GitHub only.


I would also like to mention our Slurm deployment HowTo guide at 
https://wiki.fysik.dtu.dk/niflheim/SLURM


/Ole


[slurm-dev] Re: How to set 'future' node state?

2017-07-15 Thread Ole Holm Nielsen


On 14-07-2017 23:26, Robbert Eggermont wrote:
We're adding some nodes to our cluster (17.02.5). In preparation, we've 
defined the nodes in our slurm.conf with "State=FUTURE" (as described in 
the man page). But it doesn't work like this, because when we start the 
slurmd on the nodes, the nodes immediately show up as idle.


When we manually run "scontrol update NodeName=XXX State=FUTURE" the 
node becomes invisible, as expected for State=FUTURE. However, after a 
restart of the node (or the slurmd), the node is in state idle again, 
and jobs get scheduled on the node...


So, how do we make the nodes go into State=FUTURE automatically?
Or do we simply remove the node definitions until the nodes are ready?


You may want to consider this as well:

After adding nodes to slurm.conf, the scontrol man-page says that 
slurmctld must be restarted.  It turns out that all slurmd daemons on 
compute nodes must be restarted as well, see 
https://bugs.schedmd.com/show_bug.cgi?id=3973.  Hopefully this will get 
fixed in 17.11.


/Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen


On 07/06/2017 04:31 PM, Uwe Sauter wrote:


Alternatively you can

   systemctl disable firewalld.service

   systemctl mask firewalld.service

   yum install iptables-services

   systemctl enable iptables.service ip6tables.service

and configure iptables in /etc/sysconfig/iptables and 
/etc/sysconfig/ip6tables, then

   systemctl start iptables.service ip6tables.service


Yes, this is possible, but I would say it's discouraged to do so.
With RHEL/CentOS 7 you really should be using firewalld, and forget 
about the old iptables.  Here's a nice introduction: 
https://www.certdepot.net/rhel7-get-started-firewalld/
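For a Slurm cluster, the firewalld equivalent of Uwe's setup can be as simple as
trusting the cluster-internal interface, or opening only the Slurm ports (6817/6818
by default); the interface name below is just an example:

   firewall-cmd --permanent --zone=trusted --add-interface=eth1
   # or:
   firewall-cmd --permanent --add-port=6817-6818/tcp
   firewall-cmd --reload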


Having worked with firewalld for a while now, I find it more flexible to 
use. Admittedly, there is a bit of a learning curve.



The crucial part is to ensure that either firewalld *or* iptables is running 
but not both. Or you could run without firewall at
all *if* you trust your network…


Agreed.  The compute node network *has to be* trusted in order for Slurm 
to work.


/Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen
in a terminal with any 'verbose'
flags set
e) then start on more low-level diagnostics, such as tcpdump
of network adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam
Huffmans suggestion of the NTP clock times.
Are you SURE that on all nodes you have run the 'date'
command and also 'ntpq -p'
Are you SURE the master node and the node OBU-N6   are both
connecting to an NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

On 5 July 2017 at 16:23, Said Mohamed Said
<said.moha...@oist.jp <mailto:said.moha...@oist.jp>> wrote:

Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said


*From:* Felix Willenborg
<felix.willenb...@uni-oldenburg.de
<mailto:felix.willenb...@uni-oldenburg.de>>
*Sent:* Wednesday, July 5, 2017 9:07:05 PM

*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
When the nodes change to the down state, what is 'sinfo
-R' saying? Sometimes it gives you a reason for that.

Best,
Felix

Am 05.07.2017 um 13:16 schrieb Said Mohamed Said:

Thank you Adam, For NTP I did that as well before
posting but didn't fix the issue.

Regards,
Said


*From:* Adam Huffman <adam.huff...@gmail.com>
<mailto:adam.huff...@gmail.com>
*Sent:* Wednesday, July 5, 2017 8:11:03 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP

I've seen something similar when node clocks were skewed.

Worth checking that NTP is running and they're all
synchronised.

On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
<said.moha...@oist.jp> <mailto:said.moha...@oist.jp>
wrote:
> Thank you all for suggestions. I turned off firewall on both 
machines but
> still no luck. I can confirm that No managed switch is 
preventing the nodes
> from communicating. If you check the log file, there is 
communication for
> about 4mins and then the node state goes down.
    > Any other idea?
> 
> From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
<mailto:ole.h.niel...@fysik.dtu.dk>
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were 
preventing
>> necessary network communication between the nodes, on which 
SLURM
>> relies. You should check if you're using managed switches to 
connect
>> nodes to the network and if so, if they're blocking 
communication on
>> slurm ports.
>
> Managed switches should permit IP layer 2 traffic just like 
unmanaged
> switches!  We only have managed Ethernet switches, and they 
work without
> problems.
>
> Perhaps you meant that Ethernet switches may perform some 
firewall
> functions by themselves?
>
> Firewalls must be off between Slurm compute nodes as well as 
the
> controller host.  See
> 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons

<https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons>
>
> /Ole







[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen
5, 2017 8:11:03 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP

I've seen something similar when node clocks were skewed.

Worth checking that NTP is running and they're all
synchronised.

On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
<said.moha...@oist.jp> <mailto:said.moha...@oist.jp> wrote:
> Thank you all for suggestions. I turned off firewall on both 
machines but
> still no luck. I can confirm that No managed switch is preventing 
the nodes
> from communicating. If you check the log file, there is 
communication for
> about 4mins and then the node state goes down.
    > Any other idea?
> 
> From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
<mailto:ole.h.niel...@fysik.dtu.dk>
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were preventing
>> necessary network communication between the nodes, on which SLURM
>> relies. You should check if you're using managed switches to 
connect
>> nodes to the network and if so, if they're blocking 
communication on
>> slurm ports.
>
> Managed switches should permit IP layer 2 traffic just like 
unmanaged
> switches!  We only have managed Ethernet switches, and they work 
without
> problems.
>
> Perhaps you meant that Ethernet switches may perform some firewall
> functions by themselves?
>
> Firewalls must be off between Slurm compute nodes as well as the
> controller host.  See
> 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons

<https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons>
>
> /Ole


[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2017-07-06 Thread Ole Holm Nielsen


I'd like a second revival of this thread!  The full thread is available 
at 
https://groups.google.com/forum/#!msg/slurm-devel/oDoHPoAbiPQ/q9pQL2Uw3y0J


We're in the process of upgrading Slurm from 16.05 to 17.02.  I'd like 
to be certain that our MPI libraries don't require a specific library 
version such as libslurm.so.30.  See the thread's example "$ readelf -d 
libmca_common_pmi.so":

 0x0001 (NEEDED) Shared library: [libslurm.so.27]

Question: Can anyone suggest which OpenMPI libraries I have to go 
through with readelf in order to make sure we don't have the 
libslurm.so.xx problem?


The libmca_common_pmi.so file doesn't exist on our systems.  We have 
OpenMPI 1.10.3 and 2.0.2 installed with EasyBuild.
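A quick way to scan for the problem (a sketch; the OpenMPI installation prefix is
an example and must be adapted):

   for lib in /opt/easybuild/software/OpenMPI/*/lib/*.so*; do
       if readelf -d "$lib" 2>/dev/null | grep -Eq 'libslurm|libpmi'; then
           echo "== $lib"
           readelf -d "$lib" | grep NEEDED
       fi
   done

Any library whose NEEDED entries name a versioned libslurm.so.NN would be a
candidate for rebuilding after the Slurm upgrade.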


Our builds of OpenMPI were done on top of a Slurm 16.05 base, and our 
build hosts do **not** have the lib64/libpmi2.la and lib64/libpmi.la 
which cause problems.  According to the above thread, these files were 
removed from the slurm-devel RPM package starting from Slurm 16.05.  So 
I hope that we're good...


I expect the consequences of having an undetected libslurm.so.xx problem 
would be that all MPI jobs would start crashing :-(


Thanks for your help,
Ole

On 02/04/2016 11:26 PM, Kilian Cavalotti wrote:

Hi all,

I would like to revive this old thread, as we've been bitten by this
also when moving from 14.11 to 15.08.

On Mon, Oct 5, 2015 at 4:38 AM, Bjørn-Helge Mevik  wrote:

We have verified that we can compile openmpi (1.8.6) against slurm
14.03.7 (with the .la files removed), and then upgrade slurm to 15.08.0
without having to recompile openmpi.

My understanding of linking and libraries is not very thorough,
unfortunately, but according to

https://lists.fedoraproject.org/pipermail/mingw/2012-January/004421.html

the .la files are only needed in order to link against static libraries,
and since Slurm doesn't provide any static libraries, I guess it would
be safe for the slurm-devel rpm not to include these files.


I think the link above describes the situation pretty well. Could we
please remove the .la files from the slurm-devel RPM if they don't
serve any specific purpose?
The attached patch to slurm.spec worked for me.


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen


On 07/05/2017 11:40 AM, Felix Willenborg wrote:

in my network I encountered that managed switches were preventing
necessary network communication between the nodes, on which SLURM
relies. You should check if you're using managed switches to connect
nodes to the network and if so, if they're blocking communication on
slurm ports.


Managed switches should permit IP layer 2 traffic just like unmanaged 
switches!  We only have managed Ethernet switches, and they work without 
problems.


Perhaps you meant that Ethernet switches may perform some firewall 
functions by themselves?


Firewalls must be off between Slurm compute nodes as well as the 
controller host.  See 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons


/Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen


On 07/05/2017 11:25 AM, Ole Holm Nielsen wrote:
Could it be that you have enabled the firewall on the compute nodes? The 
firewall must be turned off (this requirement isn't documented anywhere).


You may want to go through my Slurm deployment Wiki at 
https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see if 
anything obvious is missing in your configuration.


Correction to the web page: https://wiki.fysik.dtu.dk/niflheim/SLURM

Sorry,
Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen


Hi Said,

Could it be that you have enabled the firewall on the compute nodes? 
The firewall must be turned off (this requirement isn't documented 
anywhere).


You may want to go through my Slurm deployment Wiki at 
https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see if 
anything obvious is missing in your configuration.


Best regards,
Ole

On 07/05/2017 11:17 AM, Said Mohamed Said wrote:

Dear Sir/Madam


I am configuring slurm for academic use in my University but I have 
encountered the following problem which I could not found the solution 
from the Internet.



I followed all troubleshooting suggestions from your website with no luck.


Whenever I start slurmd daemon in one of compute node, it starts with 
IDLE state but goes DOWN after 4 minutes with the reason=Node not 
responding.


I am using slurm version 17.02 on both nodes.


tail /var/log/slurmd.log on fault node gives;


***

[2017-07-05T16:56:55.118] Resource spec: Reserved system memory limit 
not configured for this node

[2017-07-05T16:56:55.120] slurmd version 17.02.2 started
[2017-07-05T16:56:55.121] slurmd started on Wed, 05 Jul 2017 16:56:55 +0900
[2017-07-05T16:56:55.121] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 
Memory=128661 TmpDisk=262012 Uptime=169125 CPUSpecList=(null) 
FeaturesAvail=(null) FeaturesActive=(null)

[2017-07-05T16:59:20.513] Slurmd shutdown completing
[2017-07-05T16:59:20.548] Message aggregation disabled
[2017-07-05T16:59:20.549] Resource spec: Reserved system memory limit 
not configured for this node

[2017-07-05T16:59:20.552] slurmd version 17.02.2 started
[2017-07-05T16:59:20.552] slurmd started on Wed, 05 Jul 2017 16:59:20 +0900
[2017-07-05T16:59:20.553] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 
Memory=128661 TmpDisk=262012 Uptime=169270 CPUSpecList=(null) 
FeaturesAvail=(null) FeaturesActive=(null)
*** 





tail /var/log/slurmctld.log on controller node gives;



[2017-07-05T17:54:56.422] 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0

[2017-07-05T17:55:09.004] Node OBU-N6 now responding
[2017-07-05T17:55:09.004] node OBU-N6 returned to service
[2017-07-05T17:59:52.677] error: Nodes OBU-N6 not responding
[2017-07-05T18:03:15.857] error: Nodes OBU-N6 not responding, setting DOWN




The following is my slurm.conf file content;


**

#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#

# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/linear
TreeWidth=50
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/mysql
#AccountingStorageType=accounting_storage/filetxt
#JobCompType=jobcomp/filetxt
#AccountingStorageLoc=/var/log/slurm/accounting
#JobCompLoc=/var/log/slurm/job_completions
ClusterName=obu
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=OBU-N5 NodeAddr=10.251.17.170 CPUs=24 Sockets=2 
CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=OBU-N6 NodeAddr=10.251.17.171 CPUs=24 Sockets=2 
CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=slurm-partition Nodes=OBU-N[5-6] Default=YES 
MaxTime=INFINITE State=UP


**



I can ssh successfully from each node and munge daemon runs on each machine.


Your help will be greatly appreciated,


Sincerely,


Said.




--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev] Re: Topology for message aggregation

2017-07-03 Thread Ole Holm Nielsen


On 07/03/2017 01:18 PM, Ulf Markwardt wrote:

is there a chance to explicitly assign nodes (e.g. machines outside the
HPC machine) for message aggregation?

All I see at the moment is that Slurm uses the (high-speed interconnect)
topology for this. But I do not want to put communication load (noise) on
the compute hosts.


Ulf, what do you mean by "message aggregation"?  The topology.conf file 
also handles nodes with Ethernet interconnect (no Infiniband), we have 
many nodes like that.


/Ole


[slurm-dev] Re: Multifactor Priority Plugin for Small clusters

2017-07-03 Thread Ole Holm Nielsen


On 07/03/2017 08:11 AM, Christopher Samuel wrote:


On 03/07/17 16:02, Loris Bennett wrote:


I don't think you can achieve what you want with Fairshare and
Multifactor Priority.  Fairshare looks at distributing resources fairly
between users over a *period* of time.  At any *point* in time it is
perfectly possible for all the resources to be allocated to one user.


Loris is quite right about this, but it is possible to impose limits on
a project if you chose to use slurmdbd.

First you need to set up accounting:

https://slurm.schedmd.com/accounting.html

then you can set limits:

https://slurm.schedmd.com/resource_limits.html


A more detailed recipe for fairshare and limits setup is in my Wiki 
page: https://wiki.fysik.dtu.dk/niflheim/Slurm_scheduler
My general Slurm page guiding the deployment is 
https://wiki.fysik.dtu.dk/niflheim/SLURM
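As a tiny hedged example of the resource_limits approach (the account name and
limit are made up):

   sacctmgr add account proj42 Description="Project 42"
   sacctmgr modify account proj42 set GrpTRES=cpu=128
   sacctmgr show assoc account=proj42 format=Account,GrpTRES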


I fully agree with Loris and Chris.  Such challenges are universal for 
all clusters and queueing systems.


/Ole


[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.52

2017-06-28 Thread Ole Holm Nielsen


I'm announcing an updated version 0.52 of the node status tool "pestat" 
for Slurm.


New features:

1. The width of the hostname column can now be changed in the CONFIGURE 
section:

export hostnamelength="8"
Thanks to Markus Koeberl <markus.koeb...@tugraz.at> for suggesting this 
feature.


Download the tool (a short bash script) from 
https://ftp.fysik.dtu.dk/Slurm/pestat. If your commands do not live in 
/usr/bin, please make appropriate changes in the CONFIGURE section at 
the top of the script.


Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s 
statelist] [-n/-w hostlist]

[-f | -m free_mem | -M free_mem ] [-C/-c] [-V] [-h]
where:
-p partition: Select only partion 
-u username: Print only user 
-q qoslist: Print only QOS in the qoslist 
-s statelist: Print only nodes with state in 
-n/-w hostlist: Print only nodes in hostlist
-f: Print only nodes that are flagged by * (unexpected load etc.)
-m free_mem: Print only nodes with free memory LESS than free_mem MB
	-M free_mem: Print only nodes with free memory GREATER than free_mem MB 
(under-utilized)

-C: Color output is forced ON
-c: Color output is forced OFF
-h: Print this help information
-V: Version information


I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status, for 
example:


# pestat  -f
Print only nodes that are flagged by *
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State Use/Tot           (MB)     (MB)     JobId User ...
a066       xeon8*        alloc   8   8     8.04    23900     173*  91683 user01
a067       xeon8*        alloc   8   8     8.07    23900     181*  91683 user01
a083       xeon8*        alloc   8   8     8.06    23900     172*  91683 user01



The -s option is useful for checking on possibly unusual node states, 
for example:


# pestat -s mixed

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.51

2017-06-28 Thread Ole Holm Nielsen


I'm announcing an updated version 0.51 of the node status tool "pestat" 
for Slurm.


New features:

1. Turning on colors explicitly even when the output doesn't go to a 
terminal with the -C flag (and -c to turn off colors). Thanks to Fermin 
Molina <fmol...@nlhpc.cl> for requesting this!  Fermin suggests enabling 
nice continuous monitoring of "flagged" nodes with:


# watch -n 60 --color 'pestat -f -C'

2. Added -n/-w hostlist to select a subset of nodes.  The -n form is for 
compatibility with sinfo, whereas -w is compatible with pdsh and clush 
(ClusterShell).


Download the tool (a short bash script) from 
https://ftp.fysik.dtu.dk/Slurm/pestat. If your commands do not live in 
/usr/bin, please make appropriate changes in the CONFIGURE section at 
the top of the script.


Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s statelist] [-n/-w hostlist]
       [-f | -m free_mem | -M free_mem ] [-C/-c] [-V] [-h]
where:
-p partition: Select only partition <partition>
-u username: Print only user <username>
-q qoslist: Print only QOS in the qoslist <qoslist>
-s statelist: Print only nodes with state in <statelist>
-n/-w hostlist: Print only nodes in hostlist
-f: Print only nodes that are flagged by * (unexpected load etc.)
-m free_mem: Print only nodes with free memory LESS than free_mem MB
-M free_mem: Print only nodes with free memory GREATER than free_mem MB (under-utilized)
-C: Color output is forced ON
-c: Color output is forced OFF
-h: Print this help information
-V: Version information

I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status, for 
example:


# pestat  -f
Print only nodes that are flagged by *
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State Use/Tot           (MB)     (MB)     JobId User ...
a066       xeon8*        alloc   8   8     8.04    23900     173*  91683 user01
a067       xeon8*        alloc   8   8     8.07    23900     181*  91683 user01
a083       xeon8*        alloc   8   8     8.06    23900     172*  91683 user01



The -s option is useful for checking on possibly unusual node states, 
for example:


# pestat -s mixed

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: slurm-dev Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-27 Thread Ole Holm Nielsen


On 26-06-2017 17:20, Adrian Sevcenco wrote:


On 06/22/2017 01:34 PM, Ole Holm Nielsen wrote:


I'm announcing an updated version 0.50 of the node status tool 
"pestat" for Slurm.  I discovered how to obtain the node Free Memory 
with sinfo, so now we can do nice things with memory usage!


Hi! Thank you for the great tool! I don't know if this is intended, but:

[Monday 26.06.17 18:12] adrian@sev : ~  $
sinfo -N -t idle -o "%N %P %C %O %m %e %t" | column -t
NODELIST   PARTITION  CPUS(A/I/O/T)  CPU_LOAD  MEMORY  FREE_MEM  STATE
localhost  local* 0/8/0/80.03  14984   201   idle

[Monday 26.06.17 18:13] adrian@sev : ~  $
free -m
              total        used        free      shared  buff/cache   available
Mem:          14984         392         182         134       14409       14081
Swap:          8191           0        8191

[Monday 26.06.17 18:13] adrian@sev : ~  $
pestat
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State Use/Tot           (MB)     (MB)     JobId User ...

localhost  local*        idle    0   8     0.03    14984     201*


While it is clear that the reported free mem is what "free" reports as 
"free", one might argue that the buffers/cache memory is also available 
for use, as it will shrink with application usage ...


Maybe the FREE_MEM should be reported as (free + cached) ?


The pestat tool simply reports the free_mem value provided by sinfo.
I'm not sure I understand your point, but only SchedMD can change 
Slurm's reporting.
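If you want to compare the kernel's numbers with what Slurm reports for a 
node, a quick check along these lines (just a sketch) may be useful:

# Kernel's view on the node (MB): free, buff/cache and available
free -m | awk '/^Mem:/ {print "free="$4, "buff/cache="$6, "available="$7}'
# Slurm's view of the same node: Memsize, FreeMem and state
sinfo -N -n $(hostname -s) -o "%N %m %e %t"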


/Ole


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen


Thanks Paul!  Would you know the answer to:

Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


Thanks,
Ole

On 06/26/2017 04:02 PM, Paul Edmon wrote:


Yeah, we keep around a test cluster environment for that purpose to vet 
slurm upgrades before we roll them on the production cluster.


Thus far no problems.  However, paranoia is usually a good thing for 
cases like this.


-Paul Edmon-


On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote:


On 06/26/2017 01:24 PM, Loris Bennett wrote:

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?


I want to be 99.9% sure that upgrading (my first one) will actually 
work.  I also want to know roughly how long the slurmdbd will be down 
so that the cluster doesn't kill all jobs due to timeouts.  Better to 
be safe than sorry.


I don't expect to inform the users, since the operation is expected to 
run smoothly without troubles for user jobs.


Thanks,
Ole


Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database
Question 2: Can anyone confirm that the output "slurmdbd: debug2: Everything rolled up" indeed signifies that conversion is complete? 
We did it in place, worked as noted on the tin. It was less painful

than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective 
action, to build collective power, to achieve collective 
transformation, rooted in grief and rage but pointed towards vision 
and dreams."


- Patrisse Cullors, Black Lives Matter founder

On 26 June 2017 at 20:04, Ole Holm Nielsen 
<ole.h.niel...@fysik.dtu.dk> wrote:


  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most 
critical step seems to me to be the upgrade of the slurmdbd 
database, which may also take tens of minutes.


  I thought it's a good idea to test the slurmdbd database upgrade 
locally on a drained compute node in order to verify both 
correctness and the time required.


  I've developed the dry run upgrade procedure documented in the 
Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm


  Question 1: Would people who have real-world Slurm upgrade 
experience kindly offer comments on this procedure?


  My testing was actually successful, and the database conversion 
took less than 5 minutes in our case.


  A crucial step is starting the slurmdbd manually after the 
upgrade. But how can we be sure that the database conversion has 
been 100% completed?


  Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


  Thanks,
  Ole








--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen


On 06/26/2017 01:24 PM, Loris Bennett wrote:

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?


I want to be 99.9% sure that upgrading (my first one) will actually 
work.  I also want to know roughly how long the slurmdbd will be down so 
that the cluster doesn't kill all jobs due to timeouts.  Better to be 
safe than sorry.


I don't expect to inform the users, since the operation is expected to 
run smoothly without troubles for user jobs.


Thanks,
Ole


Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database

We did it in place, worked as noted on the tin. It was less painful
than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective action, to build 
collective power, to achieve collective transformation, rooted in grief and rage but 
pointed towards vision and dreams."

- Patrisse Cullors, Black Lives Matter founder

On 26 June 2017 at 20:04, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:

  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most critical step 
seems to me to be the upgrade of the slurmdbd database, which may also take 
tens of minutes.

  I thought it's a good idea to test the slurmdbd database upgrade locally on a 
drained compute node in order to verify both correctness and the time required.

  I've developed the dry run upgrade procedure documented in the Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

  Question 1: Would people who have real-world Slurm upgrade experience kindly 
offer comments on this procedure?

  My testing was actually successful, and the database conversion took less 
than 5 minutes in our case.

  A crucial step is starting the slurmdbd manually after the upgrade. But how 
can we be sure that the database conversion has been 100% completed?

  Question 2: Can anyone confirm that the output "slurmdbd: debug2: Everything 
rolled up" indeed signifies that conversion is complete?

  Thanks,
  Ole






--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-26 Thread Ole Holm Nielsen


Hi Kilian,

Thanks for explaining how to configure ClusterShell correctly for Slurm! 
 I've updated my Wiki information in 
https://wiki.fysik.dtu.dk/niflheim/SLURM#clustershell now.


I would suggest you to add your examples to the ClusterShell 
documentation, where I feel it may be hidden or missing.


/Ole

On 06/23/2017 06:37 PM, Kilian Cavalotti wrote:

But how do I configure it for Slurm?  I've copied the example file to
/etc/clustershell/groups.conf.d/slurm.conf, but this doesn't enable Slurm
partitions (here: xeon24) as ClusterShell groups:

# clush -g xeon24 date
Usage: clush [options] command
clush: error: No node to run on.

Could you kindly explain this (and perhaps add examples to the
documentation)?
Cheers,



--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620

Sure! That's because the groups.conf.d/slurm.conf file defines new
group sources [1]. ClusterShell supports multiple group sources, ie.
multiple sources of information to define groups. There is a default
one, defined in groups.conf, which will be used when a group name is
used, without specifying anything else, as in your "clush -g xeon24
date" command. But since the "slurm" group source is not the default,
it's not used to map the "xeon24" group to the corresponding Slurm
partition.

So, you can either:

* use the -s option to specify a group source, or prefix the group
name with the group source name in the command line, like this:

 $ clush -s slurm -g xeon24 date

or, more compact:

$ clush -w@slurm:xeon24 date

* or if you don't plan to use any other group source than "slurm", you
can make it the default with the following in
/etc/clustershell/groups.conf:

[Main]
# Default group source
default: slurm


With the example Slurm group source, you can easily execute commands
on all the nodes from a given partition, but also on nodes based on
their Slurm state, like:

$ clush -w@slurmstate:drained date


Hope this makes things a bit clearer.

[1] 
https://clustershell.readthedocs.io/en/latest/config.html#external-group-sources


[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-26 Thread Ole Holm Nielsen


On 23-06-2017 17:20, Belgin, Mehmet wrote:
One thing I noticed is that pestat reports zero Freemem until a job is 
allocated on nodes. I’d expect it to report the same value as Memsize if 
no jobs are running. I wanted to offer this as a suggestion since zero 
free memory on idle nodes may be a bit confusing for users.

...

Before Job allocation
# pestat -p vtest
Print only nodes in partition vtest
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot           (MB)     (MB)     JobId User ...

devel-pcomp1   vtest*        idle    0  12     0.02   129080     *0*
devel-vcomp1   vtest*        idle    0   2     0.02     5845     *0*
devel-vcomp2   vtest*        idle    0   2     0.00     5845     *0*
devel-vcomp3   vtest*        idle    0   2     0.03     5845     *0*
devel-vcomp4   vtest*        idle    0   2     0.01     5845     *0*


I'm not seeing the incorrect Freemem that you report.  I get sensible 
numbers for Freemem:


# pestat -s idle
Select only nodes with state=idle
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State Use/Tot           (MB)     (MB)     JobId User ...

a017       xeon8*        idle    0   8    4.25*   23900    21590
a077       xeon8*        idle    0   8    3.47*   23900    22964
b003       xeon8*        idle    0   8    8.01*   23900    16839
b046       xeon8*        idle    0   8    0.01    23900    22393
b066       xeon8*        idle    0   8    2.84*   23900    18610
b081       xeon8*        idle    0   8    0.01    23900    21351
g021       xeon16        idle    0  16    0.01    64000    52393
g022       xeon16        idle    0  16    0.01    64000    60717
g039       xeon16        idle    0  16    0.01    64000    61795
g048       xeon16        idle    0  16    0.01    64000    62338
g074       xeon16        idle    0  16    0.01    64000    62274
g076       xeon16        idle    0  16    0.01    64000    58854

You should use sinfo directly to verify Slurm's data:

 sinfo -N -t idle -o "%N %P %C %O %m %e %t"

FYI: We run Slurm 16.05 and have configured Cgroups.

/Ole


[slurm-dev] Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen


We're planning to upgrade Slurm 16.05 to 17.02 soon.  The most critical 
step seems to me to be the upgrade of the slurmdbd database, which may 
also take tens of minutes.


I thought it's a good idea to test the slurmdbd database upgrade locally 
on a drained compute node in order to verify both correctness and the 
time required.


I've developed the dry run upgrade procedure documented in the Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
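In outline, the dry run on the test node looks roughly like this (a sketch 
only; the Wiki page has the full details and safety checks):

# On the production database host: dump the accounting database (credentials omitted)
mysqldump slurm_acct_db > slurm_acct_db.sql
# On the drained test node: install MariaDB plus the NEW slurmdbd packages,
# point a local slurmdbd.conf at the local database, then load the dump
mysql -e 'CREATE DATABASE slurm_acct_db'
mysql slurm_acct_db < slurm_acct_db.sql
# Start the new slurmdbd in the foreground with verbose logging and time the conversion
time slurmdbd -D -vvv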


Question 1: Would people who have real-world Slurm upgrade experience 
kindly offer comments on this procedure?


My testing was actually successful, and the database conversion took 
less than 5 minutes in our case.


A crucial step is starting the slurmdbd manually after the upgrade.  But 
how can we be sure that the database conversion has been 100% completed?


Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


Thanks,
Ole


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-23 Thread Ole Holm Nielsen


On 06/22/2017 06:22 PM, Kilian Cavalotti wrote:

ClusterShell is incredibly useful, it provides not only a parallel
shell for remote execution (and file distribution, output aggregation
or diff'ing...), but also an event-driven Python library that can be
used in your Python scripts, and CLI tools to manipulate node sets
(any kind of logical operation between node groups, expansion,
multi-dimensional folding, counting, stepping, you name it). Oh, and
the tree mode [3]? You have to try it.

I can only encourage you to give a look at the documentation [2],
there are too many awesome features to describe here. ;)

[1] https://cea-hpc.github.io/clustershell/
[2] https://clustershell.readthedocs.io/en/latest/intro.html
[3] https://clustershell.readthedocs.io/en/latest/tools/clush.html#tree-mode


Yes, ClusterShell has indeed lots of features and compares favorably to 
PDSH.  I've added a brief description in my Slurm Wiki 
https://wiki.fysik.dtu.dk/niflheim/SLURM#clustershell, please comment on 
it off-line if you have the time.


However, after a brief reading of the ClusterShell manual, it hasn't 
dawned upon me how I use it with Slurm partitions.  The basic 
functionality is OK:


# clush -w i[001-003] date
i001: Fri Jun 23 09:52:29 CEST 2017
i003: Fri Jun 23 09:52:29 CEST 2017
i002: Fri Jun 23 09:52:29 CEST 2017

But how do I configure it for Slurm?  I've copied the example file to 
/etc/clustershell/groups.conf.d/slurm.conf, but this doesn't enable 
Slurm partitions (here: xeon24) as ClusterShell groups:


# clush -g xeon24 date
Usage: clush [options] command
clush: error: No node to run on.

Could you kindly explain this (and perhaps add examples to the 
documentation)?


Thanks,
Ole


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-23 Thread Ole Holm Nielsen


On 06/22/2017 06:39 PM, Michael Jennings wrote:


On Thursday, 22 June 2017, at 04:19:04 (-0600),
Loris Bennett wrote:


   rpmbuild --rebuild --with=slurm --without=torque pdsh-2.26-4.el6.src.rpm


Remove the equals signs.  I have no problems building pdsh 2.29 via:

   rpmbuild --rebuild --with slurm --without torque pdsh-2.29-1.el7.src.rpm

for EL5, EL6, and EL7.


You're right!  On CentOS 6.9 with rpm-build-4.8.0-55.el6.x86_64 you 
apparently need to remove the "=" signs as you say.  Then rpmbuild also 
works for me as documented in 
https://wiki.fysik.dtu.dk/niflheim/SLURM#pdsh-parallel-distributed-shell.


The requirement of removing "=" must be a bug since "rpmbuild --help" says:

  --with=<option>       enable configure <option> for build
  --without=<option>    disable configure <option> for build

Unfortunately, PDSH 2.26 has an error related to pdsh-mod-slurm, so 
rpmbuild bombs out with:


Processing files: pdsh-mod-slurm-2.26-4.el6.x86_64
error: File not found by glob: 
/root/rpmbuild/BUILDROOT/pdsh-2.26-4.el6.x86_64/usr/lib64/pdsh/slurm.*


The PDSH homepage has apparently moved recently to 
https://github.com/grondo/pdsh and now offers version pdsh-2.32 
(2017-06-22).  However, I'm currently unable to build an RPM from this 
version.


Michael: From where can one download pdsh-2.29-1.el7.src.rpm?  I can 
only find version 2.31 at 
https://dl.fedoraproject.org/pub/epel/7/SRPMS/p/pdsh-2.31-1.el7.src.rpm, 
and this version won't build on EL6.


Thanks,
Ole


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Ole Holm Nielsen


On 06/22/2017 12:18 PM, Loris Bennett wrote:

I have just realised that pdsh, which was what I wanted the consolidated
list for, has a Slurm module, which knows about Slurm jobs.  I followed
your instructions here:

   https://wiki.fysik.dtu.dk/niflheim/SLURM#pdsh-parallel-distributed-shell

with some modifications for EPEL6.  However, in the 'rebuild' line

   rpmbuild --rebuild --with=slurm --without=torque pdsh-2.26-4.el6.src.rpm

fails with

   --with=slurm: unknown option

The page https://github.com/grondo/pdsh implies it should be

   rpmbuild --rebuild --with-slurm --without-torque pdsh-2.26-4.el6.src.rpm

but this also fails:

   --with-slurm: unknown option

Any ideas what I'm doing wrong?


Not really.  But I notice that the pdsh RPM:
https://dl.fedoraproject.org/pub/epel/6/SRPMS/pdsh-2.26-4.el6.src.rpm
is dated 2012-09-07

I also get the error:

# rpmbuild --rebuild --with=slurm --without=torque pdsh-2.26.tar.bz2
--with=slurm: unknown option

I haven't been able to find documentation of rpmbuild for EL6.  Further 
investigation is required.


/Ole


[slurm-dev] slurm-dev Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-22 Thread Ole Holm Nielsen


I'm announcing an updated version 0.50 of the node status tool "pestat" 
for Slurm.  I discovered how to obtain the node Free Memory with sinfo, 
so now we can do nice things with memory usage!


New features:

1. The "pestat -f" will flag nodes with less than 20% free memory.

2. Now "pestat -m 1000" will print nodes with less than 1000 MB free memory.

3. Use "pestat -M 20" to print nodes with greater than 20 MB 
free memory.  Jobs on such under-utilized nodes might better be 
submitted to lower-memory nodes.


Download the tool (a short bash script) from 
https://ftp.fysik.dtu.dk/Slurm/pestat. If your commands do not live in 
/usr/bin, please make appropriate changes in the CONFIGURE section at 
the top of the script.


Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s statelist]
       [-f | -m free_mem | -M free_mem ] [-V] [-h]
where:
-p partition: Select only partition <partition>
-u username: Print only user <username>
-q qoslist: Print only QOS in the qoslist <qoslist>
-s statelist: Print only nodes with state in <statelist>
-f: Print only nodes that are flagged by * (unexpected load etc.)
-m free_mem: Print only nodes with free memory LESS than free_mem MB
-M free_mem: Print only nodes with free memory GREATER than free_mem MB (under-utilized)
-h: Print this help information
-V: Version information


I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status, for 
example:


# pestat  -f
Print only nodes that are flagged by *
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State Use/Tot           (MB)     (MB)     JobId User ...
a066       xeon8*        alloc   8   8     8.04    23900     173*  91683 user01
a067       xeon8*        alloc   8   8     8.07    23900     181*  91683 user01
a083       xeon8*        alloc   8   8     8.06    23900     172*  91683 user01



The -s option is useful for checking on possibly unusual node states, 
for example:


# pestat -s mixed

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Ole Holm Nielsen


You may want to throw in a uniq command in case the user runs multiple 
jobs on some nodes:


# squeue -u user123 -h -o "%N" | tr '\n' , | xargs scontrol show 
hostlistsorted

b[135,135,135]

This gives a better list:

# squeue -u user123 -h -o "%N" | uniq | tr '\n' , | xargs scontrol show 
hostlistsorted

b135

BTW, if you enter a non-existent user, the output is an unexpected error 
message and a long help info :-)


/Ole

On 06/22/2017 11:29 AM, Jens Dreger wrote:


I think

   squeue -u user123 -h -o "%N" | tr '\n' , | xargs scontrol show hostlistsorted

should also do it... Slightly better to remember ;)

On Thu, Jun 22, 2017 at 02:59:11AM -0600, Loris Bennett wrote:


Hi,

I can generate a list of node lists on which the jobs of a given user
are running with the following:

   $ squeue -u user123 -h -o "%N"
   node[006-007,014,016,021,024]
   node[012,094]
   node[005,008-011,013,015,026,095,097-099]

I would like to merge these node lists to obtain

   node[005-016,021,024,026,094-095,097-099]

I can do the following:

   $ squeue -u user123 -h -o "%N" | xargs -I {} scontrol show hostname {} | sed ':a;N;$!ba;s/\n/,/g' | xargs scontrol show hostlistsorted
   node[005-016,021,024,026,094-095,097-099]

Would it be worth adding an option to allow the delimiter in the output
of 'scontrol show hostname' to be changed from an newline to, say, a
comma?  That would permit easier manipulation of node lists without
one having to google the appropiate sed magic.





--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Ole Holm Nielsen


On 06/20/2017 04:32 PM, Loris Bennett wrote:

We do our upgrades while full production is up and running.  We just stop
the Slurm daemons, dump the database and copy the statesave directory
just in case.  We then do the update, and finally restart the Slurm
daemons.  We only lost jobs once during an upgrade back around 2.2.6 or
so, but that was due a rather brittle configuration provided by our
vendor (the statesave path contained the Slurm version), rather than
Slurm itself and was before we had acquired any Slurm expertise
ourselves.


1. When you refer to "daemons", do you mean slurmctld, slurmdbd as well 
as slurmd on all compute nodes?  AFAIK, the recommended procedure is to 
upgrade and restart in this order: 1) slurmdbd, 2) slurmctld, 3) 
slurmd on the nodes.


2. When you mention statesave, I suppose this is what you refer to:
# scontrol show config | grep -i statesave
StateSaveLocation   = /var/spool/slurmctld
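If so, I suppose a plain dated copy, taken while slurmctld is stopped, is 
all that is needed, e.g.:

# With slurmctld stopped, keep a dated copy of the statesave directory
tar -czf /root/slurmctld-statesave-$(date +%F).tar.gz /var/spool/slurmctld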

Thanks,
Ole


[slurm-dev] Re: Can't get formatted sinfo to work...

2017-06-20 Thread Ole Holm Nielsen


Hi Mehmet,

Perhaps you need to configure NHC to use the short hostname, see the 
example in 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check
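If I remember the NHC variable names correctly, the relevant setting is 
along these lines (an assumption about a stock LBNL NHC install):

# /etc/sysconfig/nhc: make NHC match nhc.conf patterns against the short hostname
HOSTNAME="$HOSTNAME_S"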


/Ole

On 06/19/2017 05:09 PM, Belgin, Mehmet wrote:
Thank you Loris, it was my bad. I should have used the short hostname, 
which seems to be working for me as well:


$ sinfo -o '%t %E' -hn `hostname -s`
drain Testing



On Jun 19, 2017, at 2:28 AM, Loris Bennett > wrote:



Hi Mehmet,

"Belgin, Mehmet" > writes:



I’m troubleshooting an issue that causes NHC to fail to offline a bad
node. The node offline script uses formatted “sinfo" to identify the
node status, which returns blank for some reason. Interestingly, sinfo
works without custom formatting.

Could this be due to a bug in the current version (17.02.4)? Would
someone mind trying the following commands in an older Slurm version
to compare the output?

[root@devel-vcomp1 nhc]# sinfo --version
slurm 17.02.4

[root@devel-vcomp1 nhc]# sinfo -o '%t %E' -hn `hostname`

(NOTHING!)

[root@devel-vcomp1 nhc]# sinfo -hn `hostname`
test up infinite 0 n/a
vtest* up infinite 0 n/a

(OK)

Thanks!

-Mehmet



Seem to work as expected with our version:

[root@node003 ~]# sinfo --version
slurm 16.05.10-2
[root@node003 ~]# sinfo -o '%t %E' -hn `hostname`
mix none
[root@node003 ~]# sinfo -hn `hostname`
test   up3:00:00  0n/a
main*  up 14-00:00:0  1mix node003
gpuup 14-00:00:0  0n/a

HTH,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin 
emailloris.benn...@fu-berlin.de 





[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.41

2017-06-16 Thread Ole Holm Nielsen


I'm announcing an updated version 0.41 of the node status tool "pestat" 
for Slurm.  Colored output is now printed also when all nodes are 
listed, and the logic has been cleaned up a bit.


Download the tool (a short bash script) from 
https://ftp.fysik.dtu.dk/Slurm/pestat. If your commands do not live in 
/usr/bin, please make appropriate changes in the CONFIGURE section at 
the top of the script.


Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s statelist] [-f] [-V] [-h]
where:
-p partition: Select only partition <partition>
-u username: Print only user <username>
-q qoslist: Print only QOS in the qoslist <qoslist>
-s statelist: Print only nodes with state in <statelist>
-f: Print only nodes that are flagged by * (unexpected load etc.)
-h: Print this help information
-V: Version information

I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status, for 
example:


# pestat -f
Select only user user01
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Joblist
                        State Use/Tot           (MB)     JobId User ...
g045       xeon16        alloc  16  16   11.81*   64000   84943 user01
g047       xeon16        alloc  16  16   11.79*   64000   84943 user01
g068       xeon16        alloc  16  16   15.11*   64000   84943 user01

The -s option is useful for checking on possibly unusual node states, 
for example:


# pestat -s mixed

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Tools for using strigger to monitor nodes?

2017-05-23 Thread Ole Holm Nielsen


I'd like to configure E-mail notifications of failing nodes.  I already 
use the LBL NHC (Node Health Check) on the compute nodes to send alerts, 
but one may also use the Slurm strigger mechanism on the slurmctld host.


The examples in http://slurm.schedmd.com/strigger.html are quite 
rudimentary, so I've experimented with writing a slightly more 
user-friendly script: https://ftp.fysik.dtu.dk/Slurm/notify_nodes_down
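For reference, registering such a script as a permanent trigger boils down 
to something like this (the path is of course site-specific):

# Register a permanent trigger that runs the script whenever a node goes DOWN
strigger --set --node --down --flags=PERM --program=/usr/local/bin/notify_nodes_down
# List the currently registered triggers
strigger --get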


Question: Has anyone already written (and is willing to share) the 
ultimately clever tool using strigger to monitor nodes going down etc.?


Thanks,
Ole


[slurm-dev] Re: How to get pids of a job

2017-05-11 Thread Ole Holm Nielsen


I have written a small tool to display the user processes in a job:
  https://ftp.fysik.dtu.dk/Slurm/sshjob

An example output is:

# sshjob  57811
Nodelist for job-id 57811: a128
Node usage: NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

a128

  PID S USER   STARTED NLWP       TIME %CPU     RSS COMMAND
18560 S userNN May 10     1   00:00:00  0.0    1584 /bin/bash /var/spool/slurmd/job57811/slurm_script
18580 S userNN May 10     1   00:00:00  0.0   22792 python vasp.py
18585 S userNN May 10     1   00:00:00  0.0    1468 sh -c mpiexec /home/niflheim/nirama/vasp.5.4.1_sy
18586 S userNN May 10     5   00:00:01  0.0    7148 mpiexec /home/niflheim/nirama/vasp.5.4.1_sylg/vas
18592 R userNN May 10     3 1-06:45:34 99.5 1917736 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT
18594 R userNN May 10     3 1-06:51:23 99.8 1927368 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT
18595 R userNN May 10     3 1-06:51:36 99.8 1918248 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT
18596 R userNN May 10     3 1-06:51:22 99.8 1916808 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT
18597 R userNN May 10     3 1-06:51:32 99.8 1913816 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT
18598 R userNN May 10     3 1-06:51:22 99.8 1920048 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT
18599 R userNN May 10     3 1-06:51:32 99.8 1917112 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT
18600 R userNN May 10     3 1-06:51:23 99.8 1916796 /home/niflheim/nirama/vasp.5.4.1_sylg/vasp_VT



The PDSH tool is required, see some advice in:
https://wiki.fysik.dtu.dk/niflheim/SLURM#pdsh-parallel-distributed-shell
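Roughly speaking, the same information can also be collected by hand along 
these lines (a sketch using the job-id and user from the example above):

# Get the node list for the job and run ps there for the job owner
nodes=$(squeue -j 57811 -h -o %N)
pdsh -w "$nodes" "ps -u userNN -o pid,stat,user,start,nlwp,time,pcpu,rss,cmd"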

A similar tool displays user processes on a specified node:
  https://ftp.fysik.dtu.dk/Slurm/sshps

I hope this helps.

/Ole


On 05/11/2017 04:01 PM, John Hearns wrote:

A good tool to us on the nodes when you have the list of nodes is 'pgrep'
https://linux.die.net/man/1/pgrep



On 11 May 2017 at 15:44, Jason Bacon > wrote:




Parse the node names from squeue output (-o can help if you want to
automate this) and then run ps or top on those nodes.

Cheers,

 JB

On 05/11/17 04:07, GHui wrote:

How to get pids of a job

I want to get a job's pids on nodes. How could I do that?

--GHui


[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.40

2017-05-09 Thread Ole Holm Nielsen



I'm announcing an updated version 0.40 of the node status tool "pestat" 
for Slurm.


Download the tool (a short bash script) from 
https://ftp.fysik.dtu.dk/Slurm/pestat


Thanks to Daniel Letai for recommending better script coding styles. If 
your commands do not live in /usr/bin, please make appropriate changes 
in the CONFIGURE section at the top of the script.


New options have been added as shown by the help information:

# pestat  -h
Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s statelist] [-f] [-V] [-h]
where:
-p partition: Select only partition <partition>
-u username: Print only user <username>
-q qoslist: Print only QOS in the qoslist <qoslist>
-s statelist: Print only nodes with state in <statelist>
-f: Print only nodes that are flagged by * (unexpected load etc.)
-h: Print this help information
-V: Version information

I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status.


The -s option is useful for checking on possibly unusual node states, 
for example "pestat -s mix".


--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh" updated version 0.21

2017-05-09 Thread Ole Holm Nielsen


I'm announcing an updated version 0.21 of an Infiniband topology tool 
"slurmibtopology.sh" for Slurm.  The output may be used as a starting 
point for writing your own topology.conf file.


Download the script from https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Thanks to Felip Moll for testing the script on a rather large IB network.

Motivation: I had to create a Slurm topology.conf file and needed an 
automated way to get the correct node and switch Infiniband 
connectivity.  The manual page 
https://slurm.schedmd.com/topology.conf.html refers to an outdated tool 
ib2slurm.


Version 0.21 reads switch-to-switch links and prints out lines with 
"Switches=..." for those switches with 0 compute node (HCA) links.

An option -c will delete all the (possibly useful) comment lines.

Example: Running this script on our Infiniband network:

# ./slurmibtopology.sh -c
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches.

Beware: The Switches= lines need to be reviewed and edited for correctness.
Read also https://slurm.schedmd.com/topology.html

SwitchName=ibsw1 Nodes=i[001-028]
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
SwitchName=ibsw4 Switches=ibsw[2-3,7]
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]


It would be great if other sites could test this tool on their 
Infiniband network and report bugs or suggest improvements.


--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Ole Holm Nielsen


On 05/09/2017 09:14 AM, Janne Blomqvist wrote:


On 2017-05-07 15:29, Ole Holm Nielsen wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool
"slurmibtopology.sh" for Slurm.


I have also created one, at

https://github.com/jabl/ibtopotool

You need the python networkx library (python-networkx package on centos
& Ubuntu, or install via pip).

Run with --help option to get some usage instructions. In addition to
generating slurm topology.conf, it can also generate graphviz dot files
for visualization.


Thanks for providing this tool to the Slurm community.  It seems that 
tools for generating topology.conf have been developed in many places, 
probably because it's an important task.


I installed python-networkx 1.8.1-12.el7 from EPEL on our CentOS 7.3 
system and then executed ibtopotool.py, but it gives an error message:


# ./ibtopotool.py
Traceback (most recent call last):
  File "./ibtopotool.py", line 216, in 
graph = parse_ibtopo(args[0], options.shortlabels)
IndexError: list index out of range

Could you help solving this?

Thanks,
Ole


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Ole Holm Nielsen


Hi Damien,

Thanks for your positive feedback!  I'll be posting an updated script 
soon which works better for multilevel IB networks.


I would however warn anyone against using the output of 
slurmibtopology.sh directly in an automated procedure.  I recommend 
strongly a manual step as well in a procedure like:


1. Use slurmibtopology.sh to generate topology.conf.  It should show 
correctly the leaf switches and their links to compute nodes.


2. Review your 2nd and 3rd level network topology as discussed in 
https://slurm.schedmd.com/topology.html.  Heed in particular this statement:

As a practical matter, listing every switch connection definitely results in a 
slower scheduling algorithm for Slurm to optimize job placement. The 
application performance may achieve little benefit from such optimization. 
Listing the leaf switches with their nodes plus one top level switch should 
result in good performance for both applications and Slurm.


3. In the generated topology.conf you should select only 1 top-level 
switch and delete the others.


4. Copy the edited topology.conf to your cluster.

/Ole

On 05/08/2017 04:56 PM, Damien François wrote:

Hi

many thanks for the tools, it works flawlessly here.

I just patched to send the output that does not belong to the topology.conf to 
stderr so I could simply redirect to topology.conf in an automated Slurm 
install procedure

16c16
< echo Verify the Infiniband interface: >&2
---

echo Verify the Infiniband interface:

18c18
< if $IBSTAT -l >&2
---

if $IBSTAT -l

20c20
<echo Infiniband interface OK >&2
---

echo Infiniband interface OK

22c22
<echo Infiniband interface NOT OK >&2
---

echo Infiniband interface NOT OK

26c26
< cat <&2
---

cat <

Sincerely

damien


On 07 May 2017, at 14:29, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool 
"slurmibtopology.sh" for Slurm.

I had to create a Slurm topology.conf file and needed an automated way to get 
the correct node and switch Infiniband connectivity.  The manual page 
https://slurm.schedmd.com/topology.conf.html refers to an outdated tool 
ib2slurm.

Inspired by the script in 
https://unix.stackexchange.com/questions/255472/text-processing-building-a-slurm-topology-conf-file-from-ibnetdiscover-output
 I decided to write a simpler and more understandable tool.  It parses the output of the 
OFED command "ibnetdiscover" and generates an initial topology.conf file (which 
you may want to edit for readability).

Download the script https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Example: Running this script on our Infiniband network:

# slurmibtopology.sh
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches:

# IB switch no. 1: MF0;mell02:IS5030/U1
SwitchName=ibsw1 Nodes=i[001-028]
# IB switch no. 2: MF0;mell03:IS5030/U1
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
# IB switch no. 3: MF0;mell01:SX6036/U1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
# IB switch no. 4: MF0;mell04:SX6036/U1
# NOTICE: This switch has no attached nodes (empty hostlist)
SwitchName=ibsw4 Nodes=""
# IB switch no. 5: Mellanox 4036 # volt01
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
# IB switch no. 6: Mellanox 4036 # volt03
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
# IB switch no. 7: Mellanox 4036 # volt04
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
# IB switch no. 8: Mellanox 4036 # volt02
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]
# Merging all switches in a top-level spine switch
SwitchName=spineswitch Switches=ibsw[1-8]

It would be great if other sites could test this tool on their Infiniband 
network and report bugs or suggest improvements.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark




--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Ole Holm Nielsen


On 05/08/2017 08:27 PM, Jeffrey Frey wrote:

The primary problem I've had with ib2slurm is that it segfaults.  There's a bug 
in the ibnetdiscover library -- ib2slurm passes a NULL config pointer to the 
ibnd_discover_fabric() which is supposed to be okay according to the 
documentation, but that function actually requires a config structure to be 
passed to it.


I've forked and updated ib2slurm:


https://github.com/jtfrey/ib2slurm


There doesn't appear to be much movement on the original project, so I haven't 
put in a pull request against my fork.


It's great that you try to revive the outdated ib2slurm project!  The 
Slurm pages on topology.conf should point to your project in stead of 
the seemingly dead ib2slurm.


That said, the concept of ib2slurm seems to be that you use 
ibnetdiscover to generate a cache file which is subsequently parsed by 
ib2slurm.  My slurmibtopology.sh script uses the same idea: Parse the 
output from ibnetdiscover and generate a topology.conf file.  For me 
personally it was easier to use awk than C for this task because awk 
supports associative arrays.


/Ole



On May 8, 2017, at 10:55 AM, Damien François <damien.franc...@uclouvain.be> 
wrote:

Hi

many thanks for the tools, it works flawlessly here.

I just patched to send the output that does not belong to the topology.conf to 
stderr so I could simply redirect to topology.conf in an automated Slurm 
install procedure

16c16
< echo Verify the Infiniband interface: >&2
---

echo Verify the Infiniband interface:

18c18
< if $IBSTAT -l >&2
---

if $IBSTAT -l

20c20
<echo Infiniband interface OK >&2
---

echo Infiniband interface OK

22c22
<echo Infiniband interface NOT OK >&2
---

echo Infiniband interface NOT OK

26c26
< cat <&2
---

cat <

Sincerely

damien


On 07 May 2017, at 14:29, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool 
"slurmibtopology.sh" for Slurm.

I had to create a Slurm topology.conf file and needed an automated way to get 
the correct node and switch Infiniband connectivity.  The manual page 
https://slurm.schedmd.com/topology.conf.html refers to an outdated tool 
ib2slurm.

Inspired by the script in 
https://unix.stackexchange.com/questions/255472/text-processing-building-a-slurm-topology-conf-file-from-ibnetdiscover-output
 I decided to write a simpler and more understandable tool.  It parses the output of the 
OFED command "ibnetdiscover" and generates an initial topology.conf file (which 
you may want to edit for readability).

Download the script https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Example: Running this script on our Infiniband network:

# slurmibtopology.sh
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches:

# IB switch no. 1: MF0;mell02:IS5030/U1
SwitchName=ibsw1 Nodes=i[001-028]
# IB switch no. 2: MF0;mell03:IS5030/U1
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
# IB switch no. 3: MF0;mell01:SX6036/U1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
# IB switch no. 4: MF0;mell04:SX6036/U1
# NOTICE: This switch has no attached nodes (empty hostlist)
SwitchName=ibsw4 Nodes=""
# IB switch no. 5: Mellanox 4036 # volt01
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
# IB switch no. 6: Mellanox 4036 # volt03
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
# IB switch no. 7: Mellanox 4036 # volt04
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
# IB switch no. 8: Mellanox 4036 # volt02
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]
# Merging all switches in a top-level spine switch
SwitchName=spineswitch Switches=ibsw[1-8]

It would be great if other sites could test this tool on their Infiniband 
network and report bugs or suggest improvements.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-07 Thread Ole Holm Nielsen


I'm announcing an initial version 0.1 of an Infiniband topology tool 
"slurmibtopology.sh" for Slurm.


I had to create a Slurm topology.conf file and needed an automated way 
to get the correct node and switch Infiniband connectivity.  The manual 
page https://slurm.schedmd.com/topology.conf.html refers to an outdated 
tool ib2slurm.


Inspired by the script in 
https://unix.stackexchange.com/questions/255472/text-processing-building-a-slurm-topology-conf-file-from-ibnetdiscover-output 
I decided to write a simpler and more understandable tool.  It parses 
the output of the OFED command "ibnetdiscover" and generates an initial 
topology.conf file (which you may want to edit for readability).


Download the script https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Example: Running this script on our Infiniband network:

# slurmibtopology.sh
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches:

# IB switch no. 1: MF0;mell02:IS5030/U1
SwitchName=ibsw1 Nodes=i[001-028]
# IB switch no. 2: MF0;mell03:IS5030/U1
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
# IB switch no. 3: MF0;mell01:SX6036/U1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
# IB switch no. 4: MF0;mell04:SX6036/U1
# NOTICE: This switch has no attached nodes (empty hostlist)
SwitchName=ibsw4 Nodes=""
# IB switch no. 5: Mellanox 4036 # volt01
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
# IB switch no. 6: Mellanox 4036 # volt03
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
# IB switch no. 7: Mellanox 4036 # volt04
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
# IB switch no. 8: Mellanox 4036 # volt02
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]
# Merging all switches in a top-level spine switch
SwitchName=spineswitch Switches=ibsw[1-8]

It would be great if other sites could test this tool on their 
Infiniband network and report bugs or suggest improvements.


--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-03 Thread Ole Holm Nielsen


Sorry, the HTTP URL is http://ftp.fysik.dtu.dk/Slurm/pestat

On 05/03/2017 05:53 PM, Ole Holm Nielsen wrote:

On 05/03/2017 04:44 PM, Andrej Prsa wrote:



I'll be expanding the functionality of pestat over time, so please
send me comments and bug reports.


Thanks for sharing! I had to change the hardcoded paths, so perhaps you
should make the paths variables at the top of the script or look for
the sinfo & friends to assign the path.


Thanks for the suggestion.  I have now released pestat version 0.34 at
ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat or
http://ftp.fysik.dtu.dk/pub/Slurm/pestat

The command path PREFIX=/usr/bin has been defined at the top of the
script where it can be easily redefined.

/Ole


[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-03 Thread Ole Holm Nielsen


On 05/03/2017 04:44 PM, Andrej Prsa wrote:



I'll be expanding the functionality of pestat over time, so please
send me comments and bug reports.


Thanks for sharing! I had to change the hardcoded paths, so perhaps you
should make the paths variables at the top of the script or look for
the sinfo & friends to assign the path.


Thanks for the suggestion.  I have now released pestat version 0.34 at
ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat or 
http://ftp.fysik.dtu.dk/pub/Slurm/pestat


The command path PREFIX=/usr/bin has been defined at the top of the 
script where it can be easily redefined.


/Ole


[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread Ole Holm Nielsen


Hi John,

Thanks for your request for HTTP access.  I've configured our web-server 
for providing the FTP files via HTTP also, please see:


http://ftp.fysik.dtu.dk/Slurm/

Does that work for you?

/Ole

On 05/03/2017 12:02 PM, John Hearns wrote:

Ole,
a small ask: is it possible to put the 'pestat' utility for Slurm and
for PBS on a site which uses http?
The reason is that many (most?) corporate networks block ftp access.

Thank you


On 3 May 2017 at 09:06, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk
<mailto:ole.h.niel...@fysik.dtu.dk>> wrote:


I'm announcing an updated version 0.30 of the node status tool
"pestat" for Slurm.

Download the tool (a short bash script) from
ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat
<ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat>

New options have been added as shown by the help information:

# pestat -h
Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-f] [-V] [-h]
where:
-p partition: Select only partition <partition>
-u username: Print only user <username>
-q qoslist: Print only QOS in the qoslist <qoslist>
-f: Print only nodes that are flagged by * (unexpected load etc.)
-h: Print this help information
-V: Version information

I use "pestat -f" all the time because it prints and flags (in
color) only the nodes which have an unexpected CPU load or node status.


[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread Ole Holm Nielsen


I'm announcing an updated version 0.30 of the node status tool "pestat" 
for Slurm.


Download the tool (a short bash script) from 
ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat


New options have been added as shown by the help information:

# pestat -h
Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-f] [-V] [-h]
where:
-p partition: Select only partition <partition>
-u username: Print only user <username>
-q qoslist: Print only QOS in the qoslist <qoslist>
-f: Print only nodes that are flagged by * (unexpected load etc.)
-h: Print this help information
-V: Version information

I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status.


--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-03 Thread Ole Holm Nielsen


On 05/03/2017 08:47 AM, Bjørn-Helge Mevik wrote:

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:


I'm announcing an initial version 0.1 of the node status tool "pestat" for
Slurm.


Interesting tool!


Thanks!


This tool needs to expand Slurm hostlists like a[095,097-098] into
a095,a097,a098, so I found the nice tool
https://www.nsc.liu.se/~kent/python-hostlist/ by Kent Engström at NSC. It's
simple to install this as an RPM package, see
https://wiki.fysik.dtu.dk/niflheim/SLURM#expanding-host-lists


For the simple case you show, you could just use

$ scontrol show hostnames a[095,097-098]
a095
a097
a098


Thanks for the suggestion.  Another colleague also told me, so the 
current tool at ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat now uses 
"scontrol show hostnames" without the need for python-hostlist (which is 
still a nice and very general tool).


/Ole


[slurm-dev] Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-02 Thread Ole Holm Nielsen


I'm announcing an initial version 0.1 of the node status tool "pestat" 
for Slurm.


For Torque clusters I wrote the tool "pestat" (available in 
ftp://ftp.fysik.dtu.dk/pub/Torque/) and we use it all the time, so I 
think there's a need for this kind of tool for Slurm as well.  Slurm 
commands don't offer such information directly, and you need to combine 
the output of several Slurm commands.


The output from pestat currently looks like:

Hostname   Partition     Node Num_CPU  CPUload  Memsize  Joblist
                        State Use/Tot           (MB)     JobId User ...
a001       xeon8*        alloc   8   8     8.01    23900  48451 user1
a002       xeon8*        alloc   8   8     8.05    23900  45937 user2
a003       xeon8*        alloc   8   8     8.01    23900  47784 user3
...

You can also do "pestat -p partition" to select partitions.

Download the tool (a short bash script) from 
ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat


This tool needs to expand Slurm hostlists like a[095,097-098] into 
a095,a097,a098, so I found the nice tool 
https://www.nsc.liu.se/~kent/python-hostlist/ by Kent Engström at NSC. 
It's simple to install this as an RPM package, see 
https://wiki.fysik.dtu.dk/niflheim/SLURM#expanding-host-lists


I'll be expanding the functionality of pestat over time, so please send 
me comments and bug reports.


My ToDo list would include:

1. Report current node memory usage of jobs.
2. Flag jobs that abuse or under-utilize resources.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: New slurm user question.

2017-05-01 Thread Ole Holm Nielsen


I've been missing such a node status tool for Slurm for a long time!
For Torque clusters I wrote the tool "pestat" (available in 
ftp://ftp.fysik.dtu.dk/pub/Torque/) and we use it all the time.


Here's my quick stab at writing a "pestat" tool for Slurm: 
ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat


I'll announce "pestat" separately on this list.  It requires you to 
install a host-expression expansion tool from 
https://www.nsc.liu.se/~kent/python-hostlist/


/Ole

On 04/27/2017 11:11 PM, Vicker, Darby (JSC-EG311) wrote:

I don't think there is a one liner slurm command to do this.  Even the 
"pbsnodes" wrapper that comes with slurm punts on this:

   # find job(s) on node
   my $jobs;

   if ( $state eq "allocated" ) {
       # how to get list of jobs on node efficiently?
   }

   # this isn't really defined in SLURM, so I am not sure how to get it


We have a perl script that runs "scontrol show jobs" to get the list of nodes each job is 
running on and then combines that with a "scontrol show nodes" output to get what you 
describe.

# nodeinfo.pl

                 CPU's
     Node      use/tot  state          reserv  job info
 ------------  -------  -------------  ------  --------------
   r1i0n0   12/12 job-exclusive - 45360,gsalaza3
   r1i0n1   12/12 job-exclusive - 45360,gsalaza3
   r1i0n2   12/12 job-exclusive - 45360,gsalaza3
   r1i0n3   12/12 job-exclusive - 45360,gsalaza3
   r1i0n40/12  idle -
   r1i0n50/12  idle -
.
.
.







-Original Message-
From: Mark London 
Reply-To: slurm-dev 
Date: Thursday, April 27, 2017 at 3:18 PM
To: slurm-dev 
Subject: [slurm-dev] New slurm user question.


Hi - I want a method to create a list all the cluster nodes, with one
node line per line, and include the job(s) running, for each node.  I
haven't quite found a command that will do this.   Before I write a
script to do this, I was wondering if someone else has already done
this.   Or whether there is a command to do it, that I'm not aware of.
Thanks very much.

Mark London


[slurm-dev] Re: Nodes in state 'down*' despite slurmd running

2017-04-05 Thread Ole Holm Nielsen


On 04/05/2017 03:59 PM, Loris Bennett wrote:

We are running 16.05.10-2 with power-saving.  However, we have noticed a
problem recently when nodes are woken up in order to start a job.  The
node will go from 'idle~' to, say, 'mixed#', but then the job will fail
and the node will be put in 'down*'.  We have turned up the log level to
'debug' with the DebugFlag 'Power', but this hasn't produced anything
relevant.  The problem is, however, resolved if the node is rebooted.

Thus, there seems to be some disturbance of the communication between
the slurmd on the woken node and the slurmctd on the administration
node.  Does anyone have any idea what might be going on?


We have seen something similar with Slurm 16.05.10.

How many nodes are in your network?  If there are more than about 400 
devices in the network, you must tune the kernel ARP cache of the 
slurmctld server, see 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
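In short, the tuning amounts to raising the kernel's neighbour table 
thresholds on the slurmctld server, for example (illustrative values; see 
the Wiki page for the reasoning):

# /etc/sysctl.d/arp.conf (illustrative thresholds)
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768
# Apply with: sysctl --system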


/Ole


[slurm-dev] TaskProlog script examples?

2017-03-28 Thread Ole Holm Nielsen


We need to set up the job environment beyond what's inherited from the 
job submission on the login nodes.  The Slurm TaskProlog script example 
in https://slurm.schedmd.com/faq.html#task_prolog is the only example 
I've been able to find on the net.


Question: Does anyone have some good examples of TaskProlog and 
TaskEpilog scripts which they can share with the community?


Added information: We want to set up an environment variable CPU_ARCH 
taking hard-coded text values such as "broadwell", "haswell", etc.  On 
the login nodes we do this with a script in /etc/profile.d/ but this is 
ignored in tasks started by slurmd.
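A minimal TaskProlog sketch for this (the CPU model matching below is
only an illustration; per the Slurm FAQ, lines of the form "export
NAME=value" written to stdout by the TaskProlog are added to the task's
environment):

  #!/bin/bash
  # TaskProlog: runs as the job user before each task is started
  case "$(grep -m1 'model name' /proc/cpuinfo)" in
      *"E5-26"*" v4"*) echo "export CPU_ARCH=broadwell" ;;
      *"E5-26"*" v3"*) echo "export CPU_ARCH=haswell" ;;
      *)               echo "export CPU_ARCH=unknown" ;;
  esac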


Other useful TaskProlog tasks could be to set up scratch directories for 
jobs and wipe them again in TaskEpilog.  Does anyone have good scripts 
for this?
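In the same spirit, a rough sketch of scratch handling (the /scratch
path is just an example; note that TaskProlog/TaskEpilog run per task,
so for multi-task jobs the cleanup may be better placed in the node
Epilog):

  # TaskProlog fragment: create a per-job scratch directory and export its path
  SCRATCH=/scratch/$SLURM_JOB_ID
  mkdir -p "$SCRATCH"
  echo "export SCRATCH=$SCRATCH"

  # TaskEpilog fragment: clean up again
  rm -rf "/scratch/$SLURM_JOB_ID"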


Thanks a lot,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] slurm-dev Re: check_fs_mount error on nodes

2017-02-24 Thread Ole Holm Nielsen


Hi Yi,

You should run nhc-genconf only once for each type of node, to create 
an initial nhc.conf as a starting point, not at every node 
reinstallation.


You must then tailor nhc.conf to do only the checks that you find 
relevant for the given nodes.  See an example in 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check


We create a global nhc.conf file with node name patterns selecting 
which tests to run on which nodes.  The /etc/nhc/nhc.conf file is then 
distributed by rsync, just like we distribute our Slurm config files and 
other stuff.
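A minimal sketch of such a push (node names taken from sinfo; paths as
above):

  for host in $(sinfo -N -h -o "%N" | sort -u); do
      rsync -a /etc/nhc/nhc.conf ${host}:/etc/nhc/
  done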


/Ole

On 02/24/2017 12:44 AM, Yi Sun wrote:

Thanks very much for your reply.

My NHC version is 1.4.2, and it is the correct RPM for my CentOS 7.2.

I have the following lines in my nhc.conf but there's nothing related
to /run/user/1000, so I am a bit confused. If I set the node back to idle,
it keeps coming back to drained after some time.

Thanks,
Yi

 testnode1 || check_fs_mount_rw -t "ext4" -s "/dev/vda1" -f "/"
 testnode1 || check_fs_mount_rw -t "sysfs" -s "sysfs" -f "/sys"
 testnode1 || check_fs_mount_rw -t "proc" -s "proc" -f "/proc"
 testnode1 || check_fs_mount_rw -t "devtmpfs" -s "devtmpfs" -f "/dev"
 devlogin0 || check_fs_mount_rw -t "securityfs" -s "securityfs" -f
"/sys/kernel/security"
 testnode1 || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/dev/shm"
 testnode1 || check_fs_mount_rw -t "devpts" -s "devpts" -f "/dev/pts"
 testnode1 || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/run"
 testnode1 || check_fs_mount_ro -t "tmpfs" -s "tmpfs" -f
"/sys/fs/cgroup"
 testnode1 || check_fs_mount_rw -t "pstore" -s "pstore" -f
"/sys/fs/pstore"
 testnode1 || check_fs_mount_rw -t "configfs" -s "configfs" -f
"/sys/kernel/config"
 testnode1 || check_fs_mount_rw -t "debugfs" -s "debugfs" -f
"/sys/kernel/debug"
 testnode1 || check_fs_mount_rw -t "hugetlbfs" -s "hugetlbfs" -f
"/dev/hugepages"
 testnode1 || check_fs_mount_rw -t "mqueue" -s "mqueue" -f "/dev/mqueue"


On Thu, 2017-02-23 at 00:35 -0800, Ole Holm Nielsen wrote:

On 02/23/2017 02:24 AM, Yi Sun wrote:

Hi,
I'm trying to add new compute nodes to existing slurm cluster.
After this, I run 'sinfo -R' on server node, the newly added nodes
showed 'NHC:check_fs_mount' and the status is drained.

I then looked at the log, it says something about /run/user/1000 is not
mounted on these new nodes. I'm a bit new to this and I'm not sure what
is happening. If I simply run scontrol update and set these new nodes
state to Resume, the status will go back to idle. Do I need to worry
about this 'check_fs_mount' issue?


You have wisely chosen to enable the Node Health Check (NHC) in Slurm.
For further information see my Wiki
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check.
  Please make sure that you have installed the latest NHC version 1.4.2,
and that it's the correct RPM package for your Linux version.

You do need to configure NHC appropriately for your servers, however, so
check the file /etc/nhc/nhc.conf.  Which 'check_fs_mount' lines are in
your nhc.conf?  Which Linux OS do you use?  Which NHC version do you use?

/Ole


[slurm-dev] Re: check_fs_mount error on nodes

2017-02-23 Thread Ole Holm Nielsen


On 02/23/2017 02:24 AM, Yi Sun wrote:

Hi,
I'm trying to add new compute nodes to existing slurm cluster.
After this, I run 'sinfo -R' on server node, the newly added nodes
showed 'NHC:check_fs_mount' and the status is drained.

I then looked at the log, it says something about /run/user/1000 is not
mounted on these new nodes. I'm a bit new to this and I'm not sure what
is happening. If I simply run scontrol update and set these new nodes
state to Resume, the status will go back to idle. Do I need to worry
about this 'check_fs_mount' issue?


You have wisely chosen to enable the Node Health Check (NHC) in Slurm. 
For further information see my Wiki 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check. 
 Please make sure that you have installed the latest NHC version 1.4.2, 
and that it's the correct RPM package for your Linux version.


You do need to configure NHC appropriately for your servers, however, so 
check the file /etc/nhc/nhc.conf.  Which 'check_fs_mount' lines are in 
your nhc.conf?  Which Linux OS do you use?  Which NHC version do you use?


/Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,


[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ole Holm Nielsen
We limit CPU time in /etc/security/limits.conf so that user processes get 
a maximum of 10 minutes.  It doesn't eliminate the problem completely, but 
it's fairly effective on users who misunderstood the role of login nodes.
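A sketch of what such entries can look like (the group name is just an
example; the "cpu" limit in limits.conf is given in minutes):

  # /etc/security/limits.conf on the login nodes
  @users   soft   cpu   10
  @users   hard   cpu   20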



On Thu, Feb 9, 2017 at 6:38 PM +0100, "Jason Bacon" wrote:




We simply make it impossible to run computational software on the head
nodes.

1. No scientific software packages are installed on the local disk.
2. Our NFS-mounted application directory is mounted with noexec.

Regards,

 Jason

On 02/09/17 07:09, John Hearns wrote:
>
> Does anyone have a good suggestion for this problem?
>
> On a cluster I am implementing I noticed a user is running a code on
> 16 cores, on one of the login nodes, outside the batch system.
>
> What are the accepted techniques to combat this? Other than applying a
> LART, if you all know what this means.
>
> On one system I set up a year or so ago I was asked to implement a
> shell timeout, so if the user was idle for 30 minutes they would be
> logged out.
>
> This actually is quite easy to set up as I recall.
>
> I guess in this case as the user is connected to a running process
> then they are not ‘idle’.
>


--
Earth is a beta site.



[slurm-dev] Slurm daemons started incorrectly on CentOS/RHEL 7 (Systemd systems)

2017-01-12 Thread Ole Holm Nielsen


Sites installing the Slurm RPM packages on CentOS 7, RHEL 7, and other 
Systemd-based systems will experience a bug in the start-up of the Slurm 
daemons.  Several sites have made the same discovery, but I would like 
to make all affected sites aware of the problem and a suggested workaround.


We discovered that the slurmd daemon didn't set the configured 
LimitMEMLOCK ulimits, causing failures in jobs using Infiniband and 
OmniPath fabrics.  This was tracked down to slurmd being started in 
error by the System V style init-script /etc/init.d/slurm.  The correct 
behavior on Systemd-based systems is to start the slurmd service using 
the /usr/lib/systemd/system/slurmd.service as installed by the slurm RPM 
package.


The best workaround we have come up with for CentOS/RHEL 7 is to 
completely disable the slurm init-script, which was inadvertently 
installed, as follows:


chkconfig --del slurm
rm -f /etc/init.d/slurm

This must be repeated every time Slurm is updated.

You must subsequently ensure that the Slurm services are started 
correctly using systemctl on the master, database, and compute nodes. 
The slurmd.service file correctly sets the ulimits.  We have some notes 
about this in our Wiki: 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#starting-slurm-daemons-at-boot-time
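In short, after removing the init script the daemons should be enabled
and started with systemctl, for example:

  # on the compute nodes
  systemctl enable slurmd.service
  systemctl start slurmd.service
  # correspondingly slurmctld.service on the master and slurmdbd.service
  # on the database server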


The above bug is due to the slurm.spec file inadvertently configuring 
the init-script in the post-install phase, and it has been reported in 
https://bugs.schedmd.com/show_bug.cgi?id=3371.  It is expected to be 
resolved in Slurm 17.11.


Ole Holm Nielsen
Technical University of Denmark


[slurm-dev] Re: mail job status to user

2017-01-09 Thread Ole Holm Nielsen


On 01/10/2017 12:46 AM, Christopher Samuel wrote:


On 10/01/17 09:36, Steven Lo wrote:


Torque/Maui has the ability to mail user about the job status
automatically when job exit.  Does Slurm has the same feature without
using the SBATCH command in the job submission?


Unless Torque has changed massively since I last used it you have to
request those email status messages with directives to qsub in a similar
manner to the way you need to with sbatch (in both cases it's
effectively just embedding command line options into the job script).

I guess I'm not sure why you think it's different with Slurm?


For the record: Torque will always send mail if a job is aborted, and 
for additional mails you may put such a line in your Torque script:

#PBS -m abe
which is quite similar to Slurm:
#SBATCH --mail-type=ALL
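The destination address can be given explicitly as well; a minimal
sbatch header sketch (the address is just a placeholder):

#SBATCH --mail-type=ALL
#SBATCH --mail-user=someone@example.com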

See the Torque qsub manual at 
http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/commands/qsub.htm


/Ole


[slurm-dev] Re: preemptive fair share scheduling

2017-01-06 Thread Ole Holm Nielsen


On 01/06/2017 01:07 PM, Sophana Kok wrote:

Hi again, is there someone from schedmd to respond? On the contact form,
it is written to contact the mailing list for technical questions.


If the Slurm community doesn't get you going, you need to buy commercial 
support, see https://www.schedmd.com/support.php, if you want SchedMD to 
support you.


/Ole


[slurm-dev] Re: Two partitions with same compute nodes

2016-11-29 Thread Ole Holm Nielsen


On 11/29/2016 12:27 PM, Daniel Ruiz Molina wrote:

I would like to know whether it is possible in SLURM to configure two
partitions composed of the same nodes, one for use with GPUs and the
other only for OpenMPI. This configuration was allowed in Sun Grid
Engine because the GPU resource was assigned to the queue and to the
compute node, but in SLURM I have only found a way of assigning a GPU
resource to a compute node, independently of whether that node belongs
to partition X or partition Y.


My 2 cents: I'm pretty sure that Slurm permits any node to be a member 
of one or more partitions.  Just define the nodes in slurm.conf as members 
of all the PartitionName entries which you require.
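A minimal slurm.conf sketch of that idea (node names, CPU count and the
GPU Gres are just examples):

  NodeName=node[001-016] CPUs=24 Gres=gpu:4 State=UNKNOWN
  PartitionName=gpu Nodes=node[001-016] Default=NO  State=UP
  PartitionName=mpi Nodes=node[001-016] Default=YES State=UP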


/Ole


[slurm-dev] Re: PIDfile on CentOS7 and compute nodes

2016-11-28 Thread Ole Holm Nielsen


On 11/28/2016 06:51 PM, Andrus, Brian Contractor wrote:

I am building the RPMs on CentOS7. I merely do: rpmbuild -tb
slurm-16.05.6.tar.bz2

I do see the resulting rpm has both the init file and the unit files:



# rpm -qlp ../RPMS/x86_64/slurm-16.05.6-1.el7.centos.x86_64.rpm | egrep "init.d|service$"
/etc/init.d/slurm
/usr/lib/systemd/system/slurmctld.service
/usr/lib/systemd/system/slurmd.service


Yes, and you start the slurmctld service on the master node by:
systemctl enable slurmctld.service
systemctl start slurmctld.service

/Ole


-----Original Message-----
From: Ole Holm Nielsen [mailto:ole.h.niel...@fysik.dtu.dk]
Sent: Sunday, November 27, 2016 11:12 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: PIDfile on CentOS7 and compute nodes

Hi Brian,

Did you build and install the Slurm RPMs on CentOS 7, or is it a manual
install?  Which Slurm and CentOS versions do you run?  We run Slurm
16.05 on CentOS 7, see instructions in our Wiki
https://wiki.fysik.dtu.dk/niflheim/SLURM

/Ole

On 11/25/2016 05:04 PM, Andrus, Brian Contractor wrote:

All,

I have been having an issue where if I try to run the slurm daemon
under systemd, it hangs for some time and then errors out with:

systemd[1]: Starting LSB: slurm daemon management...
systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
systemd[1]: slurm.service: control process exited, code=exited status=203
systemd[1]: Failed to start LSB: slurm daemon management.
systemd[1]: Unit slurm.service entered failed state.
systemd[1]: slurm.service failed.

Now it does actually start and is running when I do a 'ps'.

So I DID figure out a work-around, which, for now, I will code for
changing the scripts.

If I remove the lines from the /etc/init.d/slurm file:

# processname: /usr/sbin/slurmctld
# pidfile: /var/run/slurmctld.pid

Then systemd is happy running just slurm.

Not sure what the appropriate fix is for this, but that is a
work-around that seems effective.


[slurm-dev] Re: PIDfile on CentOS7 and compute nodes

2016-11-28 Thread Ole Holm Nielsen


On 11/28/2016 09:36 AM, Janne Blomqvist wrote:


On 2016-11-25 18:03, Andrus, Brian Contractor wrote:

All,

I have been having an issue where if I try to run the slurm daemon under
systemd, it hangs for some time and then errors out with:


If you're using rpm's built using the rpm spec file in the official
sources, it has a bug where the old init.d slurm script is included.


You should use the systemd services instead, slurmctld.service for
slurmctld on the master node, slurmdbd.service for slurmdbd, and
slurmd.service for slurmd on compute nodes.


AFAIK, the slurmctld.service included in the Slurm 16.05 version works 
without problems on CentOS 7.2.  There was a bug in Slurm 14.x regarding 
systemd files, but that was fixed in Slurm 15.08.


/Ole


[slurm-dev] Re: start munge again after boot?

2016-11-07 Thread Ole Holm Nielsen


On 11/07/2016 11:21 PM, Lachlan Musicman wrote:

Peixin,

What operating system are you using? I found on Centos 7 I needed to
create a tmpfile.d entry to make sure that the /var/run/munge was
created correctly on boot every time.

https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html
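For reference, such an entry is a one-liner (mode and ownership assumed
from the standard munge packaging):

  # /etc/tmpfiles.d/munge.conf
  d /var/run/munge 0755 munge munge -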


On CentOS 7 we don't need any hacks like that for Munge to start 
correctly.  Make sure you install the Munge RPMs from the EPEL 
repository, see https://wiki.fysik.dtu.dk/niflheim/SLURM


/Ole


[slurm-dev] Re: Requirement of no firewall on compute nodes?

2016-10-28 Thread Ole Holm Nielsen


Hi Neile,

I agree that you can run a firewall to block off all non-cluster nodes.
The requirement is that, between compute nodes, all ports must be open 
in the firewall (in case you use one).
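One way to achieve that with firewalld, while still blocking the rest of
the network, is to put the cluster subnet into the trusted zone (the
subnet is only an example):

  firewall-cmd --permanent --zone=trusted --add-source=10.1.0.0/16
  firewall-cmd --reload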


/Ole

On 10/27/2016 05:11 PM, Neile Havens wrote:



Can anyone confirm that Moe's statement is still valid with the current
Slurm version?



Conclusion: Compute nodes must have their Linux firewall disabled.


FWIW, I still run a firewall on my compute nodes.  The firewall is open to any 
traffic from other compute nodes or the head node, but blocks traffic from 
elsewhere on our network (unfortunately, we don't have a dedicated network for 
our cluster environment).  Here are my notes from my install of SLURM 16.05 on 
CentOS 7.

  - head node
- NOTE: port 6817/tcp is for slurmctld, port 6819/tcp is for slurmdbd
- NOTE: opening to anything from cluster nodes, so that srun works (per Moe 
Jette's comment in the link you sent)
- sudo firewall-cmd --add-rich-rule='rule family="ipv4" source 
address="a.b.c.d/XX" accept'
- sudo firewall-cmd --runtime-to-permanent
  - compute nodes
- NOTE: port 6818/tcp is for slurmd
- NOTE: opening to anything from cluster nodes makes it simpler to work 
with MPI, although it
  should be possible to configure specific port ranges in 
/etc/openmpi-x86_64/openmpi-mca-params.conf
- sudo firewall-cmd --add-rich-rule='rule family="ipv4" source 
address="a.b.c.d/XX" accept'
- sudo firewall-cmd --runtime-to-permanent


[slurm-dev] Re: Requirement of no firewall on compute nodes?

2016-10-28 Thread Ole Holm Nielsen


Hi Chris,

Unfortunately, it isn't sufficient to open the Slurm port 6818 (I had 
already done that).  When tasks are started from the job's master node 
on slave nodes, unknown ports will be used by srun, so you must open all 
ports in your Linux firewall to all other compute nodes in the cluster.


It would be nice if someone could document which TCP port ranges are 
actually required to be opened in the firewall.  Might it just be the 
ephemeral ports 49152 to 65535, for example?
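One slurm.conf knob that at least confines srun's dynamic ports to a
known range is SrunPortRange; a sketch (the range is only an example and
must be large enough for the expected number of concurrent srun
invocations):

  SrunPortRange=60001-63000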


Thanks,
Ole

On 10/27/2016 05:13 PM, Christopher Benjamin Coffey wrote:

Hi Ole,

I don’t see a reason for a firewall to exist on a compute node, is it a 
requirement on your new cluster?  If not, disable it.  I don’t see Moe’s 
statement as saying that you can’t have a firewall, just that if there is one, 
you should open it up to allow all slurm communication.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 10/27/16, 5:58 AM, "Ole Holm Nielsen" <ole.h.niel...@fysik.dtu.dk> wrote:


In the process of developing our new cluster using Slurm, I've been
bitten by the firewall settings on the compute nodes preventing MPI jobs
from spawning tasks on remote nodes.

I now believe that Slurm actually has a requirement that compute nodes
must have their Linux firewall disabled.  I haven't been able to find
any hint of this requirement in the official Slurm documentation.  I did
find an old slurm-devel posting by Moe Jette (pretty authoritative!) in 2010
   https://groups.google.com/forum/#!topic/slurm-devel/wOHcXopbaXw
saying:

> Other communications (say between srun and the spawned tasks) are intended
> to operate within a cluster and have no port restrictions. If there is a
> firewall between nodes in your cluster (at least as a "cluster" is
> configured in SLURM), then logic would need to be added to SLURM to
> provide the functionality you describe.

Can anyone confirm that Moe's statement is still valid with the current
Slurm version?

Conclusion: Compute nodes must have their Linux firewall disabled.


[slurm-dev] Re: slurm network address problem ?

2016-10-27 Thread Ole Holm Nielsen


You might want to check out my Wiki-page for setting up Slurm on CentOS 
7.2: https://wiki.fysik.dtu.dk/niflheim/SLURM.

Perhaps you'll solve the problem using this information?

On 10/27/2016 04:14 PM, Mikhail Kuzminsky wrote:

I worked w/PBS and SGE; now I'm a beginner w/slurm, and installed
slurm-16.05.5 on my home PC/x86-64 under CentOS 7.2 1511 (desktop
installation).

1) Munge isn't necessary for my one PC w/slurm, but (after the standard
rpmbuild) the directory /usr/lib64/slurm doesn't have the auth_none.so
plugin, so AuthType=auth/none in slurm.conf can't work; I use
AuthType=auth/munge instead.

2) Both slurmd and slurmctld work on "myhome1" node.
But "scontrol show nodes" informs that:
...
 Reason=NO NETWORK ADDRESS FOUND [slurm@2016-10-16T10:49:05]

Therefore any my srun's say
srun: Required node not available (down, drained or reserved)
srun: job NN queued and waiting for resources


I tried 3 variants of NodeName in slurm.conf : w/o NodeAddr and w/use
192.168.1.10 or even 127.0.0.1
My current string in slurm.conf is:
NodeName=myhome1 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1
NodeAddr=192.168.1.10 State=UNKNOWN


Statical IP w/192.168.0.10 local address is used on my PC.
/etc/hosts is:
127.0.0.1   localhost localhost.localdomain localhost4
localhost4.localdomain4
::1 localhost localhost.localdomain localhost6
localhost6.localdomain6
192.168.1.10 myhome1.ru myhome1

/etc/sysconfig/network-scripts/ifcfg-enp8s0   contains:
TYPE="Ethernet"
BOOTPROTO="static"
NAME="enp8s0"
DEVICE="enp8s0"
IPADDR="192.168.1.10"
GATEWAY="192.168.1.1"
DEFROUTE="yes"
ONBOOT="yes"
PREFIX="24"
etc

External connection w/global Internet is realized via the same "enp8s0"
and home router at 192.168.1.1

What should I do to get a working network address for myhome1 to use w/slurm?


[slurm-dev] Requirement of no firewall on compute nodes?

2016-10-27 Thread Ole Holm Nielsen


In the process of developing our new cluster using Slurm, I've been 
bitten by the firewall settings on the compute nodes preventing MPI jobs 
from spawning tasks on remote nodes.


I now believe that Slurm actually has a requirement that compute nodes 
must have their Linux firewall disabled.  I haven't been able to find 
any hint of this requirement in the official Slurm documentation.  I did 
find an old slurm-devel posting by Moe Jette (pretty authoritative!) in 2010

  https://groups.google.com/forum/#!topic/slurm-devel/wOHcXopbaXw
saying:


Other communications (say between srun and the spawned tasks) are intended
to operate within a cluster and have no port restrictions. If there is a
firewall between nodes in your cluster (at least as a "cluster" is
configured in SLURM), then logic would need to be added to SLURM to
provide the functionality you describe.


Can anyone confirm that Moe's statement is still valid with the current 
Slurm version?


Conclusion: Compute nodes must have their Linux firewall disabled.

Thanks,
Ole


[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-27 Thread Ole Holm Nielsen


On 10/27/2016 09:42 AM, Loris Bennett wrote:

So is restarting slurmctld the only way to let it pick up changes in slurm.conf?


No.  You can also do

  scontrol reconfigure

This does not restart slurmctld.


Question: How are the slurmd daemons notified about the changes in 
slurm.conf?  Will slurmctld force the slurmd daemons to reread 
slurm.conf?  There is no corresponding "slurmd reconfigure" command.


Question: If the cgroups.conf file is changed, will that also be picked 
up by a scontrol reconfigure?


Thanks,
Ole


[slurm-dev] Re: Packaging for fedora (and EPEL)

2016-10-17 Thread Ole Holm Nielsen


FWIW, I've documented how to install Slurm 16.05 on CentOS 7.2 in this 
Wiki page: https://wiki.fysik.dtu.dk/niflheim/SLURM


/Ole

On 10/17/2016 09:48 AM, Andrew Elwell wrote:

I see from https://bugzilla.redhat.com/show_bug.cgi?id=1149566 that
there have been a few unsuccessful attempts to get slurm into fedora
(and potentially EPEL)

Is anyone on this list actively working on it at the moment? I'll
update the bugzilla ticket to prod the last portential packager but
failing that I'm offering to work on it.

My plan is to get 16.05 into fedora, but not into EPEL itself (the
supported life of a given release is just too short to match with the
RHEL timeline), however I'll probably put "unofficial" srpms publicly
available that should meet all the EPEL packaging requirements.

schedmd people - as some of this may involve patches to the spec file
amongst other things, what's the best way to progress this - attach a
diff to something on your bugzilla page rather than git pull req?


[slurm-dev] Re: rpm dependencies in 16.05.5

2016-10-13 Thread Ole Holm Nielsen


I have a Wiki page describing how to install Munge and Slurm on CentOS 
7: https://wiki.fysik.dtu.dk/niflheim/SLURM

I hope this may help.
/Ole

On 10/13/2016 02:38 PM, Andrew Elwell wrote:


Hi folks,

I've just built 16.05.5 into rpms (using the rpmbuild -ta
slurm*.tar.bz2 method) to update a CentOS 7 slurmdbd host.

According to http://slurm.schedmd.com/accounting.html

"Note that SlurmDBD relies upon existing Slurm plugins for
authentication and Slurm sql for database use, but the other Slurm
commands and daemons are not required on the host where SlurmDBD is
installed. Install the slurmdbd, slurm-plugins, and slurm-sql RPMs on
the computer when SlurmDBD is to execute. If you want munge
authentication, which is highly recommended, you will also need to
install the slurm-munge RPM."

so just installing  slurmdbd, slurm-plugins, and slurm-sql works (yum
localinstall), but as expected fails to start:

[2016-10-13T20:19:46.931] error: Couldn't find the specified plugin
name for auth/munge looking at all files
[2016-10-13T20:19:46.931] error: cannot find auth plugin for auth/munge
[2016-10-13T20:19:46.931] error: cannot create auth context for auth/munge
[2016-10-13T20:19:46.931] fatal: Unable to initialize auth/munge
authentication plugin

however it's not possible to cleanly install slurm-munge without slurm:

[root@ae-test01 ~]# yum localinstall
rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm
Loaded plugins: fastestmirror
Examining rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm:
slurm-munge-16.05.5-1.el7.centos.x86_64
Marking rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm
to be installed
Resolving Dependencies
--> Running transaction check
---> Package slurm-munge.x86_64 0:16.05.5-1.el7.centos will be installed
--> Processing Dependency: slurm for package:
slurm-munge-16.05.5-1.el7.centos.x86_64
base                                                    | 3.6 kB  00:00:00
extras                                                  | 3.4 kB  00:00:00
updates                                                 | 3.4 kB  00:00:00
--> Finished Dependency Resolution
Error: Package: slurm-munge-16.05.5-1.el7.centos.x86_64
(/slurm-munge-16.05.5-1.el7.centos.x86_64)
   Requires: slurm
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest


[root@ae-test01 ~]# yum localinstall
rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm
rpmbuild/RPMS/x86_64/slurm-16.05.5-1.el7.centos.x86_64.rpm
Loaded plugins: fastestmirror
Examining rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm:
slurm-munge-16.05.5-1.el7.centos.x86_64
Marking rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm
to be installed
Examining rpmbuild/RPMS/x86_64/slurm-16.05.5-1.el7.centos.x86_64.rpm:
slurm-16.05.5-1.el7.centos.x86_64
Marking rpmbuild/RPMS/x86_64/slurm-16.05.5-1.el7.centos.x86_64.rpm to
be installed
Resolving Dependencies
--> Running transaction check
---> Package slurm.x86_64 0:16.05.5-1.el7.centos will be installed
---> Package slurm-munge.x86_64 0:16.05.5-1.el7.centos will be installed
--> Finished Dependency Resolution

Dependencies Resolved


 Package       Arch      Version               Repository                                  Size
Installing:
 slurm         x86_64    16.05.5-1.el7.centos  /slurm-16.05.5-1.el7.centos.x86_64          85 M
 slurm-munge   x86_64    16.05.5-1.el7.centos  /slurm-munge-16.05.5-1.el7.centos.x86_64    44 k

Transaction Summary

Install  2 Packages

Total size: 85 M
Installed size: 85 M
Is this ok [y/d/N]: y


So - is this just a broken spec file that sets unneeded dependencies
or are the docs wrong that you don't need to install slurm?

Andrew



[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Ole Holm Nielsen



On 08/17/2016 03:49 PM, Barbara Krasovec wrote:

I upgraded SLURM from 15.08 to 16.05 without draining the nodes and
without losing any jobs; this was my procedure:

I increased timeouts in slurm.conf:
SlurmctldTimeout=3600
SlurmdTimeout=3600


Question: When you change parameters in slurm.conf, how do you force all 
daemons to reconfigure this?  Can you reload the slurm.conf without 
restarting the daemons (which is what you want to avoid)?



Did mysqldump of slurm database and copied slurmstate dir (just in


What do you mean by "slurmstate"?


case), I increased innodb_buffer_size in my.cnf to 128M, then I followed


The 128 MB seems to be the default!  But in the SLURM accounting page I 
found some recommendations for the MySQL/Mariadb configuration.  How to 
implement these recommendations has now been added to my Wiki:

https://wiki.fysik.dtu.dk/niflheim/SLURM#mysql-configuration
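The settings in question are of this form (the values are examples only;
size the buffer pool to your own database):

  # MariaDB/MySQL configuration fragment, e.g. /etc/my.cnf.d/innodb.cnf
  [mysqld]
  innodb_buffer_pool_size = 1024M
  innodb_log_file_size = 64M
  innodb_lock_wait_timeout = 900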


the instructions on slurm page:


Shutdown the slurmdbd daemon
Upgrade the slurmdbd daemon
Restart the slurmdbd daemon
Shutdown the slurmctld daemon(s)
Shutdown the slurmd daemons on the compute nodes
Upgrade the slurmctld and slurmd daemons
Restart the slurmd daemons on the compute nodes
Restart the slurmctld daemon(s)


Chris Samuel in a previous posting had some more cautious advice about 
upgrading slurmd daemons!  I hope that Chris may offer additional insights.


/Ole


[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-17 Thread Ole Holm Nielsen



On 08/03/2016 03:04 AM, Christopher Samuel wrote:

So you always go in the order of upgrading:

* slurmdbd
* slurmctld
[recompile all plugins, MPI stacks, etc that link against Slurm]
* slurmd

We use a health check script that defines the version of Slurm that is
considered production so we can just bump that number first, wait for
all the compute nodes to be marked as drained and then as nodes become
idle we can start restarting slurmd knowing that we will never get a job
that spans both old and new slurmd's.


Obviously upgrading slurmd's which are running jobs is quite tricky!  I 
have some questions:


1. Can't you replace the health check by a global scontrol like this?
   scontrol update NodeName=<nodelist> State=drain Reason="Upgrading slurmd"


2. Do you really have to wait for *all* nodes to become drained before 
starting to upgrade?  This could take weeks!


3. Is it OK to upgrade subsets of nodes after they become drained?

4. I assume that upgraded nodes can be returned to the IDLE state by:
   scontrol update NodeName=<nodelist> State=resume

Could you possibly elaborate on the steps which you described?

FYI, I'm trying to capture this advice in my Wiki:
https://wiki.fysik.dtu.dk/niflheim/SLURM#upgrading-on-centos-7

Thanks,
Ole


[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-16 Thread Ole Holm Nielsen


On 03-08-2016 03:04, Christopher Samuel wrote:


On 03/08/16 03:13, Balaji Deivam wrote:


Right now we are using Slurm 14.11.3 and planning to upgrade to the
latest version 16.5.3. Could you please share the latest upgrade steps
document link?


Google is your friend:

http://slurm.schedmd.com/quickstart_admin.html#upgrade

It looks like it's not (yet) been changed to include reference to the
16.05.3 version but the general principle is the same.

To reinforce what it says there:

# always upgrade the SlurmDBD daemon first


A colleague told me that there may be a problem with upgrading the 
slurmdbd RPM package on CentOS 7:  The database conversion during an 
upgrade may take a very long time, causing systemd to time out and kill 
slurmdbd in the middle of the database conversion.  A manual start of 
the slurmdbd daemon is apparently the workaround.  I have not been able 
to find any discussions on the net of this issue.
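One cautious approach (my own sketch, not an official procedure) is to
run the conversion in the foreground so systemd cannot time it out:

  systemctl stop slurmdbd
  # install the new slurmdbd RPMs, then:
  slurmdbd -D -vvv     # wait for the database conversion to finish
  # stop it with Ctrl-C and start the service normally again:
  systemctl start slurmdbd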


Question: Can anyone provide slurmdbd upgrade instructions which work 
correctly on CentOS 7 (and other OSes using systemd)?


Thanks,
Ole


[slurm-dev] How to configure PAM with SLURM on CentOS 7?

2016-07-18 Thread Ole Holm Nielsen


The slurm-pam_slurm RPM package contains two shared libraries 
/lib64/security/pam_slurm*.so. The source files contrib/pam*/README have 
examples on configuring /etc/pam.d/system-auth.  The public 
documentation http://slurm.schedmd.com/faq.html#pam seems of little use 
regarding the slurm-pam_slurm RPM package.
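For reference, the contrib READMEs boil down to adding a single account
line to the relevant PAM service file, e.g. (a sketch only; the choice
of file is part of the question below):

  # /etc/pam.d/sshd, after the other account lines
  account    required     pam_slurm.so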


For CentOS/RHEL 7 the /etc/pam.d/system-auth file is controlled by the 
authconfig command, so it's really not transparent how to use PAM with 
SLURM on this system.


Question: Has anyone figured out how to configure PAM correctly on 
CentOS 7 for achieving the desired SLURM access restrictions on compute 
nodes?


Thanks,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: Trying to get a simple slurm cluster going

2016-07-18 Thread Ole Holm Nielsen


Perhaps my SLURM HowTo Wiki page 
https://wiki.fysik.dtu.dk/niflheim/SLURM could help you get started. 
 We're using CentOS 7.2, but most of the setup will be the same or 
similar for CentOS 6.


/Ole

On 07/18/2016 12:52 AM, P. Larry Nelson wrote:

While I am in search of real hardware on which to build/test Slurm,
I am attempting to just play around with it on a test VM (Scientific
Linux 6.8), which, of course, is using NATted networking and is a
standalone system protected from the outside world.

I downloaded the latest (16.05.2) tarball and ran the rpmbuild
and then installed all the rpm's.  Ran the Easy Configurator
and gave it the hostname of the VM for the ControlMachine
and the loopback address of 127.0.0.1 for the ControlAddr.

Made a munge key and it started just fine.

When I do a 'service slurm start', it responds "OK" for both slurmctld
and slurmd, but slurmctld dies right away.

If I do a 'slurmctld -Dvvv', I get:
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug:  creating clustername file: /var/spool/clustername
slurmctld: fatal: _create_clustername_file: failed to create file
/var/spool/clustername

The slurm.conf has this for ClusterName:
ClusterName=SlurmCluster

So, why is slurmctld trying to create file /var/spool/clustername
instead of /var/spool/SlurmCluster.

Slurmd and slurmctld are started as root.
I'm obviously missing something here


[slurm-dev] Re: NHC and disk / dell server health

2016-01-27 Thread Ole Holm Nielsen


On 01/27/2016 09:12 AM, Johan Guldmyr wrote:

has anybody already made some custom NHC checks that can be used to
check disk health or perhaps even hardware health on a dell server?

I've been thinking of using smartctl + NHC to test if the local disks
on the compute node is healthy.

Or for Dell hardware then "omreport" something or perhaps one could
call for example the check_openmanage nagios check from NHC..


We're extremely happy with NHC (Node Health Check was moved to 
https://github.com/mej/nhc recently) due to its numerous checks and its 
lightweight resource usage.


I haven't been able to find any command for checking disk health, since 
smartctl is completely unreliable for checking failing disks (a bad disk 
will usually have a PASSED SMART status).  What I've seen many times is 
that a disk fails partly, so the kernel remounts file systems read-only. 
 This prevents any further health checks from running, including NHC, 
and all batch jobs running on a system with read-only disks are going to 
fail (almost) silently :-(  Normally I discover this scenario due to 
user complaints.
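NHC does have a read-write mount check which can catch the read-only
remount case, as long as NHC itself is still able to run; an
illustrative nhc.conf line for a scratch filesystem (the path is just an
example):

 * || check_fs_mount_rw -f /scratch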


There is one hardware test which I do find useful for catching mostly 
memory errors. Use this NHC check in nhc.conf:


# Check Machine Check Exception (MCE, mcelog) errors (Intel only, not AMD)
* || check_hw_mcelog

You'll need to have the mcelogd daemon running.  Make a manual test by: 
mcelog --client


--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark


[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ole Holm Nielsen


On 01/06/2016 06:03 AM, Novosielski, Ryan wrote:

I haven't gotten all the way through it, but this is really good so far!
I am curious though -- not seen this problem -- what is the thing about
RHEL7 and systemd? I've not seen any problem there. I use CentOS 7.1 and
7.2. Maybe not an issue anymore?


There's no "thing" with RHEL7 and systemd.  It's just what Red Hat 
decided to base RHEL7 on - and which has generated much discussion.  In 
my Wiki I provide information needed for using SLURM with systemd, 
hoping that people may find it useful.


/Ole


[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ole Holm Nielsen


On 01/06/2016 02:54 PM, Novosielski, Ryan wrote:

Sorry, was referring to: "running RHEL 7 there is a bug systemctl
start/stop does not work on RHEL 7."  I should have clicked
the link. In any case, I've not noticed this problem so perhaps it was
fixed in either SLURM or systemd?


With SLURM 15.08 I don't have this systemctl problem any longer, which I 
first discovered when we had SLURM 14.x.  I've added this information to 
the Wiki now at 
https://wiki.fysik.dtu.dk/niflheim/SLURM#master-server-configuration


/Ole


[slurm-dev] Re: more detailed installation guide

2016-01-05 Thread Ole Holm Nielsen


On 01/05/2016 06:04 PM, Daniel Letai wrote:


Just one comment regarding openmpi building:
https://wiki.fysik.dtu.dk/niflheim/SLURM#mpi-setup - At least with
regard to openmpi, it should be built --with-pmi


Thanks for your comment.  I added a link to PMI, hope that's correct.

/Ole



On 01/05/2016 01:26 PM, Ole Holm Nielsen wrote:


On 01/05/2016 12:12 PM, Randy Bin Lin wrote:

I was wondering if anyone has a more detailed installation guide than
the official guide below:

http://slurm.schedmd.com/quickstart_admin.html

I got the general idea how to install slurm on a local linux cluster.
but still don’t know how to do it exactly. a step-by-step guide will be
great if any one has it. please share it with me. it is greatly
appreciated.


I had the same problem as you, so I've been writing a Wiki page about
simple SLURM installation for a generic CentOS 7 Linux cluster. Please
see:
https://wiki.fysik.dtu.dk/niflheim/SLURM

While this page reflects work in progress, I've gathered much
information which is scattered across many SLURM docs pages and other
relevant pages.  Getting relevant information for getting started is
sometimes like the needle-in-haystack problem :-)

The Wiki refers to CentOS 7 (and RHEL 7) configurations, but a number
of points should be valid for other Linuxes as well.

If there are errors or missing points in this Wiki page, please write
to me.


[slurm-dev] Re: more detailed installation guide

2016-01-05 Thread Ole Holm Nielsen


On 01/05/2016 12:12 PM, Randy Bin Lin wrote:

I was wondering if anyone has a more detailed installation guide than
the official guide below:

http://slurm.schedmd.com/quickstart_admin.html

I got the general idea how to install slurm on a local linux cluster.
but still don’t know how to do it exactly. a step-by-step guide will be
great if any one has it. please share it with me. it is greatly appreciated.


I had the same problem as you, so I've been writing a Wiki page about 
simple SLURM installation for a generic CentOS 7 Linux cluster. Please see:

https://wiki.fysik.dtu.dk/niflheim/SLURM

While this page reflects work in progress, I've gathered much 
information which is scattered across many SLURM docs pages and other 
relevant pages.  Getting relevant information for getting started is 
sometimes like the needle-in-haystack problem :-)


The Wiki refers to CentOS 7 (and RHEL 7) configurations, but a number of 
points should be valid for other Linuxes as well.


If there are errors or missing points in this Wiki page, please write to me.

Thanks,
Ole

--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark


[slurm-dev] srun: error: Unable to allocate resources: Requested node configuration is not available

2015-12-17 Thread Ole Holm Nielsen


I'm setting up a new test cluster with SLURM 15.08.5 (we run Torque/Maui 
on our production cluster).  We have a SLURM master server running 
CentOS 7.2 and two compute nodes on separate subnets (10.1.. and 
10.2..).  I'm writing a SLURM installation HowTo page as I go along:

https://wiki.fysik.dtu.dk/niflheim/SLURM

I'm now facing a problem running a trivial test:
# srun -N1 --constraint="opteron4" /bin/hostname
srun: error: Unable to allocate resources: Requested node configuration 
is not available


Question: What may be causing the available node with property 
"opteron4" to reject jobs?



The other partition works just fine:
# srun -N1 --constraint="xeon8" /bin/hostname
a012.dcsc.fysik.dtu.dk

FYI, the node status is:

# scontrol show nodes
NodeName=a012 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.01 
Features=xeon5570,hp5412e,ethernet,xeon8

   Gres=(null)
   NodeAddr=a012 NodeHostName=a012 Version=15.08
   OS=Linux RealMemory=23900 AllocMem=0 FreeMem=22859 Sockets=2 Boards=1
   State=IDLE+COMPLETING ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
   BootTime=2015-09-08T16:25:29 SlurmdStartTime=2015-12-16T15:29:32
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=q007 Arch=x86_64 CoresPerSocket=2
   CPUAlloc=0 CPUErr=0 CPUTot=4 CPULoad=0.01 
Features=opteron2218,hp5412b,ethernet,opteron4

   Gres=(null)
   NodeAddr=q007 NodeHostName=q007 Version=15.08
   OS=Linux RealMemory=7820 AllocMem=0 FreeMem=7584 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
   BootTime=2015-12-17T08:40:49 SlurmdStartTime=2015-12-17T08:41:03
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I believe that the nodes are configured identically, except for their 
hardware differences.


--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] SLURM 14.11 broken on CentOS 7.0?

2014-11-26 Thread Ole Holm Nielsen

Hi,

Can anyone help in getting SLURM 14.11 to work correctly on CentOS 7.0?

I'm testing SLURM 14.11 on CentOS 7.0.  The default SLURM RPM is built 
by: rpmbuild -ta slurm-14.11.0.tar.bz2.  As a bare minimum test cluster, 
I have a master PC running slurmctld, and just 1 compute node PC running 
slurmd (see note below regarding bug in slurm systemctl).


To verify my SLURM installation I have run the test script 
testsuite/expect/regression, and unfortunately I get 13 FAILURE messages 
in total:


# grep FAILURE regression-25-Nov-2014.log
FAILURE: Login and slurm user info mismatch
test1.1 FAILURE
FAILURE: Login and slurm user info mismatch
test1.8 FAILURE
FAILURE: srun is not generating output (2 != 20)
test1.69 FAILURE
FAILURE: scontrol failed to find matching job (5 != 6)
FAILURE: waiting for job 7927 to run
FAILURE: scontrol failed to find all job steps
FAILURE: scontrol failed to specific job step
test2.8 FAILURE
test7.17 FAILURE
FAILURE: sacct reporting failed (3  4)
test12.2 FAILURE
FAILURE: Login and sbatch user info mismatch
test17.4 FAILURE
test17.35 FAILURE
test24.1 FAILURE
test24.3 FAILURE
test24.4 FAILURE
test30.1 FAILURE
test33.1 FAILURE

(the full test output is in the attached file).

Since I'm new to SLURM, I have no clue whether each of these failures 
are benign or critical.  I'm assuming that SLURM on CentOS 7 is broken 
until someone more knowledgeable can help in decoding the test output.


Maybe some tests fail because my cluster has only 1 compute node, but 
who can tell if this is the case?


BTW, the compute node slurmd unfortunately can't be started in the 
normal way, as reported in bug 1182: systemctl start/stop does not work 
on RHEL 7, see http://bugs.schedmd.com/show_bug.cgi?id=1182.  One must 
modify /etc/slurm/slurm.conf and start slurm manually by: cd 
/etc/init.d; ./slurm start.  This bug should be fixed!


My SLURM configuration is extremely simple:

# sed '/^#/d' /etc/slurm/slurm.conf
ControlMachine=bell
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/filetxt
ClusterName=cluster
JobAcctGatherType=jobacct_gather/linux
NodeName=hertz State=UNKNOWN CoresPerSocket=2 RealMemory=3688 TmpDisk=23988
PartitionName=testing Nodes=hertz Default=YES MaxTime=3000 State=UP

Thanks for any help,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


regression-25-Nov-2014.log.gz
Description: application/gzip


[slurm-dev] Re: Health Check Program

2013-01-15 Thread Ole Holm Nielsen

On 01/15/2013 11:36 PM, Paul Edmon wrote:

 So does any one have an example node health check script for SLURM? One
 that would be run by HealthCheckProgram defined in slurm.conf. I'd
 rather not reinvent the wheel if I don't have to.  Thanks.

Probably the most comprehensive and lightweight health check tool out 
there is Node Health Check: 
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check.  It has 
integration with SLURM as well as Torque resource managers. NHC works 
great for us.

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Re: Scheduling in a heterogeneous cluster?

2013-01-05 Thread Ole Holm Nielsen

On 04-01-2013 18:28, Ralph Castain wrote:
 FWIW: I believe we have a mapper in Open MPI that does what you want - i.e., 
 it looks at the number of available cpus on each node, and maps the processes 
 to maximize the number of procs co-located on nodes. In your described case, 
 it would tend to favor the 16ppn nodes as that would provide the best MPI 
 performance, then move to the 8ppn nodes, etc.

Yes, OpenMPI can be tweaked into different task layouts on the available set of 
nodes.  But the problem at hand is for the scheduler to allocate some set of 
nodes, given that the user want a specific number of MPI tasks (say, 32), and 
given that there are heterogeneous nodes available with 4, 8 or 16 cores.  As 
Moe wrote, this flexibility would be hard to achieve with Slurm (with Maui it's 
impossible).

Regards,
Ole


[slurm-dev] Re: Scheduling in a heterogeneous cluster?

2013-01-05 Thread Ole Holm Nielsen

 On 04-01-2013 18:28, Ralph Castain wrote:
 FWIW: I believe we have a mapper in Open MPI that does what you want - 
 i.e., it looks at the number of available cpus on each node, and maps the 
 processes to maximize the number of procs co-located on nodes. In your 
 described case, it would tend to favor the 16ppn nodes as that would 
 provide the best MPI performance, then move to the 8ppn nodes, etc.

 Yes, OpenMPI can be tweaked into different task layouts on the available set 
 of
 nodes.  But the problem at hand is for the scheduler to allocate some set of
 nodes, given that the user want a specific number of MPI tasks (say, 32), and
 given that there are heterogeneous nodes available with 4, 8 or 16 cores.  As
 Moe wrote, this flexibility would be hard to achieve with Slurm (with Maui 
 it's
 impossible).

 I think you may have misunderstood both Moe and me. I believe Moe was pointing 
 out that you would need a new plugin to provide that capability, and I 
 pointed out that we already have that algorithm in OMPI and could port it to 
 a Slurm plugin if desired. That said, it would take some time to make that 
 happen, so it may not resolve your problem.

Thanks for enlightening me!  Sounds interesting, and may be worth a future 
effort because so many clusters are (or will become) heterogeneous!

Regards,
Ole


[slurm-dev] Re: Scheduling in a heterogeneous cluster?

2013-01-03 Thread Ole Holm Nielsen



On 01/03/2013 04:26 PM, Ralph Castain wrote:

 Call me puzzled - are you sure you have Maui configured correctly? Typically, 
 you configure it for exclusive node allocation, and then just ask it for a 
 number of slots. It doesn't *require* you to specify nodes or ppn - those are 
 options. Might be worth checking for advice on that mailing list.

We don't configure exclusive node allocation in Maui because we want to 
cater for multiple single-CPU jobs per node.

Though I haven't tested it, I think that with Maui's exclusive node 
allocation, if you submit with qsub -l nodes=32 you would get 32 full 
nodes with either 4, 8 or 16 cores per node in our cluster.

Thanks,
Ole

 On Jan 3, 2013, at 6:41 AM, Ole Holm Nielsen ole.h.niel...@fysik.dtu.dk 
 wrote:


 We're currently running a Torque/Maui batch system on our cluster, but
 we're evaluating whether to transition to Slurm. Like most other sites,
 we have a heterogeneous cluster consisting of several generations of
 hardware (all nodes are dual-socket with either 2, 4 or 8 cores/socket).

 One gripe we have about the Maui scheduler is its inability to
 dynamically schedule jobs to different types of hardware while
 maintaining an optimal layout of tasks. Our MPI jobs don't care how
 they're distributed across nodes, but they should use entire nodes for
 performance reasons.  We do permit multiple jobs per node (shared nodes)
 because there's a need for jobs using fewer cores than in an entire node.

 I've studied the Slurm examples in
 http://www.schedmd.com/slurmdocs/cpu_management.html#Section4 but I
 don't seem to find any mention of using heterogeneous hardware like ours.

 As an example, assume a user wants to submit a 32-task MPI job. With
 Torque he would have to submit in one of these mutually exclusive ways
 in order to achieve an optimal layout:
 qsub -l nodes=8,ppn=4
 qsub -l nodes=4,ppn=8
 qsub -l nodes=2,ppn=16
 If he were to submit to just 32 nodes:
 qsub -l nodes=32
 (which would be the simplest for users to understand), the job would get
 scattered across irregular sets of nodes in stead of being packed into
 an optimal set of nodes, so we unfortunately have to disallow this type
 of usage.

 Question: Can Slurm be configured for heterogeneous hardware allowing
 users to submit, for example, srun --ntasks=32 and get any one of the
 optimal layouts discussed above?

 FYI, we have many different types of jobs in our cluster. Users are
 allocating anywhere from 1 core to ~1000 cores per job.

 Thanks,
 Ole

 --
 Ole Holm Nielsen
 Department of Physics, Technical University of Denmark

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://www.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620 / Fax: (+45) 4593 2399


[slurm-dev] Re: Scheduling in a heterogeneous cluster?

2013-01-03 Thread Ole Holm Nielsen

On 03-01-2013 18:34, Moe Jette wrote:

 Slurm has an option --ntasks-per-node, so there are equivalents to
 qsub -l nodes=8,ppn=4
 qsub -l nodes=4,ppn=8
 qsub -l nodes=2,ppn=16
 like this
 sbatch -N8 --ntasks-per-node=4
 sbatch -N4 --ntasks-per-node=8
 sbatch -N2 --ntasks-per-node=16

 You could use a job_submit plugin to map sbatch -n32 to _one_ of the
 above task distributions, but you would need to modify the Slurm code
 for it to use only one of these distributions.

Moe, thanks for the hint! I heard your impressive presentation at the SC'12 
booth and wanted to investigate the option of switching our cluster to Slurm.

Do I understand you correctly that the node layout would have to be decided 
manually with the Slurm scheduler? I.e., Slurm has no way of automatically and 
dynamically mapping jobs to whatever kinds of nodes are free (with different 
numbers of cores per node), just like the situation with Torque/Maui?

Thanks,
Ole


 Quoting Ole Holm Nielsen ole.h.niel...@fysik.dtu.dk:


 We're currently running a Torque/Maui batch system on our cluster, but
 we're evaluating whether to transition to Slurm. Like most other sites,
 we have a heterogeneous cluster consisting of several generations of
 hardware (all nodes are dual-socket with either 2, 4 or 8 cores/socket).

 One gripe we have about the Maui scheduler is its inability to
 dynamically schedule jobs to different types of hardware while
 maintaining an optimal layout of tasks. Our MPI jobs don't care how
 they're distributed across nodes, but they should use entire nodes for
 performance reasons.  We do permit multiple jobs per node (shared nodes)
 because there's a need for jobs using fewer cores than in an entire node.

 I've studied the Slurm examples in
 http://www.schedmd.com/slurmdocs/cpu_management.html#Section4 but I
 don't seem to find any mention of using heterogeneous hardware like ours.

 As an example, assume a user wants to submit a 32-task MPI job. With
 Torque he would have to submit in one of these mutually exclusive ways
 in order to achieve an optimal layout:
 qsub -l nodes=8,ppn=4
 qsub -l nodes=4,ppn=8
 qsub -l nodes=2,ppn=16
 If he were to submit to just 32 nodes:
 qsub -l nodes=32
 (which would be the simplest for users to understand), the job would get
 scattered across irregular sets of nodes in stead of being packed into
 an optimal set of nodes, so we unfortunately have to disallow this type
 of usage.

 Question: Can Slurm be configured for heterogeneous hardware allowing
 users to submit, for example, srun --ntasks=32 and get any one of the
 optimal layouts discussed above?

 FYI, we have many different types of jobs in our cluster. Users are
 allocating anywhere from 1 core to ~1000 cores per job.

 Thanks,
 Ole

 --
 Ole Holm Nielsen
 Department of Physics, Technical University of Denmark



