[slurm-dev] Re:

2017-05-09 Thread Lachlan Musicman
Ignore this  - I discovered the problem. A couple of bpipe jobs from three
weeks ago were zombied and eating all the memory.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collective transformation, rooted in
grief and rage but pointed towards vision and dreams."

 - Patrice Cullors, *Black Lives Matter founder*

On 10 May 2017 at 10:57, Lachlan Musicman  wrote:

> Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive
> session with
>
> srun -w papr-expanded01 --pty --mem 8192 -t 06:00 /bin/bash
> --partition=expanded
> srun -w papr-expanded01 --pty -t 06:00 /bin/bash --partition=expanded
> srun -w papr-expanded01 --pty --mem 8192 /bin/bash --partition=expanded
> srun -w papr-expanded01 --pty /bin/bash --partition=expanded
>
> No matter what I change (including user), I always get
>
> srun: error: Unable to allocate resources: Requested node configuration is
> not available
>
> which is the same as in the logs. There is no other debug message. Any
> hints on what I'm doing wrong?
>
> (notes: the node has enough memory, has sync'd time with head node, am
> using users with access to partitions)
>
> cheers
> L.
>
>
> --
> "Mission Statement: To provide hope and inspiration for collective action,
> to build collective power, to achieve collective transformation, rooted in
> grief and rage but pointed towards vision and dreams."
>
>  - Patrice Cullors, *Black Lives Matter founder*
>


[slurm-dev]

2017-05-09 Thread Lachlan Musicman
Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive
session with

srun -w papr-expanded01 --pty --mem 8192 -t 06:00 /bin/bash
--partition=expanded
srun -w papr-expanded01 --pty -t 06:00 /bin/bash --partition=expanded
srun -w papr-expanded01 --pty --mem 8192 /bin/bash --partition=expanded
srun -w papr-expanded01 --pty /bin/bash --partition=expanded

No matter what I change (including user), I always get

srun: error: Unable to allocate resources: Requested node configuration is
not available

which is the same as in the logs. There is no other debug message. Any
hints on what I'm doing wrong?

(notes: the node has enough memory, has sync'd time with head node, am
using users with access to partitions)
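A quick cross-check (a sketch only, reusing the node and partition names from
the srun commands above) is to compare the request against what the controller
actually has configured:

# What does slurmctld think the node provides?  Look at RealMemory, AllocMem,
# FreeMem, CPUTot/CPUAlloc and State -- "Requested node configuration is not
# available" generally means no configured node can satisfy the request.
scontrol show node papr-expanded01

# Does the partition contain the node, and what are its limits?
scontrol show partition expanded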

cheers
L.


--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collective transformation, rooted in
grief and rage but pointed towards vision and dreams."

 - Patrice Cullors, *Black Lives Matter founder*


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Jeffrey Frey

>> The primary problem I've had with ib2slurm is that it segfaults.  There's a 
>> bug in the ibnetdiscover library -- ib2slurm passes a NULL config pointer to 
>> the ibnd_discover_fabric() which is supposed to be okay according to the 
>> documentation, but that function actually requires a config structure to be 
>> passed to it.
>> 
>> 
>> I've forked and updated ib2slurm:
>> 
>> 
>>  https://github.com/jtfrey/ib2slurm
>> 
>> 
>> There doesn't appear to be much movement on the original project, so I 
>> haven't put in a pull request against my fork.
> 
> It's great that you're trying to revive the outdated ib2slurm project!  The Slurm
> pages on topology.conf should point to your project instead of the seemingly
> dead ib2slurm.
> 
> That said, the concept of ib2slurm seems to be that you use ibnetdiscover to 
> generate a cache file which is subsequently parsed by ib2slurm.  My 
> slurmibtopology.sh script uses the same idea: Parse the output from 
> ibnetdiscover and generate a topology.conf file.  For me personally it was 
> easier to use awk than C for this task because awk supports associative 
> arrays.
> 
> /Ole


Thanks.  The original ib2slurm wasn't able to export SLURM-style compacted 
hostlists.  Rather than implementing that myself, I've opted to (optionally) 
link ib2slurm against libslurm and make use of SLURM's native hostlist 
functionality to handle the compaction.  I've also added a CMakeLists.txt so 
that CMake builds should be possible.  Both changes are in the GitHub repo.

https://github.com/jtfrey/ib2slurm
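As a quick way to see what that libslurm hostlist compaction produces without
building anything, scontrol exposes the same functionality on the command line
(a standard Slurm install is assumed here; the node names below are made up):

# Compact a plain comma-separated list into SLURM hostlist syntax ...
scontrol show hostlist node001,node002,node003,node010
# ... and expand a compacted expression back to one hostname per line
scontrol show hostnames node[001-003,010]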


> use ibnetdiscover to generate a cache file which is subsequently parsed by 
> ib2slurm

ib2slurm uses the same library functions that the ibnetdiscover utility uses.  
Primarily, it produces an in-memory representation of the network topology.  
For large networks or situations where a topology is routinely gathered and 
cached using the ibnetdiscover library, the cache file can be reused by 
ib2slurm.  Otherwise, ib2slurm must walk the network to produce the in-memory 
topology each time (and that can be slow, especially if you're trying to get a 
node name mapping file debugged).




::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::





[slurm-dev] Re: Issue to startup slurm daemon on Compute nodes

2017-05-09 Thread John Hearns
Following on from Maik's response,
it would be worth mentioning the compat-glibc package for CentOS

https://centos-packages.com/7/package/compat-glibc/
https://www.centos.org/forums/viewtopic.php?t=22250

Big get-out-of-jail card: I have never built any version of Slurm on a
CentOS 7 system using the compat-glibc libraries!



On 9 May 2017 at 15:59, Maik Schmidt  wrote:

> It means you have to build SLURM on the node with the oldest glibc that
> you might still have in your cluster. It will then also run on the ones
> with newer glibc versions, just not the other way around.
>
> Best,
> Maik
>
>
> On 09.05.2017 at 15:49, J. Smith wrote:
>
>> Hi,
>>
>> I have compiled Slurm v17.02.2 on the master nodes running CentOS 7.
>> I have no issue starting Slurm on the master nodes, but I am unable to
>> start the daemon on the compute nodes running CentOS 6. It is looking for
>> GLIBC 2.14, which is not available on our compute nodes (they use glibc 2.12).
>>
>> Error:
>> service slurm status
>> /home/share/slurm/17.02.2/bin/scontrol: /lib64/libc.so.6: version
>> `GLIBC_2.14' not found (required by /home/share/slurm/17.02.2/bin/
>> scontrol)
>>
>> Does that mean that Slurm will only work across compute nodes running
>> CentOS 7 and not CentOS 6? Any suggestions?
>>
>> Thanks!
>>
>> --
> Maik Schmidt
> HPC Services
>
> Technische Universität Dresden
> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
> Willers-Bau A116
> D-01062 Dresden
> Telefon: +49 351 463-32836
>
>
>


[slurm-dev] Re: Partition default job time limit

2017-05-09 Thread Paul Edmon

Yes.  Use the DefaultTime option.


*DefaultTime*
   Run time limit used for jobs that don't specify a value. If not set
   then MaxTime will be used. Format is the same as for MaxTime. 



https://slurm.schedmd.com/slurm.conf.html
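For example, a partition defined roughly like this (a sketch; the partition and
node names are made up) gives jobs a one-hour default while still allowing
requests of up to one day:

# slurm.conf sketch: jobs submitted without a time limit get DefaultTime,
# while explicit requests may go up to MaxTime
PartitionName=short Nodes=node[01-16] DefaultTime=01:00:00 MaxTime=1-00:00:00 State=UP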

-Paul Edmon-

On 05/09/2017 05:35 AM, Georg Hildebrand wrote:

Partition default job time limit
Hi @here,

Is it possible to have a default job time limit for a Slurm partition 
that is lower than the MaxTime?


Viele Grüße / kind regards
Georg






[slurm-dev] Re: Issue to startup slurm daemon on Compute nodes

2017-05-09 Thread Maik Schmidt
It means you have to build SLURM on the node with the oldest glibc that 
you might still have in your cluster. It will then also run on the ones 
with newer glibc versions, just not the other way around.
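If in doubt which glibc a particular build ends up needing, the binaries can be
inspected before rolling them out (a sketch; the path is the one from the error
message quoted below):

# List the GLIBC symbol versions a binary requires; the highest one must exist
# on every node that will run it (CentOS 6 ships glibc 2.12)
objdump -T /home/share/slurm/17.02.2/bin/scontrol | grep -o 'GLIBC_[0-9.]*' | sort -uV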


Best,
Maik

On 09.05.2017 at 15:49, J. Smith wrote:

Hi,

I have compiled Slurm v17.02.2 on the master nodes running CentOS 7.
I have no issue starting Slurm on the master nodes, but I am unable to
start the daemon on the compute nodes running CentOS 6. It is looking for
GLIBC 2.14, which is not available on our compute nodes (they use glibc 2.12).

Error:
service slurm status
/home/share/slurm/17.02.2/bin/scontrol: /lib64/libc.so.6: version
`GLIBC_2.14' not found (required by /home/share/slurm/17.02.2/bin/scontrol)

Does that mean that Slurm will only work across compute nodes running
CentOS 7 and not CentOS 6? Any suggestions?

Thanks!


--
Maik Schmidt
HPC Services

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
Willers-Bau A116
D-01062 Dresden
Telefon: +49 351 463-32836






[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.40

2017-05-09 Thread Ole Holm Nielsen



I'm announcing an updated version 0.40 of the node status tool "pestat" 
for Slurm.


Download the tool (a short bash script) from 
https://ftp.fysik.dtu.dk/Slurm/pestat


Thanks to Daniel Letai for recommending better script coding styles. If 
your commands do not live in /usr/bin, please make appropriate changes 
in the CONFIGURE section at the top of the script.


New options have been added as shown by the help information:

# pestat -h
Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s statelist] [-f] [-V] [-h]

where:
-p partition: Select only partition <partition>
-u username: Print only user <username>
-q qoslist: Print only QOS in the qoslist <qoslist>
-s statelist: Print only nodes with state in <statelist>
-f: Print only nodes that are flagged by * (unexpected load etc.)
-h: Print this help information
-V: Version information

I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status.


The -s option is useful for checking on possibly unusual node states, 
for example "pestat -s mix".


--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] RE: Announce: Infiniband topology tool "slurmibtopology.sh" updated version 0.21

2017-05-09 Thread Yaron Weitz

Thanks Ole.

I just rebuilt my InfiniBand fabric, and your script will surely help me.

Yaron

-Original Message-
From: Ole Holm Nielsen [mailto:ole.h.niel...@fysik.dtu.dk]
Sent: Tuesday, May 9, 2017 12:43 PM
To: slurm-dev 
Subject: [slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh"
updated version 0.21


I'm announcing an updated version 0.21 of an Infiniband topology tool
"slurmibtopology.sh" for Slurm.  The output may be used as a starting point
for writing your own topology.conf file.

Download the script from https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Thanks to Felip Moll for testing the script on a rather large IB network.

Motivation: I had to create a Slurm topology.conf file and needed an
automated way to get the correct node and switch Infiniband connectivity.
The manual page https://slurm.schedmd.com/topology.conf.html refers to an
outdated tool ib2slurm.

Version 0.21 reads switch-to-switch links and prints out lines with
"Switches=..." for those switches with 0 compute node (HCA) links.
An option -c will delete all the (possibly useful) comment lines.

Example: Running this script on our Infiniband network:

# ./slurmibtopology.sh -c
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches.

Beware: The Switches= lines need to be reviewed and edited for correctness.
Read also https://slurm.schedmd.com/topology.html

SwitchName=ibsw1 Nodes=i[001-028]
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
SwitchName=ibsw4 Switches=ibsw[2-3,7]
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]


It would be great if other sites could test this tool on their Infiniband
network and report bugs or suggest improvements.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh" updated version 0.21

2017-05-09 Thread Ole Holm Nielsen


I'm announcing an updated version 0.21 of an Infiniband topology tool 
"slurmibtopology.sh" for Slurm.  The output may be used as a starting 
point for writing your own topology.conf file.


Download the script from https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Thanks to Felip Moll for testing the script on a rather large IB network.

Motivation: I had to create a Slurm topology.conf file and needed an 
automated way to get the correct node and switch Infiniband 
connectivity.  The manual page 
https://slurm.schedmd.com/topology.conf.html refers to an outdated tool 
ib2slurm.


Version 0.21 reads switch-to-switch links and prints out lines with 
"Switches=..." for those switches with 0 compute node (HCA) links.

An option -c will delete all the (possibly useful) comment lines.

Example: Running this script on our Infiniband network:

# ./slurmibtopology.sh -c
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches.

Beware: The Switches= lines need to be reviewed and edited for correctness.
Read also https://slurm.schedmd.com/topology.html

SwitchName=ibsw1 Nodes=i[001-028]
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
SwitchName=ibsw4 Switches=ibsw[2-3,7]
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]


It would be great if other sites could test this tool on their 
Infiniband network and report bugs or suggest improvements.


--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Partition default job time limit

2017-05-09 Thread Georg Hildebrand
Hi @here,

Is it possible to have a default job time limit for a Slurm partition that
is lower than the MaxTime?

Viele Grüße / kind regards
Georg


[slurm-dev] Inconsistent job timings with openMPI's mpirun

2017-05-09 Thread Gunnar Jansen
I am experiencing inconsistent timings running the exact same job file on a
small cluster using slurm 14.11.8 and OpenMPI 1.8.8.

The test case I am running is the device-device latency test from the
osu-micro-benchmarks-5.0 suite.

I use the following slurm script file:

#!/bin/bash
#
#SBATCH --job-name=CudaRDMA
#SBATCH --nodes=2
#SBATCH --time=10:00
#SBATCH --gres=gpu:1

module load openmpi/1.8.8
module load cuda


export OMPI_MCA_btl_openib_want_cuda_gdr=1
export OMPI_MCA_btl_openib_cuda_rdma_limit=65537

time mpirun -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit 65537 \
  -np 2 -npernode 1 -bind-to core -report-bindings -x CUDA_VISIBLE_DEVICES=0 \
  /home/janseng/tests/osu-micro-benchmarks-5.0/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency \
  -d cuda D D

If I run this batch file via sbatch a couple of times I encounter very
different timings:

real    0m24.996s
user    0m5.280s
sys     0m1.328s

real    4m24.126s
user    0m5.200s
sys     0m1.191s

real    2m40.372s
user    0m5.187s
sys     0m1.188s

These are just a few of my timings; the majority lie between 2 and 4 minutes.
However, the ~25 s run should be the most realistic timing, especially given
the nearly constant user and sys times.

The actual latency test results are nearly identical and don't show any
problems in the communication. The issue is also not limited to this test
case but frequently occurs in other situations as well.

Has anybody else encountered issues like this and can point me in the right
direction?
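(Not part of the original report, just one way to narrow this down: if
accounting is enabled, Slurm's own record of the job can be compared against
the shell timings to see whether the extra minutes are spent before or inside
the MPI run.)

# Compare the shell's "time" output with what Slurm accounted for the job and
# its steps; a large gap between Elapsed and the benchmark's own runtime points
# at launch/wire-up overhead rather than the test itself
sacct -j <jobid> --format=JobID,JobName,Elapsed,Start,End,NodeList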

Gunnar


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Janne Blomqvist


On 2017-05-09 10:27, Ole Holm Nielsen wrote:


On 05/09/2017 09:14 AM, Janne Blomqvist wrote:


On 2017-05-07 15:29, Ole Holm Nielsen wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool
"slurmibtopology.sh" for Slurm.


I have also created one, at

https://github.com/jabl/ibtopotool

You need the python networkx library (python-networkx package on centos
& Ubuntu, or install via pip).

Run with --help option to get some usage instructions. In addition to
generating slurm topology.conf, it can also generate graphviz dot files
for visualization.


Thanks for providing this tool to the Slurm community.  It seems that
tools for generating topology.conf have been developed in many places,
probably because it's an important task.

I installed python-networkx 1.8.1-12.el7 from EPEL on our CentOS 7.3
system and then executed ibtopotool.py, but it gives an error message:

# ./ibtopotool.py
Traceback (most recent call last):
  File "./ibtopotool.py", line 216, in 
graph = parse_ibtopo(args[0], options.shortlabels)
IndexError: list index out of range

Could you help solving this?


Duh.. I just pushed a fix, thanks for reporting.


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Creating init script in /etc/init.d while building from source

2017-05-09 Thread Gennaro Oliva

Hi,

On Tue, May 09, 2017 at 12:17:11AM -0700, Janne Blomqvist wrote:
> for Ubuntu 16.04 you should be using the systemd service files instead of
> init.d scripts. They are part of the RPM files when building for Red Hat
> based systems; I don't know about Ubuntu, but presumably you can find them
> somewhere in the source tree.

Maybe you can adapt these to your environment:

https://anonscm.debian.org/cgit/collab-maint/slurm-llnl.git/tree/debian/slurmctld.service
https://anonscm.debian.org/cgit/collab-maint/slurm-llnl.git/tree/debian/slurmd.service
https://anonscm.debian.org/cgit/collab-maint/slurm-llnl.git/tree/debian/slurmdbd.service
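If those unit files fit, installing them is mostly a matter of dropping them
into systemd's search path and enabling the services (a sketch; the ExecStart
paths inside the units will likely need adjusting for a source build):

# Install and activate slurmd on a compute node; slurmctld/slurmdbd analogous
cp slurmd.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now slurmd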

Best regards
-- 
Gennaro Oliva


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Ole Holm Nielsen


On 05/09/2017 09:14 AM, Janne Blomqvist wrote:


On 2017-05-07 15:29, Ole Holm Nielsen wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool
"slurmibtopology.sh" for Slurm.


I have also created one, at

https://github.com/jabl/ibtopotool

You need the python networkx library (python-networkx package on centos
& Ubuntu, or install via pip).

Run with --help option to get some usage instructions. In addition to
generating slurm topology.conf, it can also generate graphviz dot files
for visualization.


Thanks for providing this tool to the Slurm community.  It seems that 
tools for generating topology.conf have been developed in many places, 
probably because it's an important task.


I installed python-networkx 1.8.1-12.el7 from EPEL on our CentOS 7.3 
system and then executed ibtopotool.py, but it gives an error message:


# ./ibtopotool.py
Traceback (most recent call last):
  File "./ibtopotool.py", line 216, in 
graph = parse_ibtopo(args[0], options.shortlabels)
IndexError: list index out of range

Could you help solving this?

Thanks,
Ole


[slurm-dev] Re: Creating init script in /etc/init.d while building from source

2017-05-09 Thread Janne Blomqvist


On 2017-05-09 09:09, Dhiraj Reddy wrote:

Hi,

How do I create slurmd and slurmctld init scripts in /etc/init.d when
building and installing Slurm from source?

I think something needs to be done with the ./init.d.slurm files in the /etc
directory, but I don't know what to do.

I am using Ubuntu 16.04.

Thanks
Dhiraj


Hi,

for Ubuntu 16.04 you should be using the systemd service files instead 
of init.d scripts. They are part of the RPM files when building for Red Hat 
based systems; I don't know about Ubuntu, but presumably you can find 
them somewhere in the source tree.
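A quick way to check whether the source tree you built from already carries
unit-file templates (this is an assumption about the source layout, not a
guarantee):

# Run from the top of the unpacked Slurm source tree; newer releases ship
# *.service.in templates that configure turns into ready-to-install unit files
find . -name '*.service*'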


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Janne Blomqvist


On 2017-05-07 15:29, Ole Holm Nielsen wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool
"slurmibtopology.sh" for Slurm.


I have also created one, at

https://github.com/jabl/ibtopotool

You need the python networkx library (python-networkx package on centos 
& Ubuntu, or install via pip).


Run with --help option to get some usage instructions. In addition to 
generating slurm topology.conf, it can also generate graphviz dot files 
for visualization.



--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Ole Holm Nielsen


Hi Damien,

Thanks for your positive feedback!  I'll be posting an updated script 
soon which works better for multilevel IB networks.


I would however warn anyone against using the output of 
slurmibtopology.sh directly in an automated procedure.  I strongly 
recommend a manual review step as well, in a procedure like this:


1. Use slurmibtopology.sh to generate topology.conf.  It should show 
correctly the leaf switches and their links to compute nodes.


2. Review your 2nd and 3rd level network topology as discussed in 
https://slurm.schedmd.com/topology.html.  Heed in particular this statement:

As a practical matter, listing every switch connection definitely results in a 
slower scheduling algorithm for Slurm to optimize job placement. The 
application performance may achieve little benefit from such optimization. 
Listing the leaf switches with their nodes plus one top level switch should 
result in good performance for both applications and Slurm.


3. In the generated topology.conf you should keep only one top-level 
switch and delete the others (see the sketch after this list).


4. Copy the edited topology.conf to your cluster.
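Following the sample output further down in this thread, the edited file could
end up looking roughly like this (sketch only; switch and node names are taken
from that sample, and your own second-level layout will differ):

# topology.conf after manual review: leaf switches with their node lists,
# plus a single top-level switch tying the leaves together
SwitchName=ibsw1 Nodes=i[001-028]
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
# ... remaining leaf switches ...
SwitchName=spineswitch Switches=ibsw[1-8]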

/Ole

On 05/08/2017 04:56 PM, Damien François wrote:

Hi

many thanks for the tool; it works flawlessly here.

I just patched it to send the output that does not belong in topology.conf to 
stderr, so I could simply redirect to topology.conf in an automated Slurm 
install procedure:

16c16
< echo Verify the Infiniband interface: >&2
---
> echo Verify the Infiniband interface:
18c18
< if $IBSTAT -l >&2
---
> if $IBSTAT -l
20c20
< echo Infiniband interface OK >&2
---
> echo Infiniband interface OK
22c22
< echo Infiniband interface NOT OK >&2
---
> echo Infiniband interface NOT OK
26c26
< cat <&2
---
> cat <

Sincerely

damien


On 07 May 2017, at 14:29, Ole Holm Nielsen  wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool 
"slurmibtopology.sh" for Slurm.

I had to create a Slurm topology.conf file and needed an automated way to get 
the correct node and switch Infiniband connectivity.  The manual page 
https://slurm.schedmd.com/topology.conf.html refers to an outdated tool 
ib2slurm.

Inspired by the script in 
https://unix.stackexchange.com/questions/255472/text-processing-building-a-slurm-topology-conf-file-from-ibnetdiscover-output
 I decided to write a simpler and more understandable tool.  It parses the output of the 
OFED command "ibnetdiscover" and generates an initial topology.conf file (which 
you may want to edit for readability).

Download the script https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Example: Running this script on our Infiniband network:

# slurmibtopology.sh
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches:

# IB switch no. 1: MF0;mell02:IS5030/U1
SwitchName=ibsw1 Nodes=i[001-028]
# IB switch no. 2: MF0;mell03:IS5030/U1
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
# IB switch no. 3: MF0;mell01:SX6036/U1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
# IB switch no. 4: MF0;mell04:SX6036/U1
# NOTICE: This switch has no attached nodes (empty hostlist)
SwitchName=ibsw4 Nodes=""
# IB switch no. 5: Mellanox 4036 # volt01
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
# IB switch no. 6: Mellanox 4036 # volt03
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
# IB switch no. 7: Mellanox 4036 # volt04
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
# IB switch no. 8: Mellanox 4036 # volt02
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]
# Merging all switches in a top-level spine switch
SwitchName=spineswitch Switches=ibsw[1-8]

It would be great if other sites could test this tool on their Infiniband 
network and report bugs or suggest improvements.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark




--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Ole Holm Nielsen


On 05/08/2017 08:27 PM, Jeffrey Frey wrote:

The primary problem I've had with ib2slurm is that it segfaults.  There's a bug 
in the ibnetdiscover library -- ib2slurm passes a NULL config pointer to the 
ibnd_discover_fabric() which is supposed to be okay according to the 
documentation, but that function actually requires a config structure to be 
passed to it.


I've forked and updated ib2slurm:


https://github.com/jtfrey/ib2slurm


There doesn't appear to be much movement on the original project, so I haven't 
put in a pull request against my fork.


It's great that you're trying to revive the outdated ib2slurm project!  The 
Slurm pages on topology.conf should point to your project instead of 
the seemingly dead ib2slurm.


That said, the concept of ib2slurm seems to be that you use 
ibnetdiscover to generate a cache file which is subsequently parsed by 
ib2slurm.  My slurmibtopology.sh script uses the same idea: Parse the 
output from ibnetdiscover and generate a topology.conf file.  For me 
personally it was easier to use awk than C for this task because awk 
supports associative arrays.


/Ole



On May 8, 2017, at 10:55 AM, Damien François  
wrote:

Hi

many thanks for the tool; it works flawlessly here.

I just patched it to send the output that does not belong in topology.conf to 
stderr, so I could simply redirect to topology.conf in an automated Slurm 
install procedure:

16c16
< echo Verify the Infiniband interface: >&2
---
> echo Verify the Infiniband interface:
18c18
< if $IBSTAT -l >&2
---
> if $IBSTAT -l
20c20
< echo Infiniband interface OK >&2
---
> echo Infiniband interface OK
22c22
< echo Infiniband interface NOT OK >&2
---
> echo Infiniband interface NOT OK
26c26
< cat <&2
---
> cat <

Sincerely

damien


On 07 May 2017, at 14:29, Ole Holm Nielsen  wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool 
"slurmibtopology.sh" for Slurm.

I had to create a Slurm topology.conf file and needed an automated way to get 
the correct node and switch Infiniband connectivity.  The manual page 
https://slurm.schedmd.com/topology.conf.html refers to an outdated tool 
ib2slurm.

Inspired by the script in 
https://unix.stackexchange.com/questions/255472/text-processing-building-a-slurm-topology-conf-file-from-ibnetdiscover-output
 I decided to write a simpler and more understandable tool.  It parses the output of the 
OFED command "ibnetdiscover" and generates an initial topology.conf file (which 
you may want to edit for readability).

Download the script https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Example: Running this script on our Infiniband network:

# slurmibtopology.sh
Verify the Infiniband interface:
mlx4_0
Infiniband interface OK

Generate the Slurm topology.conf file for Infiniband switches:

# IB switch no. 1: MF0;mell02:IS5030/U1
SwitchName=ibsw1 Nodes=i[001-028]
# IB switch no. 2: MF0;mell03:IS5030/U1
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
# IB switch no. 3: MF0;mell01:SX6036/U1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
# IB switch no. 4: MF0;mell04:SX6036/U1
# NOTICE: This switch has no attached nodes (empty hostlist)
SwitchName=ibsw4 Nodes=""
# IB switch no. 5: Mellanox 4036 # volt01
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
# IB switch no. 6: Mellanox 4036 # volt03
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
# IB switch no. 7: Mellanox 4036 # volt04
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
# IB switch no. 8: Mellanox 4036 # volt02
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]
# Merging all switches in a top-level spine switch
SwitchName=spineswitch Switches=ibsw[1-8]

It would be great if other sites could test this tool on their Infiniband 
network and report bugs or suggest improvements.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark


[slurm-dev] Creating init script in /etc/init.d while building from source

2017-05-09 Thread Dhiraj Reddy
Hi,

How do I create slurmd and slurmctld init scripts in /etc/init.d when
building and installing Slurm from source?

I think something needs to be done with the ./init.d.slurm files in the /etc
directory, but I don't know what to do.

I am using Ubuntu 16.04.

Thanks
Dhiraj