[slurm-dev] Re:
Ignore this - I discovered the problem. A couple of bpipe jobs from three weeks ago were zombied and eating all the memory.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective action, to build collective power, to achieve collective transformation, rooted in grief and rage but pointed towards vision and dreams."
- Patrice Cullors, *Black Lives Matter founder*

On 10 May 2017 at 10:57, Lachlan Musicman wrote:
> Running Slurm 16.05 on CentOS 7.3, I'm trying to start an interactive
> session with
>
> srun -w papr-expanded01 --pty --mem 8192 -t 06:00 /bin/bash --partition=expanded
> srun -w papr-expanded01 --pty -t 06:00 /bin/bash --partition=expanded
> srun -w papr-expanded01 --pty --mem 8192 /bin/bash --partition=expanded
> srun -w papr-expanded01 --pty /bin/bash --partition=expanded
>
> No matter what I change (including the user), I always get
>
> srun: error: Unable to allocate resources: Requested node configuration is not available
>
> which is the same as in the logs. There is no other debug message. Any
> hints on what I'm doing wrong?
>
> (Notes: the node has enough memory, its time is synced with the head node,
> and I am using users with access to the partitions.)
>
> cheers
> L.
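A note for anyone who lands on this thread with the same "Requested node configuration is not available" error: it is worth comparing what Slurm believes the node offers against what the job requests. A minimal sketch (the node name is the one from this thread; the sinfo format fields are standard):

# Show the node's configured and allocated resources as slurmctld sees them
scontrol show node papr-expanded01

# Hostname, configured memory, free memory, and state for that node
sinfo -N -n papr-expanded01 -o "%n %m %e %t"

If the node's configuration or current state (here, memory eaten by zombied jobs) cannot satisfy the request, srun refuses the allocation with exactly this error.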
[slurm-dev]
Running Slurm 16.05 on CentOS 7.3, I'm trying to start an interactive session with

srun -w papr-expanded01 --pty --mem 8192 -t 06:00 /bin/bash --partition=expanded
srun -w papr-expanded01 --pty -t 06:00 /bin/bash --partition=expanded
srun -w papr-expanded01 --pty --mem 8192 /bin/bash --partition=expanded
srun -w papr-expanded01 --pty /bin/bash --partition=expanded

No matter what I change (including the user), I always get

srun: error: Unable to allocate resources: Requested node configuration is not available

which is the same as in the logs. There is no other debug message. Any hints on what I'm doing wrong?

(Notes: the node has enough memory, its time is synced with the head node, and I am using users with access to the partitions.)

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective action, to build collective power, to achieve collective transformation, rooted in grief and rage but pointed towards vision and dreams."
- Patrice Cullors, *Black Lives Matter founder*
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
>> The primary problem I've had with ib2slurm is that it segfaults. There's a
>> bug in the ibnetdiscover library -- ib2slurm passes a NULL config pointer to
>> ibnd_discover_fabric(), which is supposed to be okay according to the
>> documentation, but that function actually requires a config structure to be
>> passed to it.
>>
>> I've forked and updated ib2slurm:
>>
>> https://github.com/jtfrey/ib2slurm
>>
>> There doesn't appear to be much movement on the original project, so I
>> haven't put in a pull request against my fork.
>
> It's great that you are trying to revive the outdated ib2slurm project! The
> Slurm pages on topology.conf should point to your project instead of the
> seemingly dead ib2slurm.
>
> That said, the concept of ib2slurm seems to be that you use ibnetdiscover to
> generate a cache file which is subsequently parsed by ib2slurm. My
> slurmibtopology.sh script uses the same idea: parse the output from
> ibnetdiscover and generate a topology.conf file. For me personally it was
> easier to use awk than C for this task because awk supports associative
> arrays.
>
> /Ole

Thanks. The original ib2slurm wasn't able to export Slurm-style compacted hostlists. Rather than implementing that myself, I've opted to (optionally) link ib2slurm against libslurm and make use of Slurm's native hostlist functionality to handle the compaction. I've also added a CMakeLists.txt so that CMake builds should be possible. All are present on the GitHub repo:

https://github.com/jtfrey/ib2slurm

> use ibnetdiscover to generate a cache file which is subsequently parsed by
> ib2slurm

ib2slurm uses the same library functions that the ibnetdiscover utility uses. Primarily, it produces an in-memory representation of the network topology. For large networks, or situations where a topology is routinely gathered and cached using the ibnetdiscover library, the cache file can be reused by ib2slurm. Otherwise, ib2slurm must walk the network to produce the in-memory topology each time (and that can be slow, especially if you're trying to get a node name mapping file debugged).

::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::
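A side note for anyone who wants the same hostlist compaction without writing C against libslurm: the scontrol utility exposes Slurm's hostlist functions directly from the shell (the node names below are placeholders):

$ scontrol show hostlist node001,node002,node003,node010
node[001-003,010]

$ scontrol show hostnames node[001-003,010]
node001
node002
node003
node010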
[slurm-dev] Re: Issue to startup slurm daemon on Compute nodes
Following on from Maik's response, it is worth mentioning the compat-glibc package for CentOS:

https://centos-packages.com/7/package/compat-glibc/
https://www.centos.org/forums/viewtopic.php?t=22250

Big get-out-of-jail card - I have never built any version of Slurm on a CentOS 7 system using the compat-glibc libraries!!!

On 9 May 2017 at 15:59, Maik Schmidt wrote:
> It means you have to build SLURM on the node with the oldest glibc that
> you might still have in your cluster. It will then also run on the ones
> with newer glibc versions, just not the other way around.
>
> Best,
> Maik
>
> On 09.05.2017 at 15:49, J. Smith wrote:
>> Hi,
>>
>> I have compiled Slurm v17.02.2 on master nodes running CentOS 7.
>> I have no issue starting Slurm on the master nodes, but I am unable to
>> start the daemon on the compute nodes running CentOS 6. It is looking for
>> GLIBC 2.14, which is not available on our compute nodes (using glibc-2.12).
>>
>> Error:
>> service slurm status
>> /home/share/slurm/17.02.2/bin/scontrol: /lib64/libc.so.6: version
>> `GLIBC_2.14' not found (required by /home/share/slurm/17.02.2/bin/scontrol)
>>
>> Does that mean that Slurm will only work across compute nodes running
>> CentOS 7 and not CentOS 6? Any suggestions?
>>
>> Thanks!
>
> --
> Maik Schmidt
> HPC Services
>
> Technische Universität Dresden
> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
> Willers-Bau A116
> D-01062 Dresden
> Telefon: +49 351 463-32836
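Before deploying a freshly built Slurm onto older nodes, one can check which glibc symbol versions a binary demands using standard binutils (the path below is the one from this thread):

$ objdump -T /home/share/slurm/17.02.2/bin/scontrol | grep -o 'GLIBC_[0-9.]*' | sort -Vu

If the highest version printed exceeds the compute node's glibc (2.12 in this thread), the binary will fail there with exactly the error quoted above.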
[slurm-dev] Re: Partition default job time limit
Yes. Use the DefaultTime option:

*DefaultTime*
    Run time limit used for jobs that don't specify a value. If not set, then MaxTime will be used. Format is the same as for MaxTime.

https://slurm.schedmd.com/slurm.conf.html

-Paul Edmon-

On 05/09/2017 05:35 AM, Georg Hildebrand wrote:
> Hi @here,
>
> Is it possible to have a default job time limit for a Slurm partition that
> is lower than the MaxTime?
>
> Kind regards
> Georg
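For example, a partition definition along these lines (partition and node names are hypothetical) gives jobs a one-hour default while still allowing up to a day on request:

# slurm.conf -- hypothetical names
PartitionName=batch Nodes=node[01-16] DefaultTime=01:00:00 MaxTime=1-00:00:00 State=UP

Jobs submitted without -t/--time then get 01:00:00; an explicit request may go up to MaxTime.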
[slurm-dev] Re: Issue to startup slurm daemon on Compute nodes
It means you have to build SLURM on the node with the oldest glibc that you might still have in your cluster. It will then also run on the ones with newer glibc versions, just not the other way around.

Best,
Maik

On 09.05.2017 at 15:49, J. Smith wrote:
> Hi,
>
> I have compiled Slurm v17.02.2 on master nodes running CentOS 7.
> I have no issue starting Slurm on the master nodes, but I am unable to
> start the daemon on the compute nodes running CentOS 6. It is looking for
> GLIBC 2.14, which is not available on our compute nodes (using glibc-2.12).
>
> Error:
> service slurm status
> /home/share/slurm/17.02.2/bin/scontrol: /lib64/libc.so.6: version
> `GLIBC_2.14' not found (required by /home/share/slurm/17.02.2/bin/scontrol)
>
> Does that mean that Slurm will only work across compute nodes running
> CentOS 7 and not CentOS 6? Any suggestions?
>
> Thanks!

--
Maik Schmidt
HPC Services

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
Willers-Bau A116
D-01062 Dresden
Telefon: +49 351 463-32836
[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.40
I'm announcing an updated version 0.40 of the node status tool "pestat" for Slurm. Download the tool (a short bash script) from https://ftp.fysik.dtu.dk/Slurm/pestat

Thanks to Daniel Letai for recommending better script coding styles. If your commands do not live in /usr/bin, please make appropriate changes in the CONFIGURE section at the top of the script.

New options have been added, as shown by the help information:

# pestat -h
Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s statelist] [-f] [-V] [-h]
where:
    -p partition: Select only this partition
    -u username: Print only this user
    -q qoslist: Print only QOS in the qoslist
    -s statelist: Print only nodes with a state in the statelist
    -f: Print only nodes that are flagged by * (unexpected load etc.)
    -h: Print this help information
    -V: Version information

I use "pestat -f" all the time because it prints and flags (in color) only the nodes which have an unexpected CPU load or node status. The -s option is useful for checking on possibly unusual node states, for example "pestat -s mix".

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark
[slurm-dev] RE: Announce: Infiniband topology tool "slurmibtopology.sh" updated version 0.21
Thanks Ole. I just rebuilt my InfiniBand fabric, and your script will surely help me.

Yaron

-----Original Message-----
From: Ole Holm Nielsen [mailto:ole.h.niel...@fysik.dtu.dk]
Sent: Tuesday, May 9, 2017 12:43 PM
To: slurm-dev
Subject: [slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh" updated version 0.21

I'm announcing an updated version 0.21 of an Infiniband topology tool "slurmibtopology.sh" for Slurm. The output may be used as a starting point for writing your own topology.conf file. Download the script from https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Thanks to Felip Moll for testing the script on a rather large IB network.

Motivation: I had to create a Slurm topology.conf file and needed an automated way to get the correct node and switch Infiniband connectivity. The manual page https://slurm.schedmd.com/topology.conf.html refers to an outdated tool, ib2slurm.

Version 0.21 reads switch-to-switch links and prints out lines with "Switches=..." for those switches with 0 compute node (HCA) links. The -c option will delete all the (possibly useful) comment lines.

Example: Running this script on our Infiniband network:

# ./slurmibtopology.sh -c
Verify the Infiniband interface: mlx4_0
Infiniband interface OK
Generate the Slurm topology.conf file for Infiniband switches.
Beware: The Switches= lines need to be reviewed and edited for correctness.
Read also https://slurm.schedmd.com/topology.html
SwitchName=ibsw1 Nodes=i[001-028]
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
SwitchName=ibsw4 Switches=ibsw[2-3,7]
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]

It would be great if other sites could test this tool on their Infiniband network and report bugs or suggest improvements.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark
[slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh" updated version 0.21
I'm announcing an updated version 0.21 of an Infiniband topology tool "slurmibtopology.sh" for Slurm. The output may be used as a starting point for writing your own topology.conf file. Download the script from https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Thanks to Felip Moll for testing the script on a rather large IB network.

Motivation: I had to create a Slurm topology.conf file and needed an automated way to get the correct node and switch Infiniband connectivity. The manual page https://slurm.schedmd.com/topology.conf.html refers to an outdated tool, ib2slurm.

Version 0.21 reads switch-to-switch links and prints out lines with "Switches=..." for those switches with 0 compute node (HCA) links. The -c option will delete all the (possibly useful) comment lines.

Example: Running this script on our Infiniband network:

# ./slurmibtopology.sh -c
Verify the Infiniband interface: mlx4_0
Infiniband interface OK
Generate the Slurm topology.conf file for Infiniband switches.
Beware: The Switches= lines need to be reviewed and edited for correctness.
Read also https://slurm.schedmd.com/topology.html
SwitchName=ibsw1 Nodes=i[001-028]
SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
SwitchName=ibsw4 Switches=ibsw[2-3,7]
SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]

It would be great if other sites could test this tool on their Infiniband network and report bugs or suggest improvements.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark
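For completeness: the generated topology.conf only takes effect once the topology plugin is enabled in slurm.conf, roughly as follows (topology.conf must sit in the same directory as slurm.conf; restart slurmctld afterwards):

# slurm.conf
TopologyPlugin=topology/tree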
[slurm-dev] Partition default job time limit
Hi @here,

Is it possible to have a default job time limit for a Slurm partition that is lower than the MaxTime?

Kind regards
Georg
[slurm-dev] Inconsistent job timings with openMPI's mpirun
I am experiencing inconsistent timings running the exact same job file on a small cluster using Slurm 14.11.8 and OpenMPI 1.8.8. The test case I am running is the device-device latency test from the osu-micro-benchmarks-5.0 suite. I use the following Slurm script file:

#!/bin/bash
#
#SBATCH --job-name=CudaRDMA
#SBATCH --nodes=2
#SBATCH --time=10:00
#SBATCH --gres=gpu:1

module load openmpi/1.8.8
module load cuda

export OMPI_MCA_btl_openib_want_cuda_gdr=1
export OMPI_MCA_btl_openib_cuda_rdma_limit=65537

time mpirun -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit 65537 -np 2 -npernode 1 -bind-to core -report-bindings -x CUDA_VISIBLE_DEVICES=0 /home/janseng/tests/osu-micro-benchmarks-5.0/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

If I run this batch file via sbatch a couple of times, I get very different timings:

real    0m24.996s
user    0m5.280s
sys     0m1.328s

real    4m24.126s
user    0m5.200s
sys     0m1.191s

real    2m40.372s
user    0m5.187s
sys     0m1.188s

These are just a few of my timings; the majority lie between 2 and 4 minutes. However, the 25s run should be the most realistic of the timings, especially considering the nearly constant user and sys times. The actual latency test results are nearly identical and don't show any problems in the communication. The issue is also not limited to this test case but frequently occurs in other situations as well.

Has anybody else encountered issues like this, and can anyone point me in the right direction?

Gunnar
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
On 2017-05-09 10:27, Ole Holm Nielsen wrote:
> On 05/09/2017 09:14 AM, Janne Blomqvist wrote:
>> On 2017-05-07 15:29, Ole Holm Nielsen wrote:
>>> I'm announcing an initial version 0.1 of an Infiniband topology tool
>>> "slurmibtopology.sh" for Slurm.
>>
>> I have also created one, at https://github.com/jabl/ibtopotool
>>
>> You need the python networkx library (python-networkx package on CentOS &
>> Ubuntu, or install via pip). Run with the --help option to get some usage
>> instructions.
>>
>> In addition to generating a Slurm topology.conf, it can also generate
>> Graphviz dot files for visualization.
>
> Thanks for providing this tool to the Slurm community. It seems that tools
> for generating topology.conf have been developed in many places, probably
> because it's an important task.
>
> I installed python-networkx 1.8.1-12.el7 from EPEL on our CentOS 7.3 system
> and then executed ibtopotool.py, but it gives an error message:
>
> # ./ibtopotool.py
> Traceback (most recent call last):
>   File "./ibtopotool.py", line 216, in <module>
>     graph = parse_ibtopo(args[0], options.shortlabels)
> IndexError: list index out of range
>
> Could you help solve this?

Duh.. I just pushed a fix, thanks for reporting.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
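For the record: the traceback shows the script reading its input file from args[0], so it presumably expects a saved ibnetdiscover dump as its argument. A guessed invocation, not confirmed in this thread (check --help):

$ ibnetdiscover > ib.topo
$ ./ibtopotool.py ib.topo > topology.conf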
[slurm-dev] Re: Creating init script in /etc/init.d while building from source
Hi,

On Tue, May 09, 2017 at 12:17:11AM -0700, Janne Blomqvist wrote:
> for Ubuntu 16.04 you should be using the systemd service files instead of
> init.d scripts. They are part of the rpm file when building for red hat
> based systems, don't know about ubuntu; but presumably you can find them
> somewhere in the source tree.

Maybe you can adapt these to your environment:

https://anonscm.debian.org/cgit/collab-maint/slurm-llnl.git/tree/debian/slurmctld.service
https://anonscm.debian.org/cgit/collab-maint/slurm-llnl.git/tree/debian/slurmd.service
https://anonscm.debian.org/cgit/collab-maint/slurm-llnl.git/tree/debian/slurmdbd.service

Best regards
--
Gennaro Oliva
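In case those Debian links ever rot, a minimal slurmd unit looks roughly like the sketch below (the ExecStart path and PIDFile location are assumptions; adjust to your install prefix):

# /etc/systemd/system/slurmd.service -- minimal sketch, paths assumed
[Unit]
Description=Slurm node daemon
After=network.target munge.service

[Service]
Type=forking
ExecStart=/usr/local/sbin/slurmd
PIDFile=/var/run/slurmd.pid
KillMode=process

[Install]
WantedBy=multi-user.target

Enable it with "systemctl enable slurmd" and start it with "systemctl start slurmd".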
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
On 05/09/2017 09:14 AM, Janne Blomqvist wrote:
> On 2017-05-07 15:29, Ole Holm Nielsen wrote:
>> I'm announcing an initial version 0.1 of an Infiniband topology tool
>> "slurmibtopology.sh" for Slurm.
>
> I have also created one, at https://github.com/jabl/ibtopotool
>
> You need the python networkx library (python-networkx package on CentOS &
> Ubuntu, or install via pip). Run with the --help option to get some usage
> instructions.
>
> In addition to generating a Slurm topology.conf, it can also generate
> Graphviz dot files for visualization.

Thanks for providing this tool to the Slurm community. It seems that tools for generating topology.conf have been developed in many places, probably because it's an important task.

I installed python-networkx 1.8.1-12.el7 from EPEL on our CentOS 7.3 system and then executed ibtopotool.py, but it gives an error message:

# ./ibtopotool.py
Traceback (most recent call last):
  File "./ibtopotool.py", line 216, in <module>
    graph = parse_ibtopo(args[0], options.shortlabels)
IndexError: list index out of range

Could you help solve this?

Thanks,
Ole
[slurm-dev] Re: Creating init script in /etc/init.d while building from source
On 2017-05-09 09:09, Dhiraj Reddy wrote:
> Hi,
> How do I create slurmd and slurmctld init scripts in the directory
> /etc/init.d while building and installing Slurm from source? I think
> something should be done with the file ./init.d.slurm in the etc directory,
> but I don't know what to do. I am using Ubuntu 16.04.
> Thanks
> Dhiraj

Hi,

for Ubuntu 16.04 you should be using the systemd service files instead of init.d scripts. They are part of the rpm file when building for Red Hat based systems; I don't know about Ubuntu, but presumably you can find them somewhere in the source tree.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
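Whether a given tarball ships such files can be checked directly; the *.service.in template names are an assumption about recent source layouts:

$ tar xjf slurm-17.02.2.tar.bz2
$ find slurm-17.02.2/etc -name '*.service*' -o -name 'init.d.*'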
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
On 2017-05-07 15:29, Ole Holm Nielsen wrote:
> I'm announcing an initial version 0.1 of an Infiniband topology tool
> "slurmibtopology.sh" for Slurm.

I have also created one, at https://github.com/jabl/ibtopotool

You need the python networkx library (python-networkx package on CentOS & Ubuntu, or install via pip). Run with the --help option to get some usage instructions.

In addition to generating a Slurm topology.conf, it can also generate Graphviz dot files for visualization.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
Hi Damien,

Thanks for your positive feedback! I'll be posting an updated script soon which works better for multilevel IB networks.

I would however warn anyone against using the output of slurmibtopology.sh directly in an automated procedure. I strongly recommend a manual step as well, in a procedure like:

1. Use slurmibtopology.sh to generate topology.conf. It should correctly show the leaf switches and their links to compute nodes.

2. Review your 2nd and 3rd level network topology as discussed in https://slurm.schedmd.com/topology.html. Heed in particular this statement: "As a practical matter, listing every switch connection definitely results in a slower scheduling algorithm for Slurm to optimize job placement. The application performance may achieve little benefit from such optimization. Listing the leaf switches with their nodes plus one top level switch should result in good performance for both applications and Slurm."

3. In the generated topology.conf you should select only one top-level switch and delete the others.

4. Copy the edited topology.conf to your cluster.

/Ole

On 05/08/2017 04:56 PM, Damien François wrote:
> Hi
>
> many thanks for the tool, it works flawlessly here. I just patched it to
> send the output that does not belong in topology.conf to stderr, so I could
> simply redirect to topology.conf in an automated Slurm install procedure:
>
> 16c16
> < echo Verify the Infiniband interface: >&2
> ---
> > echo Verify the Infiniband interface:
> 18c18
> < if $IBSTAT -l >&2
> ---
> > if $IBSTAT -l
> 20c20
> < echo Infiniband interface OK >&2
> ---
> > echo Infiniband interface OK
> 22c22
> < echo Infiniband interface NOT OK >&2
> ---
> > echo Infiniband interface NOT OK
> 26c26
> < cat <<EOF >&2
> ---
> > cat <<EOF
>
> Sincerely
> damien
>
> On 07 May 2017, at 14:29, Ole Holm Nielsen wrote:
>
>> I'm announcing an initial version 0.1 of an Infiniband topology tool
>> "slurmibtopology.sh" for Slurm.
>>
>> I had to create a Slurm topology.conf file and needed an automated way to
>> get the correct node and switch Infiniband connectivity. The manual page
>> https://slurm.schedmd.com/topology.conf.html refers to an outdated tool,
>> ib2slurm.
>>
>> Inspired by the script in
>> https://unix.stackexchange.com/questions/255472/text-processing-building-a-slurm-topology-conf-file-from-ibnetdiscover-output
>> I decided to write a simpler and more understandable tool. It parses the
>> output of the OFED command "ibnetdiscover" and generates an initial
>> topology.conf file (which you may want to edit for readability).
>>
>> Download the script: https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh
>>
>> Example: Running this script on our Infiniband network:
>>
>> # slurmibtopology.sh
>> Verify the Infiniband interface: mlx4_0
>> Infiniband interface OK
>> Generate the Slurm topology.conf file for Infiniband switches:
>> # IB switch no. 1: MF0;mell02:IS5030/U1
>> SwitchName=ibsw1 Nodes=i[001-028]
>> # IB switch no. 2: MF0;mell03:IS5030/U1
>> SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
>> # IB switch no. 3: MF0;mell01:SX6036/U1
>> SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
>> # IB switch no. 4: MF0;mell04:SX6036/U1
>> # NOTICE: This switch has no attached nodes (empty hostlist)
>> SwitchName=ibsw4 Nodes=""
>> # IB switch no. 5: Mellanox 4036 # volt01
>> SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
>> # IB switch no. 6: Mellanox 4036 # volt03
>> SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
>> # IB switch no. 7: Mellanox 4036 # volt04
>> SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
>> # IB switch no. 8: Mellanox 4036 # volt02
>> SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]
>> # Merging all switches in a top-level spine switch
>> SwitchName=spineswitch Switches=ibsw[1-8]
>>
>> It would be great if other sites could test this tool on their Infiniband
>> network and report bugs or suggest improvements.
>>
>> --
>> Ole Holm Nielsen
>> Department of Physics, Technical University of Denmark

--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620
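With a patch like Damien's applied, step 1 of the procedure above can run unattended, since diagnostics no longer pollute stdout (the destination path is an assumption):

$ ./slurmibtopology.sh -c > /etc/slurm/topology.conf   # messages appear on stderr

Remember the manual review in steps 2-3 before copying the file out to the cluster.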
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
On 05/08/2017 08:27 PM, Jeffrey Frey wrote:
> The primary problem I've had with ib2slurm is that it segfaults. There's a
> bug in the ibnetdiscover library -- ib2slurm passes a NULL config pointer to
> ibnd_discover_fabric(), which is supposed to be okay according to the
> documentation, but that function actually requires a config structure to be
> passed to it.
>
> I've forked and updated ib2slurm:
>
> https://github.com/jtfrey/ib2slurm
>
> There doesn't appear to be much movement on the original project, so I
> haven't put in a pull request against my fork.

It's great that you are trying to revive the outdated ib2slurm project! The Slurm pages on topology.conf should point to your project instead of the seemingly dead ib2slurm.

That said, the concept of ib2slurm seems to be that you use ibnetdiscover to generate a cache file which is subsequently parsed by ib2slurm. My slurmibtopology.sh script uses the same idea: parse the output from ibnetdiscover and generate a topology.conf file. For me personally it was easier to use awk than C for this task because awk supports associative arrays.

/Ole

> On May 8, 2017, at 10:55 AM, Damien François wrote:
>
>> Hi
>>
>> many thanks for the tool, it works flawlessly here. I just patched it to
>> send the output that does not belong in topology.conf to stderr, so I
>> could simply redirect to topology.conf in an automated Slurm install
>> procedure:
>>
>> 16c16
>> < echo Verify the Infiniband interface: >&2
>> ---
>> > echo Verify the Infiniband interface:
>> 18c18
>> < if $IBSTAT -l >&2
>> ---
>> > if $IBSTAT -l
>> 20c20
>> < echo Infiniband interface OK >&2
>> ---
>> > echo Infiniband interface OK
>> 22c22
>> < echo Infiniband interface NOT OK >&2
>> ---
>> > echo Infiniband interface NOT OK
>> 26c26
>> < cat <<EOF >&2
>> ---
>> > cat <<EOF
>>
>> Sincerely
>> damien
>>
>> On 07 May 2017, at 14:29, Ole Holm Nielsen wrote:
>>
>>> I'm announcing an initial version 0.1 of an Infiniband topology tool
>>> "slurmibtopology.sh" for Slurm.
>>>
>>> I had to create a Slurm topology.conf file and needed an automated way to
>>> get the correct node and switch Infiniband connectivity. The manual page
>>> https://slurm.schedmd.com/topology.conf.html refers to an outdated tool,
>>> ib2slurm.
>>>
>>> Inspired by the script in
>>> https://unix.stackexchange.com/questions/255472/text-processing-building-a-slurm-topology-conf-file-from-ibnetdiscover-output
>>> I decided to write a simpler and more understandable tool. It parses the
>>> output of the OFED command "ibnetdiscover" and generates an initial
>>> topology.conf file (which you may want to edit for readability).
>>>
>>> Download the script: https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh
>>>
>>> Example: Running this script on our Infiniband network:
>>>
>>> # slurmibtopology.sh
>>> Verify the Infiniband interface: mlx4_0
>>> Infiniband interface OK
>>> Generate the Slurm topology.conf file for Infiniband switches:
>>> # IB switch no. 1: MF0;mell02:IS5030/U1
>>> SwitchName=ibsw1 Nodes=i[001-028]
>>> # IB switch no. 2: MF0;mell03:IS5030/U1
>>> SwitchName=ibsw2 Nodes=finbul,i[029-051],niflfs[1-2],niflopt1
>>> # IB switch no. 3: MF0;mell01:SX6036/U1
>>> SwitchName=ibsw3 Nodes=g[081-100,102-112],h[001-002]
>>> # IB switch no. 4: MF0;mell04:SX6036/U1
>>> # NOTICE: This switch has no attached nodes (empty hostlist)
>>> SwitchName=ibsw4 Nodes=""
>>> # IB switch no. 5: Mellanox 4036 # volt01
>>> SwitchName=ibsw5 Nodes=g[005-008,013-014,016,021-024,029-032,037-040]
>>> # IB switch no. 6: Mellanox 4036 # volt03
>>> SwitchName=ibsw6 Nodes=g[045-048,053-056,061-064,069-072,077-080]
>>> # IB switch no. 7: Mellanox 4036 # volt04
>>> SwitchName=ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076]
>>> # IB switch no. 8: Mellanox 4036 # volt02
>>> SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036]
>>> # Merging all switches in a top-level spine switch
>>> SwitchName=spineswitch Switches=ibsw[1-8]
>>>
>>> It would be great if other sites could test this tool on their Infiniband
>>> network and report bugs or suggest improvements.
>>>
>>> --
>>> Ole Holm Nielsen
>>> Department of Physics, Technical University of Denmark
[slurm-dev] Creating init script in /etc/init.d while building from source
Hi,

How do I create slurmd and slurmctld init scripts in the directory /etc/init.d while building and installing Slurm from source? I think something should be done with the file ./init.d.slurm in the etc directory, but I don't know what to do. I am using Ubuntu 16.04.

Thanks
Dhiraj