First of all, I am new to this mailing list (and also to SLURM and
InfiniBand), so I will start by introducing myself.
My name is Miguel, my background is Physics Engineering, and I work at
the Institute of Engineering of Coimbra (ISEC <www.isec.pt>), located in
Coimbra <https://en.wikipedia.org/wiki/Coimbra>, a small town in
Portugal <https://en.wikipedia.org/wiki/Portugal> (a small European
country, planted by the sea). Recently, I was made responsible for
installing a small HPC cluster in my institution.
The cluster has a Login Node, a Head Node (with 20 cores, to which a
DAS unit is attached), and 20 Compute Nodes (24 cores each). The Head
Node and the Compute Nodes have Mellanox InfiniBand cards, and the
cluster has a single Mellanox IB switch. For the OS, I have chosen
CentOS 7.2. For the workload manager, I could have chosen TORQUE,
possibly with MAUI, which is a solution that I have already
implemented on commodity hardware. However, I must also implement an
accounting system, and Gold is no longer maintained. For this reason, I
have opted to install SLURM, which I have never worked with, not even as
an HPC user.
That said, please be gentle with me, and help me if you can. :)
My current problem (later, I am sure I will have many others) is with
the installation of SLURM along with the Mellanox OFED stack. During the
configure process I get the following warning:
configure: WARNING: unable to locate ofed installation
Apart from this, SLURM compiles with no errors, and make check reports
that the compiled software has passed all tests. Also, make install does
what it is supposed to do, without errors. So, I could proceed as is,
and try to understand/solve the problem later. At least I would have a
working system sooner. However, I do not want to proceed without first
trying to understand and resolve the cause of the warning message.
Digging into the configure script, I found that the configuration
process searches for the OFED stack in the /usr and /usr/local
directories, tests for the existence of mad.h, and uses the libmad and
libumad libraries for compilation. Digging into the directory tree, I
found that the Mellanox OFED stack uses ib_mad.h, and the devel files
are installed in /usr/src/ofa_kernel-3.2/. So, I made the naive
assumption that changing the configure script to test for ib_mad.h and
to use the libibmad and libibumad libraries, and passing the parameter
--with-ofed=/usr/src/ofa_kernel-3.2, would solve the problem. But, of
course (or not), it did not.
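To make the check described above easier to reproduce, here is a small
shell sketch that looks for the same kind of files under a few candidate
prefixes. It is not the logic from the SLURM configure script itself.
The include/infiniband subdirectory, the lib64 and lib directory names,
and the helper name check_ofed_prefix are assumptions made for
illustration only; the prefixes and the libibmad/libibumad library
names are the ones mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SLURM): succeeds if PREFIX looks like
# a user-space OFED install, and prints the library directory it found.
# Assumed layout: PREFIX/include/infiniband/mad.h, plus libibmad and
# libibumad shared objects under PREFIX/lib64 or PREFIX/lib.
check_ofed_prefix() {
    prefix=$1
    # The header must exist, otherwise this prefix is rejected.
    [ -f "$prefix/include/infiniband/mad.h" ] || return 1
    for libdir in lib64 lib; do
        # Both libraries must be present in the same directory.
        if ls "$prefix/$libdir"/libibmad.so* >/dev/null 2>&1 &&
           ls "$prefix/$libdir"/libibumad.so* >/dev/null 2>&1; then
            echo "$prefix/$libdir"
            return 0
        fi
    done
    return 1
}

# Try the two default prefixes and the directory passed to --with-ofed.
for p in /usr /usr/local /usr/src/ofa_kernel-3.2; do
    if dir=$(check_ofed_prefix "$p"); then
        echo "$p: header and libraries found (libraries in $dir)"
    else
        echo "$p: header or libraries not found"
    fi
done
```

If none of the prefixes pass, that would at least show which piece
(header or libraries) is missing from the location being tested.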
So, my first question is: how important is it to compile SLURM with OFED
support? This is probably a silly question, because if it were not that
relevant, SLURM would not have this option, but I would like to
understand the differences between the two options (with OFED support,
and without).
My second question is: how can I compile SLURM with support for the
Mellanox OFED stack?
Best regards,
Miguel