First of all, I am new to this mailing list (and also to SLURM, and InfiniBand), so I will start by introducing myself.

My name is Miguel, my background is in Physics Engineering, and I work at the Institute of Engineering of Coimbra (ISEC <www.isec.pt>), located in Coimbra <https://en.wikipedia.org/wiki/Coimbra>, a small town in Portugal <https://en.wikipedia.org/wiki/Portugal> (a small European country, planted by the sea). Recently, I was made responsible for installing a small HPC cluster at my institution.

The cluster has a Login Node, a Head Node (with 20 cores, to which DAS storage is attached), and 20 Compute Nodes (24 cores each). The Head Node and the Compute Nodes have Mellanox InfiniBand cards, and the cluster has a single Mellanox IB switch. For the OS, I chose CentOS 7.2. For the workload manager, I could have chosen TORQUE, possibly with MAUI, a solution that I have already implemented on commodity hardware. However, I must also implement an accounting system, and Gold is no longer maintained. For this reason, I have opted to install SLURM, which I have never worked with, not even as an HPC user.

This said, please be gentle with me, and help me if you can. :)


My current problem (later, I am sure I will have many others) is with the installation of SLURM along with the Mellanox OFED stack. During the configure process I get the following warning:

configure: WARNING: unable to locate ofed installation

Apart from this, SLURM compiles with no errors, and make check reports that the compiled software has passed all tests. Also make install does what it is supposed to do, without errors. So, I could proceed as is, and try to understand/solve the problem later. At least I could have a working system sooner. However, I do not want to proceed without first trying to understand/solve what is involved with the warning message.

Digging in the configure file, I found that the configuration process searches for the OFED stack in the /usr and /usr/local directories, tests for the existence of mad.h, and uses the libmad and libumad libraries for compilation. Digging in the directory tree, I found that the Mellanox OFED stack uses ib_mad.h, and the devel files are installed in /usr/src/ofa_kernel-3.2/. So, I made the naive assumption that changing the configure file to test for ib_mad.h, and using the libibmad and libibumad libraries, and giving the parameter --with-ofed=/usr/src/ofa_kernel-3.2, would solve the problem. But, of course (or not), it did not.
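
The header search described above can be reproduced outside of configure with a small script. This is only a sketch: the include/infiniband sub-directory, the HEADER_RELPATH constant, and the function name find_ofed_prefix are assumptions made for illustration and are not taken from SLURM's configure script. It reports only whether a mad.h file exists under each candidate prefix, and says nothing about whether the libraries would link.

```python
from pathlib import Path

# Assumption: the header is expected under <prefix>/include/infiniband/.
# This relative path is illustrative, not copied from SLURM's configure script.
HEADER_RELPATH = Path("include") / "infiniband" / "mad.h"


def find_ofed_prefix(prefixes, header_relpath=HEADER_RELPATH):
    """Return the first prefix that contains the header file, or None."""
    for prefix in prefixes:
        if (Path(prefix) / header_relpath).is_file():
            return Path(prefix)
    return None


if __name__ == "__main__":
    # The two default locations, plus the directory passed via --with-ofed.
    candidates = ["/usr", "/usr/local", "/usr/src/ofa_kernel-3.2"]
    found = find_ofed_prefix(candidates)
    if found is None:
        print("mad.h not found under:", ", ".join(candidates))
    else:
        print("mad.h found under:", found)
```

If none of the candidates match, the header is either missing or installed under a different layout than the one assumed here.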

So, my first question is: how important is it to compile SLURM with OFED support? This is probably a silly question, because if it were not relevant, SLURM would not have this option, but I would like to understand the difference between the two builds (with OFED support, and without).

My second question is: how can I compile SLURM with support for the Mellanox OFED stack?


Best regards,
Miguel
