We determined that this issue was caused by the slurm user not having an 
unlimited memlock limit at the time the slurm service started during boot. The 
work-around was simply to restart slurm after boot, at which point the new 
unlimited setting allowed InfiniBand usage. Moving the startup script to 
runlevel 3 did not fix the issue.
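A quick way to spot this kind of mismatch is to compare the locked-memory limit in an interactive shell with the one a slurm-launched task inherits (a sketch; the srun invocation is illustrative and assumes a working slurm setup):

```shell
# Locked-memory limit in the current shell; InfiniBand memory
# registration generally needs this to be "unlimited".
ulimit -l

# For comparison, run the same check under slurm (illustrative;
# adjust the job options to your site):
#   srun -N1 bash -c 'ulimit -l'
# If slurmd was started before the limit was raised, the srun value
# will be lower than the interactive one.
```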

The solution is outlined in question #58 of the Slurm FAQ:
http://slurm.schedmd.com/faq.html#mpi_perf

I discovered this fix almost by accident. While trying an install on CentOS 7, 
I saw the same error message when using my own custom-built RPMs of 
openmpi-1.10. With a packaged version, the error message correctly identified 
the memlock issue. After adding the correct setting to a file in 
/etc/security/limits.d/ , my custom RPMs worked without the error message.
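For reference, the entry I mean looks along these lines (the file name is my own choice, and whether slurm picks it up at boot depends on how slurmd is started; see limits.conf(5) for the syntax):

```
# /etc/security/limits.d/slurm-memlock.conf  (hypothetical file name)
# Let the slurm user lock unlimited memory, as InfiniBand
# memory registration requires.
slurm  soft  memlock  unlimited
slurm  hard  memlock  unlimited
```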

Hope this helps anyone who has had the same issue.

--
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University
Ph: 503-494-6731


-----Original Message-----
From: Nathan Smith 
Sent: Monday, April 18, 2016 11:13 AM
To: us...@open-mpi.org
Subject: openib BTL not working via slurm after update

We recently updated and rebooted Infiniband-attached nodes, and now when trying 
to schedule MPI jobs with slurm, we are seeing the following:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used on a 
specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

   Local host:           node-x
   Local device:         mlx5_0
   Local port:           1
   CPCs attempted:       udcm
--------------------------------------------------------------------------

This worked before the reboots, and the InfiniBand network itself is fine. 
If we invoke the same job directly with mpirun on the same nodes, we do not 
receive the message (meaning the openib BTL works). Some IB-related packages 
were updated (e.g. the rdma metapackage for CentOS 6.7).

What I'm hoping for is some guidance on which components are involved here and 
the possible causes of slurm not being able to use the openib BTL (a post to 
the slurm list did not get anywhere). I could locate very little documentation 
on what CPCs are, what udcm is, or how to troubleshoot them.

We are using openmpi 1.10.2, built with slurm and PMI support.

--
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University
