We determined that this issue was caused by the slurm user not having an unlimited memlock limit when the slurm service started. The workaround was simply to restart slurm after boot, at which point the new unlimited setting took effect and allowed InfiniBand usage. Moving the startup script to runlevel 3 did not fix the issue.
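For anyone wanting to check the locked-memory limit that a process actually inherited, here is a minimal sketch in Python. It uses only the standard resource module; the helper names memlock_is_unlimited and describe_memlock are made up for illustration and are not part of Slurm or Open MPI. It reports the limit of whatever process runs it, so the result depends on where it is launched (an interactive shell may differ from a process started by the slurm service at boot).

```python
import resource


def memlock_is_unlimited(limits):
    """Return True if both the soft and hard memlock limits are unlimited.

    `limits` is a (soft, hard) tuple as returned by resource.getrlimit().
    """
    soft, hard = limits
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


def describe_memlock(limits):
    """Format a (soft, hard) memlock tuple in a human-readable way."""
    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    soft, hard = limits
    return f"soft={fmt(soft)}, hard={fmt(hard)}"


if __name__ == "__main__":
    # RLIMIT_MEMLOCK is the limit that `ulimit -l` reports in a shell
    # (the shell shows it in kilobytes, getrlimit() returns bytes).
    current = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("memlock:", describe_memlock(current))
    if not memlock_is_unlimited(current):
        print("memlock is limited in this process")
```

Running it once from a login shell and once from within a scheduled job should show whether the two environments see different limits, which is the situation described above.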
The solution is outlined in #58 in the Slurm FAQ http://slurm.schedmd.com/faq.html#mpi_perf

I discovered this fix almost by accident. When trying an install on Centos 7, I observed the same error message when using my own custom-built RPMs of openmpi-1.10. When I used a packaged version, the error message correctly identified the memlock issue. After adding the correct setting to a file in /etc/security/limits.d/ , using my custom RPMs worked without the error message.

Hope this helps anyone who has had the same issue.

--
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University
Ph: 503-494-6731

-----Original Message-----
From: Nathan Smith
Sent: Monday, April 18, 2016 11:13 AM
To: us...@open-mpi.org
Subject: openib MTL not working via slurm after update

We recently updated and rebooted Infiniband-attached nodes, and now when trying to schedule MPI jobs with slurm, we are seeing the following:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:      node-x
  Local device:    mlx5_0
  Local port:      1
  CPCs attempted:  udcm
--------------------------------------------------------------------------

This worked before reboots. The infiniband network itself is fine. However, if we invoke the same job directly using mpirun on the same nodes, we do not receive the message (meaning the openib BTL works). Some IB-related packages were updated (e.g. the rdma metapackage for Centos6.7).

What I'm hoping for is some guidance on what components are involved here and the possible causes of slurm not being able to use the openib BTL (a post to the slurm list did not get anywhere). There is very little documentation I could locate on what CPCs are, or what udcm is, and how to troubleshoot it.
Using openmpi 1.10.2 with slurm and PMI support configured in.