Hello,

We recently ran into an issue on a cluster with dual-port ConnectX-3 cards. Each card has one port set up for IB and one port set up for 10GbE. When using the openib BTL, we hit scaling issues because the system tried to run over the 10GbE port rather than the IB port. This caused many RDMA errors (RDMA_CM_EVENT_ADDR_ERROR), which were somewhat hard to diagnose.

We were able to discover the issue by running with "--mca btl_base_verbose 30", which showed the ports being used. From there, we set up our Open MPI module to use "btl_openib_if_include mlx4_0:1", which puts Open MPI traffic over the proper port.

There wasn't much documentation on the issue, so I wanted to send it out to the mailing list.
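
In case it helps anyone else, here is a minimal sketch of how the setting can be applied. It assumes the parameter is passed through the environment (Open MPI reads MCA parameters from variables named OMPI_MCA_<param>), which is only one of several ways to set it. The application name ./my_app is a placeholder, and mlx4_0:1 is the IB port on our cards; the device and port will likely differ on other systems.

```shell
# Restrict the openib BTL to port 1 of the mlx4_0 device (the IB port in our setup).
# Open MPI picks up MCA parameters from environment variables named OMPI_MCA_<param>.
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"

# Equivalent one-off form on the command line (./my_app is a placeholder):
#   mpirun --mca btl_openib_if_include mlx4_0:1 ./my_app
#
# To see which ports the BTL selects, raise the verbosity as we did when diagnosing:
#   mpirun --mca btl_base_verbose 30 ./my_app

# Print the value so it is easy to confirm what a job will inherit.
echo "btl_openib_if_include=${OMPI_MCA_btl_openib_if_include}"
```

Setting the variable in an environment module means every job that loads the module gets the restriction without users having to change their mpirun command lines.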
Also, is there a reason that openib attempts to use the 10GbE interface as well? What is the cause for this as the default behavior? If this sort of configuration becomes more common, it may come up more often in the future.

Thank you,
Nathan Grodowitz
ITSD Linux R&D Scientific Platforms HPC Admin
Office: 865-576-4715
Cell: 865-347-4247