Hello,

We recently ran into an issue on a cluster with dual-port ConnectX-3 cards.
Each card has one port set up for IB and the other set up for 10GbE. We ran
into scaling issues with the openib BTL because it tried to run over the 10GbE
port rather than the IB port. This caused lots of RDMA errors
(RDMA_CM_EVENT_ADDR_ERROR), which were somewhat hard to diagnose. We tracked
down the issue with "--mca btl_base_verbose 30", which showed the ports being
used. From there, we set up our Open MPI module to use
"btl_openib_if_include mlx4_0:1", which put Open MPI traffic over the proper
port. There wasn't much documentation on the issue, so I wanted to send it out
to the mailing list.
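
For anyone hitting the same problem, the port selection described above could
be scripted roughly as follows. This is only a sketch: it assumes the text fed
to it has the general shape of `ibv_devinfo` output, with a link_layer line
under each port. The SAMPLE text is illustrative and trimmed, not captured from
the cluster described above, and the helper names infiniband_ports and mca_args
are hypothetical.

```python
import re

# Illustrative sample only: trimmed text in the general shape that
# `ibv_devinfo` prints. It is NOT real output from the cluster in question.
SAMPLE = """\
hca_id: mlx4_0
        port:   1
                state:          PORT_ACTIVE (4)
                link_layer:     InfiniBand
        port:   2
                state:          PORT_ACTIVE (4)
                link_layer:     Ethernet
"""


def infiniband_ports(devinfo_text):
    """Return 'device:port' strings for ports whose link layer is InfiniBand."""
    ports = []
    device = None
    port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            # A new device section starts; forget the previous port.
            device, port = m.group(1), None
            continue
        m = re.match(r"port:\s*(\d+)", line)
        if m:
            port = m.group(1)
            continue
        m = re.match(r"link_layer:\s*(\S+)", line)
        if m and device and port and m.group(1) == "InfiniBand":
            ports.append(f"{device}:{port}")
    return ports


def mca_args(ports):
    """Build mpirun arguments restricting the openib BTL to the given ports."""
    return ["--mca", "btl_openib_if_include", ",".join(ports)]


if __name__ == "__main__":
    print(" ".join(mca_args(infiniband_ports(SAMPLE))))
```

In practice the text would come from running ibv_devinfo on a compute node
rather than from a hard-coded string, and the resulting arguments would be
appended to the mpirun command line or set in the module's MCA parameters.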

Also, is there a reason that the openib BTL attempts to use the 10GbE port as
well? Why is this the default behavior? If this sort of configuration becomes
more common, the problem may come up more often in the future.


Thank you,

Nathan Grodowitz
ITSD Linux R&D Scientific Platforms 
HPC Admin
Office:865-576-4715
Cell:865-347-4247
