Jeff,

On 01-Oct-11 1:01 AM, Konz, Jeffrey (SSA Solution Centers) wrote:
> We encountered a problem when trying to run OpenMPI 1.5.4 with RoCE over a
> 10GbE fabric.
> 
> Got this run time error:
> 
> An invalid CPC name was specified via the btl_openib_cpc_include MCA
> parameter.
> 
>    Local host:                   atl3-14
>    btl_openib_cpc_include value: rdmacm
>    Invalid name:                 rdmacm
>    All possible valid names:     oob,xoob
> --------------------------------------------------------------------------
> [atl3-14:07184] mca: base: components_open: component btl / openib open 
> function failed
> [atl3-12:09178] mca: base: components_open: component btl / openib open 
> function failed
> 
> Used these options to mpirun:
>    "--mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm -mca 
> btl_openib_if_include mlx4_0:2"
> 
> We have a Mellanox LOM with two ports; the first is an IB port and the second
> is a 10GbE port.
> Running over the IB port works fine, as does TCP over the 10GbE port.
> 
> We built OpenMPI with the "--enable-openib-rdmacm" option.
> Our system has OFED 1.5.2 with librdmacm-1.0.13-1.
> 
> I noticed this output from configure script:
> checking rdma/rdma_cma.h usability... no
> checking rdma/rdma_cma.h presence... no
> checking for rdma/rdma_cma.h... no
> checking whether IBV_LINK_LAYER_ETHERNET is declared... yes
> checking if RDMAoE support is enabled... yes
> checking for infiniband/driver.h... yes
> checking if ConnectX XRC support is enabled... yes
> checking if dynamic SL is enabled... no
> checking if OpenFabrics RDMACM support is enabled... no
> 
> Are we missing a build option or a piece of software?
> Config.log and output from "ompi_info --all" attached.

You shouldn't need the "--enable-openib-rdmacm" option - rdmacm
support is enabled by default, provided librdmacm is found on
the machine.
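
Once configure does find it and you rebuild, you can check that the
rdmacm CPC was compiled in with something like this (my suggestion;
the exact output format may differ between OMPI versions):

   $ ompi_info --all | grep -i btl_openib_cpc

rdmacm should then show up among the valid values listed for
btl_openib_cpc_include.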

So the question is why the OMPI configure script didn't find it.
OMPI looks for the "rdma/rdma_cma.h" header. Do you have it on
your build machine?
The usual location of this file is /usr/include/rdma/rdma_cma.h.
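
A quick way to check (assuming a typical RPM-based OFED 1.5.x
install; adjust the package names for your distro):

   $ ls -l /usr/include/rdma/rdma_cma.h
   $ rpm -q librdmacm librdmacm-devel

Note that the header ships in the -devel package, so having
librdmacm-1.0.13-1 installed is not by itself enough for configure
to find rdma_cma.h.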

Another possible reason: it appears that OMPI's configure test includes
"rdma/rdma_cma.h" (the quoted form) rather than <rdma/rdma_cma.h>.
The quoted form searches the including file's directory first, and with
some compiler setups the test can fail even when the header is installed
in the system include path.

Please apply the following tiny fix to the OMPI source:

Index: ompi/config/ompi_check_openib.m4
===================================================================
--- ompi/config/ompi_check_openib.m4    (revision 25228)
+++ ompi/config/ompi_check_openib.m4    (working copy)
@@ -207,7 +207,7 @@
                      [AC_CHECK_LIB([rdmacm], [rdma_create_id],
                          [AC_MSG_CHECKING([for rdma_get_peer_addr])
                          $1_msg=no
-                         AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include "rdma/rdma_cma.h"
+                         AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <rdma/rdma_cma.h>
                                  ]], [[void *ret = (void*) rdma_get_peer_addr((struct rdma_cm_id*)0);]])],
                              [$1_have_rdmacm=1
                              $1_msg=yes])

Then re-run autogen.sh and configure, and check whether rdmacm is found.
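
If you want to reproduce configure's link test by hand, a minimal
standalone version of the same check (my sketch, not the exact
conftest that configure generates) is:

   /* rdmacm_test.c - mimics OMPI's configure check for rdma_get_peer_addr */
   #include <rdma/rdma_cma.h>

   int main(void)
   {
       /* only needs to compile and link, not to run */
       void *ret = (void *) rdma_get_peer_addr((struct rdma_cm_id *) 0);
       (void) ret;
       return 0;
   }

Compile with "gcc rdmacm_test.c -lrdmacm"; if that fails, configure's
rdmacm test will fail the same way.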

-- YK



> % ibv_devinfo
> hca_id: mlx4_0
>          transport:                      InfiniBand (0)
>          fw_ver:                         2.9.1000
>          node_guid:                      78e7:d103:0021:4464
>          sys_image_guid:                 78e7:d103:0021:4467
>          vendor_id:                      0x02c9
>          vendor_part_id:                 26438
>          hw_ver:                         0xB0
>          board_id:                       HP_0200000003
>          phys_port_cnt:                  2
>                  port:   1
>                          state:                  PORT_ACTIVE (4)
>                          max_mtu:                2048 (4)
>                          active_mtu:             2048 (4)
>                          sm_lid:                 34
>                          port_lid:               11
>                          port_lmc:               0x00
>                          link_layer:             IB
> 
>                  port:   2
>                          state:                  PORT_ACTIVE (4)
>                          max_mtu:                2048 (4)
>                          active_mtu:             1024 (3)
>                          sm_lid:                 0
>                          port_lid:               0
>                          port_lmc:               0x00
>                          link_layer:             Ethernet
> 
> % /sbin/ifconfig
> eth0      Link encap:Ethernet  HWaddr 78:E7:D1:21:44:60
>            inet addr:16.113.180.147  Bcast:16.113.183.255  Mask:255.255.252.0
>            inet6 addr: fe80::7ae7:d1ff:fe21:4460/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>            RX packets:1861763 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:1776402 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:712448939 (679.4 MiB)  TX bytes:994111004 (948.0 MiB)
>            Memory:fb9e0000-fba00000
> 
> eth2      Link encap:Ethernet  HWaddr 78:E7:D1:21:44:65
>            inet addr:10.10.0.147  Bcast:10.10.0.255  Mask:255.255.255.0
>            inet6 addr: fe80::78e7:d100:121:4465/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>            RX packets:8519814 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:8555715 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:12370127778 (11.5 GiB)  TX bytes:12372246315 (11.5 GiB)
> 
> ib0       Link encap:InfiniBand  HWaddr 
> 80:00:00:4D:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>            inet addr:192.168.0.147  Bcast:192.168.0.255  Mask:255.255.255.0
>            inet6 addr: fe80::7ae7:d103:21:4465/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:16384  Metric:1
>            RX packets:1989 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:208 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:256
>            RX bytes:275196 (268.7 KiB)  TX bytes:19202 (18.7 KiB)
> 
> lo        Link encap:Local Loopback
>            inet addr:127.0.0.1  Mask:255.0.0.0
>            inet6 addr: ::1/128 Scope:Host
>            UP LOOPBACK RUNNING  MTU:16436  Metric:1
>            RX packets:42224 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:42224 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:0
>            RX bytes:3115668 (2.9 MiB)  TX bytes:3115668 (2.9 MiB)
> 
> Thanks,
> 
> -Jeff
> 
> 
> /**********************************************************/
> /* Jeff Konz                          jeffrey.k...@hp.com */
> /* Solutions Architect                   HPC Benchmarking */
> /* Americas Shared Solutions Architecture (SSA)           */
> /* Hewlett-Packard Company                                */
> /* Office: 248-491-7480              Mobile: 248-345-6857 */
> /**********************************************************/
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
