Nathan:  Thanks for providing the debug flags.  I've attached the
output (NetPIPE.debug1), which shows that for RoCE
udcm_component_query() will always fail.  Can someone confirm that
udcm is not supported on RoCE?  When I change the test to force its
use, it does not work (NetPIPE.debug2).

[hero35][[38845,1],0][connect/btl_openib_connect_udcm.c:452:udcm_component_query]
UD CPC only supported on InfiniBand; skipped on mlx4_0:1
[hero35][[38845,1],0][connect/btl_openib_connect_udcm.c:501:udcm_component_query]
unavailable for use on mlx4_0:1; skipped

from btl_openib_connect_udcm.c

 438 static int udcm_component_query(mca_btl_openib_module_t *btl,
 439                                 opal_btl_openib_connect_base_module_t **cpc)
 440 {
 441     udcm_module_t *m = NULL;
 442     int rc = OPAL_ERR_NOT_SUPPORTED;
 443
 444     do {
 445         /* If we do not have struct ibv_device.transport_device, then
 446            we're in an old version of OFED that is IB only (i.e., no
 447            iWarp), so we can safely assume that we can use this CPC. */
 448 #if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE) && HAVE_DECL_IBV_LINK_LAYER_ETHERNET
 449         if (BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)) {
 450             BTL_VERBOSE(("UD CPC only supported on InfiniBand; skipped on %s:%d",
 451                          ibv_get_device_name(btl->device->ib_dev),
 452                          btl->port_num));
 453             break;
 454         }
 455 #endif

from base.h

#ifdef OPAL_HAVE_RDMAOE
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)                       \
        (((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) || \
        (IBV_LINK_LAYER_ETHERNET == ((btl)->ib_port_attr.link_layer))) ?   \
        true : false)
#else
#define BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)                       \
        ((IBV_TRANSPORT_IB != ((btl)->device->ib_dev->transport_type)) ?   \
        true : false)
#endif

So for RoCE the transport is InfiniBand and the link layer is Ethernet,
which makes NOT_IB() evaluate to true, meaning that udcm is evidently
not supported for RoCE.  udcm definitely fails under 1.10.4 for RoCE in
our tests.  That means we need rdmacm to work, which it evidently does
not at the moment in 2.0.1.  Could someone please verify that rdmacm
is not currently working in 2.0.1?  If so, I'm assuming that 2.0.1 has
not been successfully tested on RoCE?

                           Dave



> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 14 Dec 2016 21:12:16 -0700
> From: Nathan Hjelm <hje...@me.com>
> To: drdavetur...@gmail.com, Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] rdmacm and udcm failure in 2.0.1 on RoCE
> Message-ID: <32528c5d-14bc-42ce-b19a-684b81801...@me.com>
> Content-Type: text/plain; charset=utf-8
>
> Can you configure with --enable-debug and run with --mca btl_base_verbose
> 100 and provide the output? It may indicate why neither udcm nor rdmacm are
> available.
>
> -Nathan
>
>
> > On Dec 14, 2016, at 2:47 PM, Dave Turner <drdavetur...@gmail.com> wrote:
> >
> > --------------------------------------------------------------------------
> > No OpenFabrics connection schemes reported that they were able to be
> > used on a specific port.  As such, the openib BTL (OpenFabrics
> > support) will be disabled for this port.
> >
> >   Local host:           elf22
> >   Local device:         mlx4_2
> >   Local port:           1
> >   CPCs attempted:       rdmacm, udcm
> > --------------------------------------------------------------------------
> >
> > We have had no problems using 1.10.4 on RoCE but 2.0.1 fails to
> > find either connection manager.  I've read that rdmacm may have
> > issues under 2.0.1 so udcm may be the only one working.  Are there
> > any known issues with that on RoCE?  Or does this just mean we
> > don't have RoCE configured correctly?
> >
> >                   Dave Turner
> >
> > --
> > Work:     davetur...@ksu.edu     (785) 532-7791
> >              2219 Engineering Hall, Manhattan KS  66506
> > Home:    drdavetur...@gmail.com
> >               cell: (785) 770-5929
> > <ompi_info.2.0.1.all>_______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
> --
Work:     davetur...@ksu.edu     (785) 532-7791
             2219 Engineering Hall, Manhattan KS  66506
Home:    drdavetur...@gmail.com
              cell: (785) 770-5929

Attachment: NetPIPE.debug1
Description: Binary data

Attachment: NetPIPE.debug2
Description: Binary data

