I created a gist with my config.log and other files:

From: users <users-boun...@lists.open-mpi.org> on behalf of "Kenny, Joseph P 
via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Thursday, November 29, 2018 at 9:27 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: "Kenny, Joseph P" <jpke...@sandia.gov>
Subject: [EXTERNAL] [OMPI users] Trouble verifying btl for tcp and RoCE


I’m trying to do some RoCE benchmarking on a cluster with Mellanox HCAs:
02:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

I’m finding it quite challenging to work out which btl is actually being used 
from Open MPI’s debug output.  I’m using Open MPI 4.0.0 (along with a handful 
of older releases).  For example, here’s a command line that I use to run a 
16-node HPL test, trying to ensure that internode communication goes over a 
RoCE-capable btl rather than tcp:

/home/jpkenny/install/openmpi-4.0.0-carnac/bin/mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 64 -N 4 -hostfile hosts.txt ./xhpl

Among the interesting debug messages I see are messages of the form:
[en257.eth:118902] openib BTL: rdmacm CPC unavailable for use on mlx5_0:1; 
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           en254
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
[en262.eth:103810] select: init of component openib returned failure
[en264.eth:171198] select: init of component openib returned failure
[en264.eth:171198] mca: base: close: component openib closed
[en264.eth:171198] mca: base: close: unloading component openib
[en264.eth:171198] select: initializing btl component uct
[en264.eth:171198] select: init of component uct returned failure
[en264.eth:171198] mca: base: close: component uct closed
[en264.eth:171198] mca: base: close: unloading component uct

So, it looks to me like openib and uct transports are both failing, yet when I 
read out rdma counters with ethtool I see that the bulk of the traffic is going 
over rdma somehow (eth2 is the MT27800):
ib counters before:
     rx_vport_rdma_unicast_packets: 115943830
     rx_vport_rdma_unicast_bytes: 195602189248
     tx_vport_rdma_unicast_packets: 273170117
     tx_vport_rdma_unicast_bytes: 374057100818
eth0 counters before:
        RX packets 87474728  bytes 43335706060 (40.3 GiB)
        TX packets 61137838  bytes 71187999781 (66.2 GiB)
eth2 counters before:
        RX packets 49490077  bytes 81084834515 (75.5 GiB)
        TX packets 532970764  bytes 1742134134428 (1.5 TiB)
ib counters after:
     rx_vport_rdma_unicast_packets: 117188033
     rx_vport_rdma_unicast_bytes: 200088022302
     tx_vport_rdma_unicast_packets: 274456328
     tx_vport_rdma_unicast_bytes: 378587627052
eth0 counters after:
        RX packets 87481208  bytes 43336915153 (40.3 GiB)
        TX packets 61143485  bytes 71189606766 (66.3 GiB)
eth2 counters after:
        RX packets 49490077  bytes 81084834515 (75.5 GiB)
        TX packets 532970764  bytes 1742134134428 (1.5 TiB)
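
As a rough check, the before/after readings above can be differenced with a few lines of Python. The values are copied directly from the readings. The helper name counter_deltas and the eth0_*/eth2_* key names are illustrative labels chosen for this sketch, not ethtool or ifconfig field names.

```python
# Byte and packet counters copied from the "before" and "after" readings above.
# The eth0_*/eth2_* keys are made-up labels for the ifconfig byte counts.
before = {
    "rx_vport_rdma_unicast_packets": 115943830,
    "rx_vport_rdma_unicast_bytes": 195602189248,
    "tx_vport_rdma_unicast_packets": 273170117,
    "tx_vport_rdma_unicast_bytes": 374057100818,
    "eth0_rx_bytes": 43335706060,
    "eth0_tx_bytes": 71187999781,
    "eth2_rx_bytes": 81084834515,
    "eth2_tx_bytes": 1742134134428,
}
after = {
    "rx_vport_rdma_unicast_packets": 117188033,
    "rx_vport_rdma_unicast_bytes": 200088022302,
    "tx_vport_rdma_unicast_packets": 274456328,
    "tx_vport_rdma_unicast_bytes": 378587627052,
    "eth0_rx_bytes": 43336915153,
    "eth0_tx_bytes": 71189606766,
    "eth2_rx_bytes": 81084834515,
    "eth2_tx_bytes": 1742134134428,
}


def counter_deltas(before, after):
    """Return after - before for every counter present in `before`."""
    return {name: after[name] - before[name] for name in before}


deltas = counter_deltas(before, after)
for name, delta in deltas.items():
    print(f"{name:32s} {delta:>15,d}")
```

With these readings, the RDMA vport counters grow by roughly 4.5 GB in each direction during the run, eth0 moves less than 2 MB each way, and the eth2 interface counters do not change at all.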

Yet, looking at the debug output after xhpl runs, I only see vader and self 
getting unloaded.  The evidence suggests that there is no working internode 
btl, yet the job runs properly and it looks like RDMA transfers are occurring.  
Equally perplexing behavior is observed when I exclude openib/uct and expect to 
run over tcp.  What’s actually going on here?

I’ll attach output from ompi_info along with the debug output that I’m 
referring to.  I tried to include a compressed config.log, but the message was 
too big.

