Hi there,

I’m new to UCX: I had built OpenMPI without it, ran some tests, and got a 
warning about that. So I installed UCX from the distribution:

[novosirj@amarel-test2 ~]$ rpm -qa ucx
ucx-1.5.2-1.el7.x86_64

…and rebuilt OpenMPI, which built fine. However, I’m now getting some fairly 
unhelpful messages about the IB card not being used. After searching around, I 
set a couple of environment variables to get a little more information:

export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100

Here’s what happens:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
srun: job 13993927 queued and waiting for resources
srun: job 13993927 has been allocated resources
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--------------------------------------------------------------------------
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
[gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
Hello world from processor gpu004.amarel.rutgers.edu, rank 0 out of 2 processors
Hello world from processor gpu004.amarel.rutgers.edu, rank 1 out of 2 processors
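
Since the verbose output above is long and repeats per rank, here is a minimal 
sketch of a shell filter that boils it down to the unique transport/device 
pairs reported as "did not match transport list". The function name 
summarize_rejected and the file name ucx-verbose.log are hypothetical (not part 
of OpenMPI or UCX), and the filter assumes the message format is exactly as 
shown in the log above. It handles entries whether or not they are wrapped 
across lines.

```shell
# summarize_rejected (hypothetical helper): read pml_ucx verbose output on
# stdin and print each transport/device pair that was reported as
# "did not match transport list", once per pair.
summarize_rejected() {
    # Join the input into a single line and squeeze repeated spaces, so that
    # log entries wrapped across several lines are still recognized.
    tr '\n' ' ' | tr -s ' ' \
      | grep -oE '[A-Za-z0-9_]+/[A-Za-z0-9_:]+: did not match transport list' \
      | sed 's/: did not match transport list$//' \
      | LC_ALL=C sort -u
}
```

Assuming the srun output was saved to ucx-verbose.log, it could be run as 
`summarize_rejected < ucx-verbose.log`. This only condenses the messages; it 
does not say why the transports were rejected.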

Here’s the output of a couple more commands that seem to be recommended when 
looking into this:

[novosirj@gpu004 ~]$ ucx_info -d
#
# Memory domain: self
#            component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 8 bytes
#
#   Transport: self
#
#   Device: self
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8k
#             am_bcopy: <= 8k
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: tcp
#            component: tcp
#
#   Transport: tcp
#
#   Device: eno1
#
#      capabilities:
#            bandwidth: 113.16 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#             am_bcopy: <= 8k
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#   Device: ib0
#
#      capabilities:
#            bandwidth: 6239.81 MB/sec
#              latency: 5210 nsec
#             overhead: 50000 nsec
#             am_bcopy: <= 8k
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Memory domain: ib/mlx4_0
#            component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 16 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc
#
#   Device: mlx4_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 900 nsec + 1 * N
#             overhead: 75 nsec
#            put_short: <= 88
#            put_bcopy: <= 8k
#            put_zcopy: <= 1g, up to 6 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 2k
#            get_bcopy: <= 8k
#            get_zcopy: 33..1g, up to 6 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 2k
#             am_short: <= 87
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 2k
#            am header: <= 127
#               domain: device
#           connection: to ep
#             priority: 10
#       device address: 3 bytes
#           ep address: 4 bytes
#       error handling: peer failure
#
#
#   Transport: ud
#
#   Device: mlx4_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 910 nsec
#             overhead: 105 nsec
#             am_short: <= 172
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#            am header: <= 3984
#           connection: to ep, to iface
#             priority: 10
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
# Memory domain: rdmacm
#            component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: sysv
#            component: sysv
#             allocate: unlimited
#           remote key: 32 bytes
#
#   Transport: mm
#
#   Device: sysv
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8k
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: posix
#            component: posix
#             allocate: unlimited
#           remote key: 37 bytes
#
#   Transport: mm
#
#   Device: posix
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8k
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: cma
#            component: cma
#             register: unlimited, cost: 9 nsec
#
#   Transport: cma
#
#   Device: cma
#
#      capabilities:
#            bandwidth: 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#

[novosirj@gpu004 ~]$ ucx_info -p -u t
#
# UCP context
#
#            md 0  :  self
#            md 1  :  tcp
#            md 2  :  ib/mlx4_0
#            md 3  :  rdmacm
#            md 4  :  sysv
#            md 5  :  posix
#            md 6  :  cma
#
#      resource 0  :  md 0  dev 0  flags -- self/self
#      resource 1  :  md 1  dev 1  flags -- tcp/eno1
#      resource 2  :  md 1  dev 2  flags -- tcp/ib0
#      resource 3  :  md 2  dev 3  flags -- rc/mlx4_0:1
#      resource 4  :  md 2  dev 3  flags -- ud/mlx4_0:1
#      resource 5  :  md 3  dev 4  flags -s rdmacm/sockaddr
#      resource 6  :  md 4  dev 5  flags -- mm/sysv
#      resource 7  :  md 5  dev 6  flags -- mm/posix
#      resource 8  :  md 6  dev 7  flags -- cma/cma
#
# memory: 0.84MB, file descriptors: 2
# create time: 5.032 ms
#

Thanks for any help you can offer. What am I missing?

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
    `'
