We are upgrading a cluster from RHEL6 to RHEL8, and have migrated some nodes to a new partition and reimaged them with RHEL8. I am having issues getting OpenMPI to work with InfiniBand on the nodes upgraded to RHEL8.
For testing purposes, I am trying to run a simple MPI "hello world" code on
the local RHEL8 host (I am also having issues across multiple nodes, but am
trying to simplify the problem).
If I run with the BTL set to vader,self or tcp,self on the command line,
the MPI code runs as expected. If I set it to openib,self (or leave it
unset), the job just hangs indefinitely, e.g.
bash> mpirun -H localhost -v --mca mpi_cuda_support 0 --mca
btl_openib_verbose 1 --mca btl openib,self -n 1 --show-progress -d
--debug-daemons ./hello-world-mpi
[compute-a20-3.XXX.YYY.ZZZ:30383] procdir:
/tmp/ompi.compute-a20-3.34676/pid.30383/0/0
[compute-a20-3.XXX.YYY.ZZZ:30383] jobdir:
/tmp/ompi.compute-a20-3.34676/pid.30383/0
[compute-a20-3.XXX.YYY.ZZZ:30383] top:
/tmp/ompi.compute-a20-3.34676/pid.30383
[compute-a20-3.XXX.YYY.ZZZ:30383] top: /tmp/ompi.compute-a20-3.34676
[compute-a20-3.XXX.YYY.ZZZ:30383] tmp: /tmp
[compute-a20-3.XXX.YYY.ZZZ:30383] sess_dir_cleanup: job session dir does
not exist
[compute-a20-3.XXX.YYY.ZZZ:30383] sess_dir_cleanup: top session dir not
empty - leaving
[compute-a20-3.XXX.YYY.ZZZ:30383] procdir:
/tmp/ompi.compute-a20-3.34676/pid.30383/0/0
[compute-a20-3.XXX.YYY.ZZZ:30383] jobdir:
/tmp/ompi.compute-a20-3.34676/pid.30383/0
[compute-a20-3.XXX.YYY.ZZZ:30383] top:
/tmp/ompi.compute-a20-3.34676/pid.30383
[compute-a20-3.XXX.YYY.ZZZ:30383] top: /tmp/ompi.compute-a20-3.34676
[compute-a20-3.XXX.YYY.ZZZ:30383] tmp: /tmp
[compute-a20-3.XXX.YYY.ZZZ:30383] [[29315,0],0] orted_cmd: received
add_local_procs
[compute-a20-3.XXX.YYY.ZZZ:30383] [[29315,0],0] Releasing job data for
[INVALID]
App launch reported: 1 (out of 1) daemons - 0 (out of 1) procs
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 1
MPIR_proctable:
(i, host, exe, pid) = (0, compute-a20-3,
/software/hello-world/1.0/gcc/8.4.0/openmpi/3.1.5/linux-rhel8-x86_64/bin/./hello-world-mpi,
30387)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[compute-a20-3.XXX.YYY.ZZZ:30387] procdir:
/tmp/ompi.compute-a20-3.34676/pid.30383/1/0
[compute-a20-3.XXX.YYY.ZZZ:30387] jobdir:
/tmp/ompi.compute-a20-3.34676/pid.30383/1
[compute-a20-3.XXX.YYY.ZZZ:30387] top:
/tmp/ompi.compute-a20-3.34676/pid.30383
[compute-a20-3.XXX.YYY.ZZZ:30387] top: /tmp/ompi.compute-a20-3.34676
[compute-a20-3.XXX.YYY.ZZZ:30387] tmp: /tmp
[compute-a20-3][[29315,1],0][btl_openib_ini.c:172:opal_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 4099
[compute-a20-3][[29315,1],0][btl_openib_ini.c:188:opal_btl_openib_ini_query]
Found corresponding INI values: Mellanox Hermon
[compute-a20-3][[29315,1],0][btl_openib_ini.c:172:opal_btl_openib_ini_query]
Querying INI files for vendor 0x0000, part ID 0
[compute-a20-3][[29315,1],0][btl_openib_ini.c:188:opal_btl_openib_ini_query]
Found corresponding INI values: default
At this point the code just hangs indefinitely. I see a PID 30387 named
hello-world-mpi with 3 threads, which is consuming ~100% of a CPU core, but
strace just shows it making epoll_wait calls.
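For reference, this is roughly how I have been inspecting the hung process (the PID is the one from the run above; substitute whatever ps reports on your end):

```shell
# PID of the hung hello-world-mpi process (substitute the one ps reports)
PID=30387

# Show the individual threads and which kernel function each is waiting in
ps -L -p "$PID" -o lwp,stat,pcpu,wchan:20,comm || true

# Confirm the busy epoll_wait loop; Ctrl-C detaches without killing the process
strace -f -p "$PID" -e trace=epoll_wait 2>&1 | head -20
```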
The "Releasing job data for [INVALID]" looks suspicious, but looking at
the source code I think that is just because I am running outside of a
scheduler, so there is no job number. I suspect the problem is the 0 in the line
App launch reported: 1 (out of 1) daemons - 0 (out of 1) procs
but I am at a loss as to why, or how to fix it.
I can run the same example above on one of the nodes still on RHEL6
(compiled against the OpenMPI we have on that system) and it works as expected.
I am able to run ibv_rc_pingpong between nodes (between a pair of RHEL8
nodes, a pair of RHEL6 nodes, a mixed RHEL6/RHEL8 pair, and of course
within a single node), so I do not see any obvious InfiniBand issues.
If anyone could give suggestions/tips/ideas on how to proceed with
diagnosing/fixing this issue, I would be grateful. Thanks in advance for any
suggestions.
================================================
System/etc details
================================================
The issue is occurring on a RHEL8 system, specifically 8.1 with kernel
4.18.0-147.5.1.el8_1.x86_64,
running OpenMPI 3.1.5 (built with gcc 8.4.0 using Spack).
The issue is in the openib BTL (the vader and tcp BTLs seem to be working)
and uses the OpenFabrics stack from Mellanox
(libibverbs-41mlnx1-OFED.5.0.0.0.9.50100.0.src.rpm).
We are using a subnet manager running on a Mellanox FDR IB switch
(SX_PPC_M460EX)
The "working" RHEL6 systems are running 6.10, kernel
2.6.32-754.25.1.el6.x86_64, with OpenMPI 1.10.2 (built with gcc 6.1.0).
The memorylocked limit on both RHEL8 and RHEL6 is unlimited.
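For the record, this is how I checked the locked-memory limit (the second command is there because the daemons launched over ssh can inherit a different limit than an interactive login shell):

```shell
# Locked-memory limit in the current shell; should be "unlimited" or very
# large for the openib BTL to be able to register memory for RDMA
ulimit -l

# The limit seen by a non-interactive shell, which is what orted inherits
# when launched over ssh (may differ from the login shell)
ssh localhost 'ulimit -l' 2>/dev/null || true
```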
On the RHEL8 node, ibv_devinfo returns:
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:0070:1c80
sys_image_guid: f452:1403:0070:1c83
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id: DEL0A30000019
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 532
port_lid: 536
port_lmc: 0x00
link_layer: InfiniBand
(The "working" RHEL6 system has an essentially identical result from
ibv_devinfo,
with the exception of different values for node_guid, sys_image_guid, and port_lid.)
The output of ompi_info --all on the RHEL8 node is attached. As indicated
earlier, I am running on the same node that the mpirun command is issued from.
The result of ifconfig -a on the RHEL8 node is:
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.103.132.13 netmask 255.255.224.0 broadcast 10.103.159.255
inet6 fe80::3617:ebff:fee6:6a31 prefixlen 64 scopeid 0x20<link>
ether 34:17:eb:e6:6a:31 txqueuelen 1000 (Ethernet)
RX packets 1599943 bytes 345477382 (329.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2147871 bytes 3010964444 (2.8 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0x91120000-9113ffff
eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 34:17:eb:e6:6a:32 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0x91100000-9111ffff
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 192.168.68.13 netmask 255.255.224.0 broadcast 192.168.95.255
inet6 fe80::f652:1403:70:1c81 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in
ifconfig(8).
infiniband
A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256
(InfiniBand)
RX packets 49701 bytes 45121502 (43.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 25427 bytes 5740480 (5.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 476287 bytes 23889166 (22.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 476287 bytes 23889166 (22.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads [email protected]
5825 University Research Park (301) 405-6135
University of Maryland
College Park, MD 20740-3831
ompi-info.all.a20-3.bz2
Description: application/bzip
