Can these patches be added to the ucx package, please? This issue affects all Genoa clusters with InfiniBand.
Here's the type of error it causes:

root@rschhpc210:~# ucx_perftest
[1698428074.879303] [rschhpc210:13557:0] perftest.c:899 UCX WARN CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.3.8.219:54350
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: am latency |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 1048576 |
| AM header size: 0 |
+----------------------------------------------------------------------------------------------------------+
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 RC QP 0x3177 wqe[60241]: RDMA_READ s-- [rva 0x7fc08799c000 rkey 0x2f1b1] [va 0x7fc4e3f63000 len 1048576 lkey 0x1bdd26] [rqpn 0x102 dlid=33 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 13557) ====
 0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc4e5535fc4]
 1 /lib/x86_64-linux-gnu/libucs.so.0(ucs_fatal_error_message+0xb6) [0x7fc4e5536176]
 2 /lib/x86_64-linux-gnu/libucs.so.0(+0x25c9a) [0x7fc4e553ac9a]
 3 /lib/x86_64-linux-gnu/libucs.so.0(ucs_log_dispatch+0xe4) [0x7fc4e55344a4]
 4 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x5ed) [0x7fc4e509d6fd]
 5 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(+0x3eb16) [0x7fc4e50b9b16]
 6 /lib/x86_64-linux-gnu/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc4e55ed28a]
 7 ucx_perftest(+0x416de) [0x56329edf56de]
 8 ucx_perftest(+0x1ff92) [0x56329edd3f92]
 9 ucx_perftest(+0x82ea) [0x56329edbc2ea]
10 ucx_perftest(+0x5a94) [0x56329edb9a94]
11 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc4e5229d90]
12 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc4e5229e40]
13 ucx_perftest(+0x6375) [0x56329edba375]
=================================
Aborted (core dumped)

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2055222

Title:
  ucx library fails with Genoa CPUs and InfiniBand

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ucx/+bug/2055222/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
