Could these patches be added to the ucx package, please? This issue
affects all Genoa clusters with InfiniBand.

Here's the type of error it causes:

root@rschhpc210:~# ucx_perftest
[1698428074.879303] [rschhpc210:13557:0] perftest.c:899 UCX WARN CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.3.8.219:54350
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: am latency |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 1048576 |
| AM header size: 0 |
+----------------------------------------------------------------------------------------------------------+
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 RC QP 0x3177 wqe[60241]: RDMA_READ s-- [rva 0x7fc08799c000 rkey 0x2f1b1] [va 0x7fc4e3f63000 len 1048576 lkey 0x1bdd26] [rqpn 0x102 dlid=33 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 13557) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc4e5535fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(ucs_fatal_error_message+0xb6) [0x7fc4e5536176]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x25c9a) [0x7fc4e553ac9a]
3 /lib/x86_64-linux-gnu/libucs.so.0(ucs_log_dispatch+0xe4) [0x7fc4e55344a4]
4 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x5ed) [0x7fc4e509d6fd]
5 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(+0x3eb16) [0x7fc4e50b9b16]
6 /lib/x86_64-linux-gnu/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc4e55ed28a]
7 ucx_perftest(+0x416de) [0x56329edf56de]
8 ucx_perftest(+0x1ff92) [0x56329edd3f92]
9 ucx_perftest(+0x82ea) [0x56329edbc2ea]
10 ucx_perftest(+0x5a94) [0x56329edb9a94]
11 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc4e5229d90]
12 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc4e5229e40]
13 ucx_perftest(+0x6375) [0x56329edba375]
=================================
Aborted (core dumped)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2055222

Title:
  ucx library fails with Genoa CPUs and InfiniBand
