After seeing several RMA failures with the change needed to get Open MPI
4.0.5 through IMB, I looked for simple tests.  So, I built the mpich
3.4b1 tests -- or the ones that would build, and I haven't checked why
some fail to build -- and ran the rma set.

Three out of 180 passed.  Many (most?) aborted in ucx, as I saw with
production code, with a backtrace like the one below; others at least reported
an MPI error.  This was on two nodes of a ppc64le RHEL7 IB system with
4.0.5, ucx 1.9, and MCA parameters from the ucx FAQ (though I got the
same result without those parameters).  I haven't tried to reproduce it
on x86_64, but it seems unlikely to be CPU-specific.
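
For anyone wanting to reproduce a tally like this, here is a minimal Python sketch of how the outcomes could be counted. It is not the harness used for the numbers above. The names `classify` and `tally` are hypothetical, and it assumes each test signals success through a zero exit status, which may not match how the mpich suite itself judges a pass. Note also that a launcher such as mpirun may turn a signal death in a rank into an ordinary nonzero exit code, in which case an abort would be counted as "error" here.

```python
import subprocess
from collections import Counter


def classify(returncode):
    """Map a process return code to a coarse outcome label (assumed scheme)."""
    if returncode == 0:
        return "passed"
    if returncode < 0:
        # Python's subprocess reports death by signal N as -N (POSIX only).
        return "aborted"
    return "error"


def tally(commands, timeout=60):
    """Run each command (a list of argv strings) and count the outcomes."""
    counts = Counter()
    for cmd in commands:
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            counts["timeout"] += 1
        except OSError:
            # For example, a test binary that failed to build and is missing.
            counts["not run"] += 1
        else:
            counts[classify(proc.returncode)] += 1
    return counts
```

With a list `names` of built test executables, something like `tally([["mpirun", "-n", "2", "./" + n] for n in names])` would give the counts for a two-process run.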

Is there anything we can do to run RMA without just moving to mpich?  Do
releases actually get tested on run-of-the-mill IB+Lustre systems?

+ mpirun -n 2 winname
[gpu005:50906:0:50906]  ucp_worker.c:183  Fatal: failed to set active message handler id 1: Invalid parameter
==== backtrace (tid:  50906) ====
 0 0x000000000005453c ucs_debug_print_backtrace()  .../src/ucs/debug/debug.c:656
 1 0x0000000000028218 ucp_worker_set_am_handlers()  .../src/ucp/core/ucp_worker.c:182
 2 0x0000000000029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:816
 3 0x0000000000029ae0 ucp_worker_iface_check_events()  .../src/ucp/core/ucp_worker.c:766
 4 0x0000000000029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:819
 5 0x0000000000029ae0 ucp_worker_iface_unprogress_ep()  .../src/ucp/core/ucp_worker.c:841
 6 0x00000000000582a8 ucp_wireup_ep_t_cleanup()  .../src/ucp/wireup/wireup_ep.c:381
 7 0x0000000000068124 ucs_class_call_cleanup_chain()  .../src/ucs/type/class.c:56
 8 0x0000000000057420 ucp_wireup_ep_t_delete()  .../src/ucp/wireup/wireup_ep.c:28
 9 0x0000000000013de8 uct_ep_destroy()  .../src/uct/base/uct_iface.c:546
10 0x00000000000252f4 ucp_proxy_ep_replace()  .../src/ucp/core/ucp_proxy_ep.c:236
11 0x0000000000057b88 ucp_wireup_ep_progress()  .../src/ucp/wireup/wireup_ep.c:89
12 0x0000000000049820 ucs_callbackq_slow_proxy()  .../src/ucs/datastruct/callbackq.c:400
13 0x000000000002ca04 ucs_callbackq_dispatch()  .../src/ucs/datastruct/callbackq.h:211
14 0x000000000002ca04 uct_worker_progress()  .../src/uct/api/uct.h:2346
15 0x000000000002ca04 ucp_worker_progress()  .../src/ucp/core/ucp_worker.c:2040
16 0x000000000000c144 progress_callback()  osc_ucx_component.c:0
17 0x00000000000374ac opal_progress()  ???:0
18 0x000000000006cc74 ompi_request_default_wait()  ???:0
19 0x00000000000e6fcc ompi_coll_base_sendrecv_actual()  ???:0
20 0x00000000000e5530 ompi_coll_base_allgather_intra_two_procs()  ???:0
21 0x0000000000006c44 ompi_coll_tuned_allgather_intra_dec_fixed()  ???:0
22 0x000000000000dc20 component_select()  osc_ucx_component.c:0
23 0x0000000000115b90 ompi_osc_base_select()  ???:0
24 0x0000000000075264 ompi_win_create()  ???:0
25 0x00000000000cb4e8 PMPI_Win_create()  ???:0
26 0x0000000010006ecc MTestGetWin()  .../mpich-3.4b1/test/mpi/util/mtest.c:1173
27 0x0000000010002e40 main()  .../mpich-3.4b1/test/mpi/rma/winname.c:25
28 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
29 0x00000000000253f4 __libc_start_main()  ???:0

followed by the abort backtrace
