Hi,

A couple of our users have reported issues using UCX with OpenMPI 3.1.2. Their 
jobs are failing with this message:

[r1071:27563:0:27563] rc_verbs_iface.c:63   FATAL: send completion with error: 
local protection error

The actual MPI calls provoking this differ between the two applications (one is 
an MPI_Bcast, the other an MPI_Waitany), but in both cases the error is raised 
from the progress engine while waiting on requests (ompi_request_default_wait_all 
in the first trace, ompi_request_default_wait_any in the second); a stripped-down 
sketch of the two call patterns follows the traces:

 0 0x00000000000373dc ucs_log_dispatch()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
 1 0x00000000000368ff uct_rc_verbs_iface_poll_tx()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
 2 0x00000000000368ff uct_rc_verbs_iface_progress()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
 3 0x00000000000179d2 ucs_callbackq_dispatch()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
 4 0x0000000000018e0a uct_worker_progress()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
 5 0x00000000000050a9 mca_pml_ucx_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
 6 0x000000000002b554 opal_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
 7 0x000000000004a7fa sync_wait_st()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../opal/threads/wait_sync.h:83
 8 0x000000000004b073 ompi_request_default_wait_all()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:237
 9 0x00000000000ce548 ompi_coll_base_bcast_intra_generic()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:98
10 0x00000000000ced08 ompi_coll_base_bcast_intra_pipeline()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:280
11 0x0000000000004f28 ompi_coll_tuned_bcast_intra_dec_fixed()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:303
12 0x0000000000067b60 PMPI_Bcast()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pbcast.c:111

and

 0 0x00000000000373dc ucs_log_dispatch()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
 1 0x00000000000368ff uct_rc_verbs_iface_poll_tx()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
 2 0x00000000000368ff uct_rc_verbs_iface_progress()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
 3 0x00000000000179d2 ucs_callbackq_dispatch()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
 4 0x0000000000018e0a uct_worker_progress()  
/short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
 5 0x0000000000005099 mca_pml_ucx_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
 6 0x000000000002b554 opal_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
 7 0x00000000000331cc ompi_sync_wait_mt()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/threads/wait_sync.c:85
 8 0x000000000004ad0b ompi_request_default_wait_any()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:131
 9 0x00000000000b91ab PMPI_Waitany()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pwaitany.c:83
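
The call patterns are roughly of this shape. To be clear, this is a stripped-down 
sketch I put together for illustration, not the users’ actual code; the message 
sizes, tags, and communication pattern are all made up:

#include <mpi.h>

/* Pattern 1: a plain collective, as in the MPI_Bcast application. */
static void bcast_pattern(MPI_Comm comm)
{
    static double buf[4096];                     /* made-up payload size */
    MPI_Bcast(buf, 4096, MPI_DOUBLE, 0, comm);   /* first trace: bcast -> wait_all -> progress */
}

/* Pattern 2: pre-posted nonblocking receives serviced with MPI_Waitany,
 * as in the second application. */
static void waitany_pattern(MPI_Comm comm, int rank, int size)
{
    enum { NREQ = 8, COUNT = 4096 };             /* made-up counts */
    static double rbuf[NREQ][COUNT], sbuf[COUNT];
    MPI_Request reqs[NREQ];
    int i, done, idx;

    for (i = 0; i < NREQ; i++)                   /* pre-post the receives */
        MPI_Irecv(rbuf[i], COUNT, MPI_DOUBLE, MPI_ANY_SOURCE, i, comm, &reqs[i]);

    for (i = 0; i < NREQ; i++)                   /* each rank feeds its neighbour */
        MPI_Send(sbuf, COUNT, MPI_DOUBLE, (rank + 1) % size, i, comm);

    for (done = 0; done < NREQ; done++)
        MPI_Waitany(NREQ, reqs, &idx, MPI_STATUS_IGNORE);  /* second trace ends up here */
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    bcast_pattern(MPI_COMM_WORLD);
    waitany_pattern(MPI_COMM_WORLD, rank, size);
    MPI_Finalize();
    return 0;
}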

I’m not sure whether this is an issue with the ucx PML or with UCX itself, 
though. In both cases, disabling the ucx PML and using yalla or ob1 instead 
(e.g. running with --mca pml ob1) works fine. Has anyone else seen this?

Thanks,
Ben
