You could try running with `-x UCC_LOG_LEVEL=info` (add this to your mpirun
command) to get additional info from the UCC initialization steps. However,
the configure options you listed for Open MPI do not indicate that it was
built with UCC support. Where did you find those configure options?
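
For example, something along these lines (using the hosts and test binary
from your earlier message):

    mpirun -x UCC_LOG_LEVEL=info --host hades1,hades2 \
        /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c

You can also check whether your build actually contains the UCC component
with something like:

    ompi_info | grep -i ucc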

  George.


On Tue, Dec 9, 2025 at 4:26 PM 'Collin Strassburger' via Open MPI users <
[email protected]> wrote:

> Hello Howard,
>
>
>
> Thanks for the info!
>
>
>
> I’ll look into getting in touch with the groups you mentioned 😊
>
>
>
> Warm regards,
>
> Collin Strassburger (he/him)
>
>
>
> *From:* 'Pritchard Jr., Howard' via Open MPI users <
> [email protected]>
> *Sent:* Tuesday, December 9, 2025 4:24 PM
> *To:* [email protected]
> *Subject:* Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting
>
>
>
> Hi Collin,
>
>
>
> Well, I would hope that at scale (10s of nodes) UCC would provide some
> benefit.
>
>
>
> I’d suggest getting in touch with someone on the Nvidia payroll to figure
> out what may be going on with UCC initialization on your system.
>
> Or there’s a UCX mail list with a UCC WG community that may be able to
> help you.
>
>
>
> See https://elist.ornl.gov/mailman/listinfo/ucx-group to sign up.
>
>
>
> It is not a noisy mail list.
>
>
>
> Howard
>
>
>
> *From: *'Collin Strassburger' via Open MPI users <[email protected]
> >
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, December 9, 2025 at 2:10 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting
>
>
>
> Hello Howard,
>
>
>
> Running with export OMPI_MCA_coll=^ucc resulted in a working run of the
> code!
>
>
>
> Are there any downsides to using OMPI_MCA_coll=^ucc to side-step this
> issue?
>
>
>
>
>
> Warm regards,
>
> Collin Strassburger (he/him)
>
>
>
> *From:* 'Pritchard Jr., Howard' via Open MPI users <
> [email protected]>
> *Sent:* Tuesday, December 9, 2025 3:54 PM
> *To:* [email protected]
> *Subject:* Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting
>
>
>
> Hi Collin,
>
>
>
> This is much more helpful.
>
>
>
> Let’s first try to turn off “optimizations”.
>
> Could you rerun with the following MCA param set?
>
>
>
> export OMPI_MCA_coll=^ucc
>
>
>
> and see if that helps?
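>
> (Equivalently, the same setting can be passed on the mpirun command line,
> e.g. something like "mpirun --mca coll ^ucc --host hades1,hades2 <your app>".)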
>
>
>
> Also this points to possible problems with your system’s IB network setup.
>
>
>
> Howard
>
>
>
>
>
> *From: *'Collin Strassburger' via Open MPI users <[email protected]
> >
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, December 9, 2025 at 1:50 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting
>
>
>
> Hit “enter” a little too soon.
>
>
>
> Here’s the rest that was intended to be included:
>
> (gdb) bt full
>
> #0  __GI___pthread_mutex_unlock_usercnt (decr=1, mutex=<optimized out>) at
> ./nptl/pthread_mutex_unlock.c:72
>
>         type = <optimized out>
>
>         type = <optimized out>
>
>         __PRETTY_FUNCTION__ = "__pthread_mutex_unlock_usercnt"
>
>         __value = <optimized out>
>
>         __value = <optimized out>
>
> #1  ___pthread_mutex_unlock (mutex=<optimized out>) at
> ./nptl/pthread_mutex_unlock.c:368
>
> No locals.
>
> #2  0x00007757365728f3 in evthread_posix_unlock () from
> /opt/hpcx/ompi/lib/libopen-pal.so.40
>
> No symbol table info available.
>
> #3  0x000077573656b8e8 in opal_libevent2022_event_base_loop () from
> /opt/hpcx/ompi/lib/libopen-pal.so.40
>
> No symbol table info available.
>
> #4  0x000077573651d216 in opal_progress_events () at
> runtime/opal_progress.c:191
>
>         now = <optimized out>
>
>         events = <optimized out>
>
>         lock = 1
>
>         events = <optimized out>
>
>         now = <optimized out>
>
> #5  opal_progress_events () at runtime/opal_progress.c:172
>
>         events = 0
>
>         lock = 1
>
>         now = <optimized out>
>
> #6  0x000077573651d374 in opal_progress () at runtime/opal_progress.c:247
>
>         num_calls = 486064400
>
>         i = <optimized out>
>
>         events = <optimized out>
>
> #7  0x0000775736a0fb9b in ompi_request_default_test_all (count=<optimized
> out>, requests=0x559820b67528,
>
> --Type <RET> for more, q to quit, c to continue without paging--
>
>     completed=<optimized out>, statuses=<optimized out>) at
> request/req_test.c:196
>
>         i = <optimized out>
>
>         rc = <optimized out>
>
>         rptr = <optimized out>
>
>         num_completed = <optimized out>
>
>         request = <optimized out>
>
> #8  0x00007757344d29f1 in oob_allgather_test (req=0x559820b67500) at
> coll_ucc_module.c:192
>
>         oob_req = 0x559820b67500
>
>         comm = <optimized out>
>
>         tmpsend = <optimized out>
>
>         tmprecv = <optimized out>
>
>         msglen = <optimized out>
>
>         probe_count = 5
>
>         rank = <optimized out>
>
>         size = <optimized out>
>
>         sendto = 0
>
>         recvfrom = 0
>
>         recvdatafrom = <optimized out>
>
>         senddatafrom = <optimized out>
>
>         completed = 0
>
>         probe = <optimized out>
>
> #9  0x00007757344a1a5c in ucc_core_addr_exchange (context=context@entry
> =0x5598209637f0,
>
>     oob=oob@entry=0x559820963808, 
> addr_storage=addr_storage@entry=0x559820963900)
> at core/ucc_context.c:461
>
>         addr_lens = <optimized out>
>
>         attr = {mask = 12, type = UCC_CONTEXT_EXCLUSIVE, sync_type =
> UCC_NO_SYNC_COLLECTIVES,
>
>           ctx_addr = 0x55982060bb00, ctx_addr_len = 467,
> global_work_buffer_size = 8589934593}
>
>         status = <optimized out>
>
>         i = <optimized out>
>
>         max_addrlen = <optimized out>
>
>         poll = <optimized out>
>
>         __func__ = "ucc_core_addr_exchange"
>
> #10 0x00007757344a2657 in ucc_context_create_proc_info (lib=0x559820962900,
>
>     params=params@entry=0x7fff7b0e04f0, config=0x559820962690,
>
>     context=context@entry=0x7757344df3c8 <mca_coll_ucc_component+392>,
>
>     proc_info=0x7757344cfa60 <ucc_local_proc>) at core/ucc_context.c:723
>
>         topo_required = 1
>
>         created_ctx_counter = <optimized out>
>
>         b_params = {params = {mask = 4, type = UCC_CONTEXT_EXCLUSIVE,
> sync_type = UCC_NO_SYNC_COLLECTIVES,
>
>             oob = {allgather = 0x7757344d2a40 <oob_allgather>, req_test =
> 0x7757344d2800 <oob_allgather_test>,
>
>               req_free = 0x7757344d27e0 <oob_allgather_free>,
>
>               coll_info = 0x5597f1b1c020 <ompi_mpi_comm_world>, n_oob_eps
> = 2, oob_ep = 1}, ctx_id = 0,
>
>             mem_params = {segments = 0x0, n_segments = 0}},
> estimated_num_eps = 2, estimated_num_ppn = 1,
>
> --Type <RET> for more, q to quit, c to continue without paging--
>
>           thread_mode = UCC_THREAD_SINGLE, prefix = 0x559820963630
> "OMPI_UCC_", context = 0x5598209637f0}
>
>         b_ctx = 0x559820967580
>
>         c_attr = {attr = {mask = 0, type = UCC_CONTEXT_EXCLUSIVE,
> sync_type = UCC_NO_SYNC_COLLECTIVES,
>
>             ctx_addr = 0x0, ctx_addr_len = 0, global_work_buffer_size =
> 0}, topo_required = 1}
>
>         l_attr = {super = {mask = 0, attr = {mask = 0, thread_mode =
> UCC_THREAD_SINGLE, coll_types = 2172,
>
>               reduction_types = 0, sync_type = UCC_NO_SYNC_COLLECTIVES},
> min_team_size = 0, max_team_size = 0,
>
>             flags = 2}, tls = 0x559820962d20, tls_forced = 0x559820962bd0}
>
>         cl_lib = <optimized out>
>
>         tl_ctx = <optimized out>
>
>         tl_lib = <optimized out>
>
>         ctx = 0x5598209637f0
>
>         status = <optimized out>
>
>         i = <optimized out>
>
>         j = <optimized out>
>
>         n_tl_ctx = <optimized out>
>
>         num_cls = <optimized out>
>
>         __func__ = "ucc_context_create_proc_info"
>
>         error = <optimized out>
>
> #11 0x00007757344a31f0 in ucc_context_create (lib=<optimized out>,
> params=params@entry=0x7fff7b0e04f0,
>
>     config=<optimized out>, context=context@entry=0x7757344df3c8
> <mca_coll_ucc_component+392>)
>
>     at core/ucc_context.c:866
>
> No locals.
>
> #12 0x00007757344d2c81 in mca_coll_ucc_init_ctx () at coll_ucc_module.c:294
>
>         cm = <optimized out>
>
>         str_buf = "1\000", 'A' <repeats 30 times>, '\000' <repeats 17
> times>, "\006\016{\377\177\000\000\325\301j6Ww\000\000", '\032' <repeats 32
> times>, "3\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000
> ;\2006Ww\000\000\000\003\000\000\000\000\000\000
> \000\354\241G\221VU\214
> N\304j6Ww\000\000\300:\2006Ww\000\000\000\354\241G\221VU\214\220\301c
> \230U\000\000 4O4Ww\000\000@\006\016{\377\177\000\000"...
>
>         del_fn = <optimized out>
>
>         copy_fn = <optimized out>
>
>         lib_config = 0x5598209630b0
>
>         ctx_config = 0x559820962690
>
>         tm_requested = <optimized out>
>
>         lib_params = {mask = 1, thread_mode = UCC_THREAD_SINGLE,
> coll_types = 13816973012072644543,
>
>           reduction_types = 13816973012072644576, sync_type =
> UCC_NO_SYNC_COLLECTIVES}
>
>         ctx_params = {mask = 4, type = UCC_CONTEXT_EXCLUSIVE, sync_type =
> UCC_NO_SYNC_COLLECTIVES, oob = {
>
>             allgather = 0x7757344d2a40 <oob_allgather>, req_test =
> 0x7757344d2800 <oob_allgather_test>,
>
>             req_free = 0x7757344d27e0 <oob_allgather_free>, coll_info =
> 0x5597f1b1c020 <ompi_mpi_comm_world>,
>
>             n_oob_eps = 2, oob_ep = 1}, ctx_id = 0, mem_params = {segments
> = 0x0, n_segments = 0}}
>
>         __FUNCTION__ = "mca_coll_ucc_init_ctx"
>
> #13 0x00007757344d45df in mca_coll_ucc_comm_query (comm=0x5597f1b1c020
> <ompi_mpi_comm_world>,
>
>     priority=0x7fff7b0e06fc) at coll_ucc_module.c:480
>
> --Type <RET> for more, q to quit, c to continue without paging--
>
>         cm = <optimized out>
>
>         ucc_module = <optimized out>
>
> #14 0x0000775736a41d9c in query_2_0_0 (module=<synthetic pointer>,
> priority=0x7fff7b0e06fc,
>
>     comm=0x5597f1b1c020 <ompi_mpi_comm_world>, component=0x7757344df240
> <mca_coll_ucc_component>)
>
>     at base/coll_base_comm_select.c:540
>
>         ret = <optimized out>
>
> #15 query (module=<synthetic pointer>, priority=0x7fff7b0e06fc,
> comm=<optimized out>,
>
>     component=0x7757344df240 <mca_coll_ucc_component>) at
> base/coll_base_comm_select.c:523
>
>         coll100 = 0x7757344df240 <mca_coll_ucc_component>
>
> #16 check_one_component (module=<synthetic pointer>,
> component=0x7757344df240 <mca_coll_ucc_component>,
>
>     comm=<optimized out>) at base/coll_base_comm_select.c:486
>
>         err = <optimized out>
>
>         priority = 0
>
>         err = <optimized out>
>
>         priority = <optimized out>
>
> #17 check_components (comm=comm@entry=0x5597f1b1c020
> <ompi_mpi_comm_world>, components=<optimized out>)
>
>     at base/coll_base_comm_select.c:406
>
>         priority = <optimized out>
>
>         flag = 0
>
>         count_include = 0
>
>         component = 0x7757344df240 <mca_coll_ucc_component>
>
>         cli = 0x55982061b6b0
>
>         module = 0x0
>
>         selectable = 0x5598209971a0
>
>         avail = <optimized out>
>
>         info_val = "-", '\277' <repeats 23 times>,
> "\340\277\277\277\277\277\277\277", '\000' <repeats 32 times>,
> "-\277-.\277.%%\277\344\"\277318\277,
> 8!$\277.-\277\344#\277;\3442\277\360\a\016{\377\177\000\000\325\301j6Ww",
> '\000' <repeats 18 times>,
> "AAAAAAAA\000\354\241G\221VU\214\v\000\000\000AAAAB\000\000\000\000\000\000\000
> ;\2006Ww\000\000\210\247\231 \230U\000\000
> ;\2006Ww\000\000\000\000\000\000\000\000\000\000\032\032\032\032\032\032\032\0328\004"...
>
>         coll_argv = 0x0
>
>         coll_exclude = 0x0
>
>         coll_include = 0x0
>
> #18 0x0000775736a42396 in mca_coll_base_comm_select (comm=0x5597f1b1c020
> <ompi_mpi_comm_world>)
>
>     at base/coll_base_comm_select.c:114
>
>         selectable = <optimized out>
>
>         item = <optimized out>
>
>         which_func = <synthetic pointer>
>
>         ret = <optimized out>
>
> #19 0x0000775736a8f5c3 in ompi_mpi_init (argc=<optimized out>,
> argv=<optimized out>, requested=0,
>
>     provided=provided@entry=0x7fff7b0e0984, reinit_ok=reinit_ok@entry=false)
> at runtime/ompi_mpi_init.c:957
>
>         ret = 0
>
> --Type <RET> for more, q to quit, c to continue without paging--
>
>         procs = 0x5598208bbb30
>
>         nprocs = 1
>
>         error = <optimized out>
>
>         errtrk = {active = false, status = 0}
>
>         info = {super = {obj_class = 0x7757365f4ac0 <opal_list_t_class>,
> obj_reference_count = 1},
>
>           opal_list_sentinel = {super = {obj_class = 0x0,
> obj_reference_count = 0},
>
>             opal_list_next = 0x7fff7b0e08f0, opal_list_prev =
> 0x7fff7b0e08f0, item_free = 0},
>
>           opal_list_length = 0}
>
>         kv = <optimized out>
>
>         active = false
>
>         background_fence = false
>
>         expected = <optimized out>
>
>         desired = 1
>
>         error = <optimized out>
>
> #20 0x0000775736a32b41 in PMPI_Init (argc=0x7fff7b0e09dc,
> argv=0x7fff7b0e09d0) at pinit.c:67
>
>         err = <optimized out>
>
>         provided = 0
>
>         env = <optimized out>
>
>         required = <optimized out>
>
> #21 0x00005597f1b1924d in main (argc=1, argv=0x7fff7b0e0c28) at
> hello_c.c:18
>
>         rank = 0
>
>         size = 64
>
>         len = 0
>
>         version =
> "\371\"\000\000\000\000\000\000\002\000\000\000\000\000\000\000)\003\217\321\000\000\000\000\310\n\016{\377\177\000\000\020\v\016{\377\177\000\000
> ۝\2576Ww\000\000\020\000\000\000\000\000\000\000@
> \000\000\000\000\000\000\000\000\000`\001\000\000\000\000\v\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377@\000\000\000\000\000\000\000\b\000\000\000\000\000\000\000\000\000\270\000\000\000\000\000\000\b\000\000\000\000\000\000\000\000\270\000\000\000\000\000\000\200\000\000\000\000\000\000\000\000p\001\000\000\000\000\000\000p\001\000\000\000\000\000\000\020\000\000\000\000\000\000\200\000\000\000\000\000\000\310\n\016{\377\177\000\000\006\000\000\000U\000\000\000\004%\016{\377\177\000\000\326\354\2606Ww",
> '\000' <repeats 50 times>...
>
>
>
> Collin Strassburger (he/him)
>
>
>
> *From:* 'Collin Strassburger' via Open MPI users <[email protected]>
>
> *Sent:* Tuesday, December 9, 2025 3:40 PM
> *To:* [email protected]
> *Subject:* RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting
>
>
>
> Hello Howard,
>
>
>
> This is the output I get from attaching gdb to it from the 2nd host
> (mpirun --host hades1,hades2
> /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c):
>
> gdb /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c 525423
>
> [generic gdb intro text]
>
>
>
> For help, type "help".
>
> Type "apropos word" to search for commands related to "word"...
>
> Reading symbols from
> /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c...
>
> Attaching to program:
> /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c, process 525423
>
> [New LWP 525427]
>
> [New LWP 525426]
>
> --Type <RET> for more, q to quit, c to continue without paging--
>
> [New LWP 525425]
>
> [New LWP 525424]
>
> [Thread debugging using libthread_db enabled]
>
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>
> 0x000070fffef6b68f in opal_libevent2022_event_base_loop () from
> /opt/hpcx/ompi/lib/libopen-pal.so.40
>
>
>
>
>
> Collin Strassburger (he/him)
>
>
>
> *From:* 'Pritchard Jr., Howard' via Open MPI users <
> [email protected]>
> *Sent:* Tuesday, December 9, 2025 3:27 PM
> *To:* [email protected]
> *Subject:* Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting
>
>
>
> Hello Collin,
>
>
>
> If you can do it, could you try to ssh into one of the nodes where a
> hello_c process is running and attach to it with a debugger and get a
> traceback?
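>
> Something along these lines should do it (substitute the actual node and
> the PID of the hello_c process on that node):
>
>   ssh <node>
>   gdb -p <pid>
>   (gdb) bt full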
>
>
>
> Howard
>
>
>
> *From: *'Collin Strassburger' via Open MPI users <[email protected]
> >
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, December 9, 2025 at 1:19 PM
> *To: *Open MPI Users <[email protected]>
> *Subject: *[EXTERNAL] [OMPI users] Multi-host troubleshooting
>
>
>
> Hello,
>
>
>
> I am dealing with an odd mpi issue that I am unsure how to continue
> diagnosing.
>
>
>
> Following the outline given by
> https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems,
> steps 1-3 complete without any issues:
>
> i.e. ssh remotehost hostname works
>
> Paths include the Nvidia HPCX paths when checked both with ssh and mpirun
>
> mpirun --host node1,node2 hostname works correctly
>
> mpirun --host node1,node2 env | grep -i path yields identical paths which
> include the paths required by HPCX
>
> (This is all through passwordless login)
>
>
>
> Step 4 calls for running mpirun --host node1,node2 hello_c.  I have
> locally compiled the code and confirmed that it works on each machine
> individually.  The same code is shared between the machines.  However, it
> does not run across multiple hosts at once; it simply hangs until
> Ctrl-C’d.  I have attached the --mca plm_base_verbose 10 logs; while I
> don’t see anything in them, I am not well versed enough in Open MPI to be
> confident that I understand their full implications.
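>
> (i.e. a command along the lines of “mpirun --mca plm_base_verbose 10
> --host node1,node2 hello_c”)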
>
>
>
> Notes:
>
> No firewall is present between the machines (minimal install is the base,
> so ufw and iptables are not present by default and have not yet been
> installed)
>
> Journalctl does not report any errors.
>
> The machines have identical hardware and utilized the same configuration
> script.
>
> Calling “mpirun --host node1,node2 mpirun --version” returns identical
> results
>
> Calling “mpirun --host node1,node2 env | grep -i path” returns identical
> results
>
>
>
> OS: Ubuntu 24.04 LTS
>
> OMPI: 4.1.7rc1 from Nvidia HPCX
>
> Configure options:
>
>     --prefix=${HPCX_HOME}/ompi \
>
>     --with-hcoll=${HPCX_HOME}/hcoll \
>
>     --with-ucx=${HPCX_HOME}/ucx \
>
>     --with-platform=contrib/platform/mellanox/optimized \
>
>     --with-tm=/opt/pbs/ \
>
>     --with-slurm=no \
>
>     --with-pmix \
>
>     --with-hwloc=internal
>
>
>
> I’m rather at a loss on what to try/check next.  Any thoughts on how to
> continue troubleshooting this issue?
>
>
>
> Warm regards,
>
> Collin Strassburger (he/him)
>
>

To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
