I did more digging and you are correct; after updating another node (4), nodes 
1 and 4 are happy to run together while node 2 still has an issue.  Thanks, George!
Now that I see a “correct” UCC_LOG_LEVEL=info run that has each node reporting 
the ucc_constructor, I can see how you could tell.  I’ll be sure to note that 
down for the future.
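
For future readers, a minimal sketch of that check (hades1/hades2 and ./my_app 
are placeholder host and application names taken from this thread; -x exports 
the variable to every rank):

mpirun --host hades1,hades2 -x UCC_LOG_LEVEL=info ./my_app

The idea is that on a correctly configured set of nodes every rank prints a 
matching ucc_constructor line, while a node that fails to load UCC does not.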

Collin Strassburger (he/him)

From: 'George Bosilca' via Open MPI users <[email protected]>
Sent: Wednesday, December 10, 2025 3:32 PM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

This conclusion is not really accurate. Based on the provided logs, UCC works as 
expected: it disables all CUDA-related modules when no CUDA library is 
available (not merely when no devices are present).

For me, the correct conclusion is that, without restricting the collective 
modules to be used, Open MPI should not be executed on asymmetric setups where 
different nodes have different hardware/software available.
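
A minimal sketch of such a restriction (hades1/hades2 and ./my_app are 
placeholders; ^ucc excludes the UCC component from Open MPI's coll framework so 
every rank falls back to the same built-in collectives):

mpirun --host hades1,hades2 --mca coll ^ucc ./my_app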

  George.


On Wed, Dec 10, 2025 at 3:20 PM 'Collin Strassburger' via Open MPI users 
<[email protected]> wrote:
For anyone who comes across this information in the future while doing their 
own troubleshooting: this appears to be a buggy NVIDIA implementation/configuration 
of UCC that does not operate correctly without NVIDIA GPU devices installed, 
despite these packages being recommended for anyone using NVIDIA IB.

Your collective assistance in helping me troubleshoot the issue was greatly 
appreciated.
(No further assistance is requested)

Collin Strassburger (he/him)

From: 'George Bosilca' via Open MPI users <[email protected]>
Sent: Wednesday, December 10, 2025 10:42 AM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

There you go: the misconfiguration of the second host prevents UCC, and then 
OMPI, from properly loading its dependencies. As a result, one host has UCC 
support and will run the collectives through UCC (or at least try to), while 
the second host will redirect all collectives to the Open MPI tuned module. 
Open MPI cannot run in such an asymmetric setup.
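
One way to observe that asymmetry directly (a hedged sketch; hades1/hades2 and 
./my_app are placeholders, and coll_base_verbose raises the collective 
framework's selection logging on every rank):

mpirun --host hades1,hades2 --mca coll_base_verbose 10 ./my_app

Each rank then reports which coll components it considers and selects, so a 
node that picks ucc while the other only picks tuned stands out immediately.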

  George.


On Wed, Dec 10, 2025 at 10:33 AM 'Collin Strassburger' via Open MPI users 
<[email protected]> wrote:
Hello Joachim,

I had a similar thought (about it being only 1 node) when I first saw the 
message.  It appears to be a reporting issue rather than an actual difference 
between the nodes.
Here’s the output of the command (the two hosts’ output appears back to back):
mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
        linux-vdso.so.1 (0x00007ffd343f5000)
        libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007509b52a6000)
        libcuda.so.1 => not found
        libcudart.so.12 => 
/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007509b4e00000)
        libnvidia-ml.so.1 => not found
        libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x00007509b525b000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007509b4a00000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007509b5172000)
        libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x00007509b5154000)
        /lib64/ld-linux-x86-64.so.2 (0x00007509b5338000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007509b514f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x00007509b514a000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007509b5143000)
        linux-vdso.so.1 (0x00007ffc0379e000)
        libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007625012a5000)
        libcuda.so.1 => not found
        libcudart.so.12 => 
/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x0000762500e00000)
        libnvidia-ml.so.1 => not found
        libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x000076250125a000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000762500a00000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000762501171000)
        libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x0000762501153000)
        /lib64/ld-linux-x86-64.so.2 (0x0000762501337000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000076250114e000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x0000762501149000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000762501142000)

Given that these results indicate libcuda.so.1 cannot be found, I think I'll 
check that the CUDA LD paths are being sourced correctly.
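
A quick way to check that on both nodes at once (a hedged sketch; ldconfig -p 
lists what the dynamic linker can currently resolve):

mpirun --host hades1,hades2 sh -c 'echo "== $(hostname)"; ldconfig -p | grep -E "libcuda|libnvidia-ml" || echo "not in the linker cache"'

Note that libcuda.so.1 and libnvidia-ml.so.1 normally ship with the NVIDIA 
driver rather than the CUDA toolkit, so adjusting LD paths alone may not help 
on a node without the driver installed.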

Warm regards,
Collin Strassburger (he/him)

-----Original Message-----
From: 'Joachim Jenke' via Open MPI users <[email protected]>
Sent: Wednesday, December 10, 2025 10:12 AM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

Hi Collin,

On 10.12.25 at 15:36, 'Collin Strassburger' via Open MPI users wrote:
> /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open
> shared object file: No such file or directory)

Is it only the second host that cannot find libcuda.so? Do you have the
library installed on both nodes?

What is the output for:

mpirun --host node1,node2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so

- Joachim
--
Dr. rer. nat. Joachim Jenke
Deputy Group Lead

IT Center
Group: HPC - Parallelism, Runtime Analysis & Machine Learning
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
[email protected]
www.itc.rwth-aachen.de
