Hello Joachim,

I had a similar thought (about it being only 1 node) when I first saw the 
message.  It appears to be a reporting issue rather than an actual difference 
between the nodes.
Here’s the output of the command:
mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
        linux-vdso.so.1 (0x00007ffd343f5000)
        libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007509b52a6000)
        libcuda.so.1 => not found
        libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007509b4e00000)
        libnvidia-ml.so.1 => not found
        libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x00007509b525b000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007509b4a00000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007509b5172000)
        libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x00007509b5154000)
        /lib64/ld-linux-x86-64.so.2 (0x00007509b5338000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007509b514f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007509b514a000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007509b5143000)
        linux-vdso.so.1 (0x00007ffc0379e000)
        libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007625012a5000)
        libcuda.so.1 => not found
        libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x0000762500e00000)
        libnvidia-ml.so.1 => not found
        libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x000076250125a000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000762500a00000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000762501171000)
        libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x0000762501153000)
        /lib64/ld-linux-x86-64.so.2 (0x0000762501337000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000076250114e000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000762501149000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000762501142000)

Since both hosts report libcuda.so.1 (and libnvidia-ml.so.1) as not found, 
I'll check that the CUDA library paths (LD_LIBRARY_PATH) are being sourced 
correctly on each node.
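
For reference, here is a rough sketch of how that check could be scripted. It 
assumes a POSIX shell with awk on the nodes. The helper names missing_from_ldd 
and find_in_ld_path are hypothetical, made up for illustration, and are not 
part of Open MPI or HPC-X.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only (not part of Open MPI or HPC-X).

# Read ldd output on stdin and print each library reported as "not found",
# one name per line, with duplicates removed.
missing_from_ldd() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }' | sort -u
}

# Look for a library file name in each directory of a colon-separated list.
# The list defaults to $LD_LIBRARY_PATH. Prints the first match and returns 0,
# or prints nothing and returns 1 if no directory contains the file.
find_in_ld_path() {
    lib=$1
    paths=${2-${LD_LIBRARY_PATH-}}
    old_ifs=$IFS
    IFS=:
    for dir in $paths; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$lib"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Usage, with the hosts and library path from the command above:
#   mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so | missing_from_ldd
#   mpirun --host hades1,hades2 sh -c 'hostname; echo "$LD_LIBRARY_PATH"'
#   find_in_ld_path libcuda.so.1    # run locally on each node
```

The functions are only defined in the local shell, so find_in_ld_path has to be 
run on each node directly. It also only inspects LD_LIBRARY_PATH, so a library 
that the loader would find through its cache or default directories would not 
show up here.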

Warm regards,
Collin Strassburger (he/him)

-----Original Message-----
From: 'Joachim Jenke' via Open MPI users <[email protected]>
Sent: Wednesday, December 10, 2025 10:12 AM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

Hi Collin,

Am 10.12.25 um 15:36 schrieb 'Collin Strassburger' via Open MPI users:
> /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open
> shared object file: No such file or directory)

Is it only the second host that cannot find libcuda.so? Do you have the
library installed on both nodes?

What is the output for:

mpirun --hosts node1,node2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so

- Joachim
--
Dr. rer. nat. Joachim Jenke
Deputy Group Lead

IT Center
Group: HPC - Parallelism, Runtime Analysis & Machine Learning
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
[email protected]
www.itc.rwth-aachen.de

To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
________________________________
The information contained in this e-mail and any attachments from Bihrle 
Applied Research may contain confidential and/or proprietary information, and 
is intended only for the named recipient to whom it was originally addressed. 
If you are not the intended recipient, any disclosure, distribution, or copying 
of this e-mail or its attachments is strictly prohibited. If you have received 
this e-mail in error, please notify the sender immediately by return e-mail and 
permanently delete the e-mail and any attachments.
