Hello Joachim,
When I first saw the message, I also thought only one node was affected. It
appears to be a reporting issue rather than an actual difference between the
nodes: both report the same missing libraries.
Here’s the output of the command:
mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
linux-vdso.so.1 (0x00007ffd343f5000)
libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007509b52a6000)
libcuda.so.1 => not found
libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007509b4e00000)
libnvidia-ml.so.1 => not found
libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x00007509b525b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007509b4a00000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007509b5172000)
libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x00007509b5154000)
/lib64/ld-linux-x86-64.so.2 (0x00007509b5338000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007509b514f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007509b514a000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007509b5143000)
linux-vdso.so.1 (0x00007ffc0379e000)
libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007625012a5000)
libcuda.so.1 => not found
libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x0000762500e00000)
libnvidia-ml.so.1 => not found
libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x000076250125a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000762500a00000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000762501171000)
libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x0000762501153000)
/lib64/ld-linux-x86-64.so.2 (0x0000762501337000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000076250114e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000762501149000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000762501142000)
Since both nodes report that libcuda.so.1 (and libnvidia-ml.so.1) cannot be
found, I'll check that the CUDA library paths (e.g. LD_LIBRARY_PATH) are being
set correctly on each node.
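To make the ldd output easier to compare across nodes, a small filter along these lines might help. This is only a sketch: the function name missing_libs is made up for illustration, and it assumes ldd marks unresolved libraries with "=> not found", as in the output above.

```shell
# Hypothetical helper (the name is illustrative, not part of any tool).
# Reads ldd output on stdin and prints the name of every library that
# ldd reports as unresolved, one per line.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Possible usage, not run here; it assumes the same hosts and library path
# as the command above:
#   mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so \
#       | missing_libs | sort | uniq -c
```

Piping through sort and uniq -c would count how often each library is reported missing. With two hosts, a count of 2 for a library would suggest it is unresolved on both nodes rather than on just one.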
Warm regards,
Collin Strassburger (he/him)
-----Original Message-----
From: 'Joachim Jenke' via Open MPI users <[email protected]>
Sent: Wednesday, December 10, 2025 10:12 AM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting
Hi Collin,
Am 10.12.25 um 15:36 schrieb 'Collin Strassburger' via Open MPI users:
> /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open
> shared object file: No such file or directory)
Is it only the second host that cannot find libcuda.so? Do you have the
library installed on both nodes?
What is the output for:
mpirun --hosts node1,node2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
- Joachim
--
Dr. rer. nat. Joachim Jenke
Deputy Group Lead
IT Center
Group: HPC - Parallelism, Runtime Analysis & Machine Learning
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
[email protected]
www.itc.rwth-aachen.de
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
________________________________
The information contained in this e-mail and any attachments from Bihrle
Applied Research may contain confidential and/or proprietary information, and
is intended only for the named recipient to whom it was originally addressed.
If you are not the intended recipient, any disclosure, distribution, or copying
of this e-mail or its attachments is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by return e-mail and
permanently delete the e-mail and any attachments.