There you go: the misconfiguration of the second host prevents UCC, and in turn OMPI, from properly loading its dependencies. As a result, one host has UCC support and will call collectives through UCC (or at least try to), while the second host will redirect all collectives to the Open MPI tuned module. Open MPI cannot run in such an asymmetric setup.
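
To see whether the hosts really differ, it can help to reduce the ldd output on each node to just the unresolved libraries and compare the lists. A minimal sketch, assuming a POSIX shell and awk on every node; `missing_libs` is a hypothetical helper name, not something shipped with Open MPI, UCC, or HPC-X:

```shell
# Hypothetical helper: read ldd output on stdin and print only the names
# of the shared libraries that the dynamic loader could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Assumed usage on one node (plugin path taken from this thread):
#   ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so | missing_libs
```

If the list is empty on one node and not on the other, the setup is asymmetric in the way described above. If it is the same on both, the loader environment (for example LD_LIBRARY_PATH in non-interactive shells) is the more likely place to look.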
George.

On Wed, Dec 10, 2025 at 10:33 AM 'Collin Strassburger' via Open MPI users <[email protected]> wrote:

> Hello Joachim,
>
> I had a similar thought (about it being only 1 node) when I first saw the
> message. It appears to be a reporting issue rather than an actual
> difference between the nodes.
> Here’s the output of the command:
> mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
>     linux-vdso.so.1 (0x00007ffd343f5000)
>     libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007509b52a6000)
>     libcuda.so.1 => not found
>     libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007509b4e00000)
>     libnvidia-ml.so.1 => not found
>     libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x00007509b525b000)
>     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007509b4a00000)
>     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007509b5172000)
>     libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x00007509b5154000)
>     /lib64/ld-linux-x86-64.so.2 (0x00007509b5338000)
>     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007509b514f000)
>     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007509b514a000)
>     librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007509b5143000)
>     linux-vdso.so.1 (0x00007ffc0379e000)
>     libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007625012a5000)
>     libcuda.so.1 => not found
>     libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x0000762500e00000)
>     libnvidia-ml.so.1 => not found
>     libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x000076250125a000)
>     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000762500a00000)
>     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000762501171000)
>     libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x0000762501153000)
>     /lib64/ld-linux-x86-64.so.2 (0x0000762501337000)
>     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000076250114e000)
>     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000762501149000)
>     librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000762501142000)
>
> Given these results indicating that libcuda.so.1 cannot be found, I think
> I'll check that the cuda LD paths are being sourced correctly.
>
> Warm regards,
> Collin Strassburger (he/him)
>
> -----Original Message-----
> From: 'Joachim Jenke' via Open MPI users <[email protected]>
> Sent: Wednesday, December 10, 2025 10:12 AM
> To: [email protected]
> Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting
>
> Hi Collin,
>
> On 10.12.25 at 15:36, 'Collin Strassburger' via Open MPI users wrote:
> > /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open
> > shared object file: No such file or directory)
>
> Is it only the second host that cannot find libcuda.so? Do you have the
> library installed on both nodes?
>
> What is the output for:
>
> mpirun --hosts node1,node2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
>
> - Joachim
> --
> Dr. rer. nat. Joachim Jenke
> Deputy Group Lead
>
> IT Center
> Group: HPC - Parallelism, Runtime Analysis & Machine Learning
> Division: Computational Science and Engineering
> RWTH Aachen University
> Seffenter Weg 23
> D 52074 Aachen (Germany)
> Tel: +49 241 80- 24765
> Fax: +49 241 80-624765
> [email protected]
> www.itc.rwth-aachen.de
>
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> ________________________________
> The information contained in this e-mail and any attachments from Bihrle
> Applied Research may contain confidential and/or proprietary information,
> and is intended only for the named recipient to whom it was originally
> addressed. If you are not the intended recipient, any disclosure,
> distribution, or copying of this e-mail or its attachments is strictly
> prohibited. If you have received this e-mail in error, please notify the
> sender immediately by return e-mail and permanently delete the e-mail and
> any attachments.
