For anyone who comes across this information in the future while doing their own troubleshooting: this appears to be a buggy NVIDIA implementation/configuration of UCC that does not operate correctly without NVIDIA GPU devices installed, even though these packages are recommended for anyone using NVIDIA IB.
Your collective assistance in helping me troubleshoot the issue was greatly appreciated.
(No further assistance is requested)

Collin Strassburger (he/him)

From: 'George Bosilca' via Open MPI users <[email protected]>
Sent: Wednesday, December 10, 2025 10:42 AM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

There you go, the misconfiguration of the second host prevents UCC, then OMPI, from properly loading its dependencies. As a result, one host has UCC support and will call the collective through UCC (or at least try), while the second host will redirect all collectives to the Open MPI tuned module. Open MPI cannot run in such an asymmetric setup.

George.

On Wed, Dec 10, 2025 at 10:33 AM 'Collin Strassburger' via Open MPI users <[email protected]> wrote:

Hello Joachim,

I had a similar thought (about it being only 1 node) when I first saw the message. It appears to be a reporting issue rather than an actual difference between the nodes.

Here’s the output of the command:

mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
    linux-vdso.so.1 (0x00007ffd343f5000)
    libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007509b52a6000)
    libcuda.so.1 => not found
    libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007509b4e00000)
    libnvidia-ml.so.1 => not found
    libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x00007509b525b000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007509b4a00000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007509b5172000)
    libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x00007509b5154000)
    /lib64/ld-linux-x86-64.so.2 (0x00007509b5338000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007509b514f000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007509b514a000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007509b5143000)
    linux-vdso.so.1 (0x00007ffc0379e000)
    libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007625012a5000)
    libcuda.so.1 => not found
    libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x0000762500e00000)
    libnvidia-ml.so.1 => not found
    libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x000076250125a000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000762500a00000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000762501171000)
    libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x0000762501153000)
    /lib64/ld-linux-x86-64.so.2 (0x0000762501337000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000076250114e000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000762501149000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000762501142000)

Given these results indicating that libcuda.so.1 cannot be found, I think I'll check that the CUDA LD paths are being sourced correctly.

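
The comparison of the two ldd blocks above can also be done mechanically. What follows is only a minimal Python sketch, not something taken from the thread: it assumes the combined output of the mpirun/ldd command has been captured as text, and that each process's block begins with a "linux-vdso" line, as it does in the output above. The names split_blocks, missing_libs and report are hypothetical helpers introduced for illustration.

```python
import re

# Matches ldd lines such as "libcuda.so.1 => not found".
NOT_FOUND = re.compile(r"^\s*(\S+)\s+=>\s+not found\s*$")


def split_blocks(ldd_text):
    """Split combined ldd output into one list of lines per process.

    Assumption: every block starts with a line containing 'linux-vdso'.
    """
    blocks = []
    for line in ldd_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if "linux-vdso" in line or not blocks:
            blocks.append([])
        blocks[-1].append(line)
    return blocks


def missing_libs(block):
    """Return the sorted names of libraries that ldd reported as not found."""
    names = set()
    for line in block:
        match = NOT_FOUND.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


def report(ldd_text):
    """Return (missing libraries per block, whether all blocks agree)."""
    per_block = [missing_libs(block) for block in split_blocks(ldd_text)]
    symmetric = all(libs == per_block[0] for libs in per_block)
    return per_block, symmetric


# Abbreviated sample taken from the ldd output quoted in this thread.
sample = """\
linux-vdso.so.1 (0x00007ffd343f5000)
libcuda.so.1 => not found
libnvidia-ml.so.1 => not found
libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x00007509b525b000)
linux-vdso.so.1 (0x00007ffc0379e000)
libcuda.so.1 => not found
libnvidia-ml.so.1 => not found
libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x000076250125a000)
"""

per_block, symmetric = report(sample)
for index, libs in enumerate(per_block):
    print(f"block {index}: missing {libs}")
print("all blocks agree:", symmetric)
```

One caveat with this approach: output from several ranks may be interleaved by mpirun, in which case splitting on "linux-vdso" is unreliable and running ldd on each host separately would be the safer input.
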
Warm regards,
Collin Strassburger (he/him)

-----Original Message-----
From: 'Joachim Jenke' via Open MPI users <[email protected]>
Sent: Wednesday, December 10, 2025 10:12 AM
To: [email protected]
Subject: Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

Hi Collin,

On 10.12.25 at 15:36, 'Collin Strassburger' via Open MPI users wrote:
> /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open
> shared object file: No such file or directory)

Is it only the second host that cannot find libcuda.so? Do you have the library installed on both nodes?

What is the output for:

mpirun --hosts node1,node2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so

- Joachim

--
Dr. rer. nat. Joachim Jenke
Deputy Group Lead
IT Center
Group: HPC - Parallelism, Runtime Analysis & Machine Learning
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
[email protected]
www.itc.rwth-aachen.de

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

________________________________
The information contained in this e-mail and any attachments from Bihrle Applied Research may contain confidential and/or proprietary information, and is intended only for the named recipient to whom it was originally addressed. If you are not the intended recipient, any disclosure, distribution, or copying of this e-mail or its attachments is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by return e-mail and permanently delete the e-mail and any attachments.
