RE: [OMPI users] Long file paths causing segfault

2025-12-10 Thread Edgar Gabriel
Thank you for the bug report. Could you please create an issue here: https://github.com/open-mpi/ompi/issues ? I will take a look, but it's easier to keep track if there is an issue associated with the request. Best regards, Edgar …

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-10 Thread 'Collin Strassburger' via Open MPI users
I did more digging and you are correct; after updating another node (4), nodes 1 and 4 are happy to run together while 2 has an issue. Thanks, George! Now that I see a “correct” UCC_LOG_LEVEL=info run that has each node reporting the ucc_constructor, I can see how you could tell. I’ll be sure to …
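For reference, an info-level UCC run like the one described above can be requested by exporting the variable to every rank through mpirun (a sketch; the host list and ./my_app are placeholders for the actual setup):

mpirun --host hades1,hades2 -x UCC_LOG_LEVEL=info ./my_app

Each node on which UCC initializes correctly should then print its own ucc_constructor line.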

[OMPI users] Long file paths causing segfault

2025-12-10 Thread 'Cooper Burns' via Open MPI users
Hello all - I am running into an issue with Open MPI 5.0.6 and 5.0.9 (latest) that looks like a bug in the MPI_File_open API. If I try to open a file whose filename is very long (on my system, anything above roughly 262 characters triggers it), the MPI_File_open call simply segfaults …
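A minimal sketch of a reproducer for this report (the ~300-character path and the relative filename are assumptions based on the description above; the I/O component in use may change the behavior):

/* Reproducer sketch: MPI_File_open with a very long filename. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Build a ~300-character filename in the current directory. */
    char path[1024];
    memset(path, 'a', sizeof(path));
    path[0] = '.';
    path[1] = '/';
    path[300] = '\0';

    MPI_File fh;
    int rc = MPI_File_open(MPI_COMM_WORLD, path,
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh);
    if (rc == MPI_SUCCESS) {
        MPI_File_close(&fh);
    } else {
        printf("MPI_File_open returned error code %d\n", rc);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpirun -n 1, this should either return an error code or reproduce the reported crash, depending on the installation.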

Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-10 Thread 'George Bosilca' via Open MPI users
This conclusion is not really accurate. Based on the provided logs, UCC works as expected: it disables all CUDA-related modules when no CUDA library is available (not when no devices are available). To me, the correct conclusion is that without restricting the collective modules to be used, Open MPI …
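One way to restrict the collective modules along these lines is to exclude the UCC collective component on the mpirun command line (a sketch; ./my_app is a placeholder for the actual application):

mpirun --host hades1,hades2 --mca coll ^ucc ./my_app

With UCC excluded on both hosts, every rank falls back to the same set of collective components.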

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-10 Thread 'Collin Strassburger' via Open MPI users
For anyone who comes across this information in the future while doing their own troubleshooting: this appears to be a buggy NVIDIA implementation/configuration of UCC that does not operate correctly without NVIDIA GPU devices installed, despite these packages being recommended for anyone using …

Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-10 Thread 'George Bosilca' via Open MPI users
There you go: the misconfiguration of the second host prevents UCC, and then OMPI, from properly loading its dependencies. As a result, one host has UCC support and will issue collectives through UCC (or at least try to), while the second host will redirect all collectives to the Open MPI tuned module. …
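To see which collective component each host actually selects, the coll framework's verbosity can be raised (a sketch; the exact messages differ between Open MPI versions):

mpirun --host hades1,hades2 --mca coll_base_verbose 10 ./my_app

If one host reports the ucc component while the other only reports tuned, the ranks are using mismatched collective implementations, which matches the behavior described here.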

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-10 Thread 'Collin Strassburger' via Open MPI users
Hello Joachim, I had a similar thought (about it being only one node) when I first saw the message. It appears to be a reporting issue rather than an actual difference between the nodes. Here’s the output of the command: mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so …
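When gathering per-host ldd output like this, tagging each line with the rank that produced it removes the ambiguity about which node reported what (a sketch; on Open MPI 5.x the option may be spelled --output tag instead):

mpirun --host hades1,hades2 --tag-output ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so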

Re: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-10 Thread 'Joachim Jenke' via Open MPI users
Hi Collin, On 10.12.25 at 15:36, 'Collin Strassburger' via Open MPI users wrote: /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open shared object file: No such file or directory) Is it only the second host that cannot find libcuda.so? Do you have the library installed on both hosts? …
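A quick per-host check for the missing library is to query the dynamic linker cache on each node (a sketch; ldconfig may live in /sbin, and output from the two hosts can interleave):

mpirun --host hades1,hades2 sh -c 'hostname; ldconfig -p | grep libcuda || echo libcuda not found'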

RE: [EXTERNAL] [OMPI users] Multi-host troubleshooting

2025-12-10 Thread 'Collin Strassburger' via Open MPI users
Hello George, Running with the UCC log level at info results in: [1765376551.284834] [hades2:544873:0] ucc_constructor.c:188 UCC INFO version: 1.4.0, loaded from: /opt/hpcx/ucc/lib/libucc.so.1, cfg file: /opt/hpcx/ucc/share/ucc.conf Running with debug is shown below: mpirun --host hades1,hades2 …