Thank you for the bug report. Could you please create an issue here:
https://github.com/open-mpi/ompi/issues ?
I will take a look, but it's easier to keep track if there is an issue
associated with the request.
Best regards
Edgar
From: 'Cooper Burns' via Open MPI users
Sent: Wednesday, Decem
I did more digging and you are correct; after updating another node (4), nodes
1 and 4 are happy to run together while 2 has an issue. Thanks, George!
Now that I see a “correct” UCC_LOG_LEVEL=info run that has each node reporting
the ucc_constructor, I can see how you could tell. I’ll be sure t
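(For reference, a run of that kind can be launched by exporting the variable to every host, e.g. something like
mpirun -x UCC_LOG_LEVEL=info --host hades1,hades2 ./my_app
where ./my_app is just a placeholder for the actual test program.)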
Hello all -
I am running into an issue with Open MPI 5.0.6 and 5.0.9 (latest) that looks
like a bug in the MPI_File_open API.
If I try to open a file where the provided filename is very long (on my
system, anything longer than about 262 characters triggers the issue), the
MPI_File_open call just segfau
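A minimal sketch of a reproducer is below (the 300-character all-'a' filename is just an illustrative stand-in for my real path; any name past the ~262-character mark should do):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Build an illustrative ~300-character filename in the current
     * directory (not the actual path from the original report). */
    char path[512];
    memset(path, 'a', 300);
    path[300] = '\0';

    /* Attempt to create/open the file collectively over MPI_COMM_WORLD. */
    MPI_File fh;
    int rc = MPI_File_open(MPI_COMM_WORLD, path,
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh);
    if (rc == MPI_SUCCESS) {
        MPI_File_close(&fh);
        printf("MPI_File_open succeeded\n");
    } else {
        /* File handles default to MPI_ERRORS_RETURN, so report the error. */
        char errstr[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, errstr, &len);
        printf("MPI_File_open failed: %s\n", errstr);
    }

    MPI_Finalize();
    return 0;
}

Compiling with mpicc and running a single rank is enough to exercise the call.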
This conclusion is not really accurate. Based on the provided logs, UCC
works as expected: it disables all CUDA-related modules when no CUDA
library is available (not when no devices are available).
For me, the correct conclusion is that without restricting the collective
modules to be used, Open
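(To make the "restricting" part concrete: Open MPI's MCA selection syntax can exclude a component, e.g. something along the lines of
mpirun --mca coll ^ucc --host hades1,hades2 ./my_app
with ./my_app as a placeholder for the application, which takes the UCC collective component out of the selection entirely.)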
For anyone who comes across this information in the future while doing their
own troubleshooting: this appears to be a buggy NVIDIA implementation/configuration
of UCC that does not operate correctly without NVIDIA GPU devices installed,
despite these packages being recommended for anyone using
There you go: the misconfiguration of the second host prevents UCC, and then
OMPI, from properly loading its dependencies. As a result, one host has UCC
support and will call the collectives through UCC (or at least try to), while
the second host will redirect all collectives to the Open MPI tuned module.
O
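(If it helps to confirm which collective component each host actually selects, raising the coll framework verbosity should log the per-communicator selection on every rank, e.g. something like
mpirun --mca coll_base_verbose 10 --host hades1,hades2 ./my_app
with ./my_app again a placeholder.)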
Hello Joachim,
I had a similar thought (about it being only 1 node) when I first saw the
message. It appears to be a reporting issue rather than an actual difference
between the nodes.
Here’s the output of the command:
mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
Hi Collin,
On 10.12.25 at 15:36, 'Collin Strassburger' via Open MPI users wrote:
/opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open
shared object file: No such file or directory)
Is it only the second host that cannot find libcuda.so? Do you have the
library installed on both
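(One quick way to check, for example, is to query the dynamic linker cache on each host:
ldconfig -p | grep libcuda.so.1
run separately on hades1 and hades2, or launched on both hosts at once via mpirun.)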
Hello George,
Running with the UCC log level set to info results in:
[1765376551.284834] [hades2:544873:0] ucc_constructor.c:188 UCC INFO
version: 1.4.0, loaded from: /opt/hpcx/ucc/lib/libucc.so.1, cfg file:
/opt/hpcx/ucc/share/ucc.conf
Running with debug is shown below:
mpirun --host hades1,had