What file system are you running your code on? And is the same directory
shared across all nodes? I have seen this error when users try to use a
non-shared directory for MPI I/O operations (e.g. /tmp, which is a different
drive/folder on each node).
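One rough way to check is to look at the filesystem type behind the directory. The sketch below assumes a Linux node with GNU coreutils `stat`. The helper names `fs_type` and `is_probably_shared`, the `MPI_IO_DIR` variable, and the short list of "shared" filesystem types are illustrative assumptions only; a site may well use a different shared filesystem.

```shell
#!/bin/sh
# Hypothetical helper: print the filesystem type backing a directory
# (relies on GNU coreutils `stat -f -c %T`).
fs_type() {
    stat -f -c %T "$1"
}

# Heuristic only: the list of "shared" types is an assumption and is
# not exhaustive. Returns 0 when the type looks like a network or
# parallel filesystem, 1 otherwise.
is_probably_shared() {
    case "$(fs_type "$1")" in
        nfs|lustre|gpfs) return 0 ;;
        *) return 1 ;;
    esac
}

# MPI_IO_DIR is a placeholder for the directory used for MPI I/O.
dir="${MPI_IO_DIR:-/tmp}"
if is_probably_shared "$dir"; then
    echo "$dir: $(fs_type "$dir") (likely shared)"
else
    echo "$dir: $(fs_type "$dir") (likely node-local)"
fi
```

Running this on each node (for example through the job launcher) and comparing the output gives a first hint. It does not prove that every node mounts the same export.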
Thanks
Edgar
-Original Message-
From
OK, I have run into more issues. Maybe someone on the list can help me:
Open MPI version: 4.1.1, downloaded as source from GitHub
Compiled on CentOS 8.4 using GCC 8.4.1
Configured with:
./configure --enable-shared --enable-static \
--without-tm \
--enable-mpi-cxx \
--enable-wrapper-runpath \
--enable-
As a workaround for now, I have found that setting OMPI_MCA_pml=ucx seems to
get around this issue. I'm not sure why this works, but perhaps a
different initialization path is taken, such that the offending device search
problem doesn't occur?
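For anyone wanting to try the same workaround, a minimal sketch follows. The application name `./my_app` and the rank count are placeholders, not taken from the thread. As far as I understand, requesting a specific PML this way makes the job abort if UCX cannot be used on some rank, rather than letting that rank silently pick a different PML.

```shell
#!/bin/sh
# Request the UCX PML on every rank through the environment.
# Open MPI reads MCA parameters from variables named OMPI_MCA_<param>.
export OMPI_MCA_pml=ucx

# Equivalent command-line form (./my_app and -np 4 are placeholders):
#   mpirun --mca pml ucx -np 4 ./my_app

echo "OMPI_MCA_pml is set to: $OMPI_MCA_pml"
```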
Thanks,
David
I too have been getting this using 4.1.1, but not with the master nightly
tarballs from mid-October. I still have it on my to-do list to open a GitHub
issue. The problem seems to come from device detection in the ucx pml: on some
ranks, it fails to find a device and thus the ucx pml disqualifies
Fairly frequently, but not every time, when trying to run xhpl on a new
machine I'm bumping into this. It happens with a single node or
multiple nodes:
node1 selected pml ob1, but peer on node1 selected pml ucx
If I rerun the exact same command a few minutes later, it works fine.
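Since the failure is intermittent, it may help to capture more detail on the runs that do fail. The sketch below is only a suggestion under stated assumptions: it raises the verbosity of the PML framework so the selection is logged, and lists the devices UCX reports if the `ucx_info` tool happens to be installed. The `mpirun` line is left as a comment with placeholder arguments.

```shell
#!/bin/sh
# Ask the PML framework to log more about component selection.
export OMPI_MCA_pml_base_verbose=10

# If UCX's ucx_info tool is on the PATH, show the transports and
# devices it reports on this node. `|| true` keeps the script going
# when grep finds no matching lines.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | grep -E 'Transport|Device' || true
fi

# Then launch as usual (-np 4 and ./xhpl are placeholders):
#   mpirun -np 4 ./xhpl
```

Comparing the log of a failing run with that of a working run a few minutes later should show which rank fell back to ob1.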
the machine is new