Re: [OMPI users] mca_sharedfp_lockfile issues

2021-11-02 Thread Gabriel, Edgar via users
What file system are you running your code on? And is the same directory shared across all nodes? I have seen this error if users try to use a non-shared directory for MPI I/O operations (e.g. /tmp, which is a different drive/folder on each node). Thanks, Edgar
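One rough way to test the shared-directory question is to create a marker file from the launching node and ask one process per node whether it can see it. This is only a sketch: the `check_shared` helper and the marker file name are made up for illustration, and it assumes `mpirun` already knows which nodes to use (for example via a hostfile or a batch allocation).

```shell
# check_shared DIR (hypothetical helper): create a marker file in DIR on the
# launching node, then let one process per node report whether it sees it.
check_shared() {
    dir=$1
    marker="$dir/.shared_check.$$"
    : > "$marker" || return 1
    if command -v mpirun >/dev/null 2>&1; then
        # One process per node; each prints its hostname and the result.
        mpirun --map-by ppr:1:node sh -c \
            'if [ -e "$1" ]; then echo "$(hostname): visible"; else echo "$(hostname): MISSING"; fi' \
            sh "$marker"
    else
        echo "mpirun not found; cannot check other nodes"
    fi
    rm -f "$marker"
}
```

Calling `check_shared` on the directory the job writes to should print one line per node. A node reporting MISSING would suggest the directory is node-local, as /tmp usually is.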

[OMPI users] mca_sharedfp_lockfile issues

2021-11-02 Thread bend linux4ms.net via users
Ok, I got more issues. Maybe someone on the list can help me: Open MPI version: 4.1.1, downloaded from GitHub source. Compiled on CentOS 8.4 using GCC 8.4.1. Configured with: ./configure --enable-shared --enable-static \ --without-tm \ --enable-mpi-cxx \ --enable-wrapper-runpath \ --enable-

Re: [OMPI users] [EXTERNAL] strange pml error

2021-11-02 Thread Shrader, David Lee via users
As a workaround for now, I have found that setting OMPI_MCA_pml=ucx seems to get around this issue. I'm not sure why this works, but perhaps there is different initialization that happens such that the offending device search problem doesn't occur? Thanks, David
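For reference, the workaround can be applied either through the environment or on the command line. In this sketch the `./xhpl` path and the rank count are placeholders, not taken from the thread, and the launch is guarded so the snippet is harmless on a machine without Open MPI.

```shell
# Force the UCX PML for every mpirun started from this shell.
export OMPI_MCA_pml=ucx

# Equivalent per-invocation form. ./xhpl and -np 4 are placeholders.
if command -v mpirun >/dev/null 2>&1 && [ -x ./xhpl ]; then
    mpirun --mca pml ucx -np 4 ./xhpl
fi
```

With the PML forced this way, a rank that cannot use UCX should fail at startup rather than fall back to ob1, which at least makes the device-detection problem visible on the rank where it occurs.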

Re: [OMPI users] [EXTERNAL] strange pml error

2021-11-02 Thread Shrader, David Lee via users
I too have been getting this using 4.1.1, but not with the master nightly tarballs from mid-October. I still have it on my to-do list to open a GitHub issue. The problem seems to come from device detection in the ucx pml: on some ranks, it fails to find a device and thus the ucx pml disqualifies

[OMPI users] strange pml error

2021-11-02 Thread Michael Di Domenico via users
Fairly frequently, but not every time, when trying to run xhpl on a new machine I'm bumping into this. It happens with a single node or multiple nodes: "node1 selected pml ob1, but peer on node1 selected pml ucx". If I rerun the exact same command a few minutes later, it works fine. The machine is new