This seems to have helped me as well, at least so far. I still have a lot more testing to do.
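For anyone else who wants to try it, here is a minimal sketch of the workaround quoted below, assuming a bash-like shell. The launch lines are commented-out placeholders (assumptions), not the exact commands used on this machine:

```shell
# Select the UCX PML for every rank launched from this shell.
# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables,
# so this is equivalent to passing "--mca pml ucx" on the mpirun command line.
export OMPI_MCA_pml=ucx

# Placeholder launch lines (assumptions, adjust to your own job):
# srun ./xhpl
# mpirun --mca pml ucx ./xhpl

# Show the setting that launched processes will inherit.
echo "OMPI_MCA_pml=${OMPI_MCA_pml}"
```

With the PML pinned this way, a rank that cannot bring up ucx should abort with an error rather than quietly falling back to ob1, which at least makes the failure easier to spot than the mismatch message.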

On Tue, Nov 2, 2021 at 4:15 PM Shrader, David Lee <dshra...@lanl.gov> wrote:
>
> As a workaround for now, I have found that setting OMPI_MCA_pml=ucx seems to
> get around this issue. I'm not sure why this works, but perhaps there is
> different initialization that happens such that the offending device search
> problem doesn't occur?
>
> Thanks,
>
> David
>
> ________________________________
> From: Shrader, David Lee
> Sent: Tuesday, November 2, 2021 2:09 PM
> To: Open MPI Users
> Cc: Michael Di Domenico
> Subject: Re: [EXTERNAL] [OMPI users] strange pml error
>
> I too have been getting this using 4.1.1, but not with the master nightly
> tarballs from mid-October. I still have it on my to-do list to open a github
> issue. The problem seems to come from device detection in the ucx pml: on
> some ranks, it fails to find a device and thus the ucx pml disqualifies
> itself. Which then just leaves the ob1 pml.
>
> Thanks,
>
> David
>
> ________________________________
> From: users <users-boun...@lists.open-mpi.org> on behalf of Michael Di
> Domenico via users <users@lists.open-mpi.org>
> Sent: Tuesday, November 2, 2021 1:35 PM
> To: Open MPI Users
> Cc: Michael Di Domenico
> Subject: [EXTERNAL] [OMPI users] strange pml error
>
> fairly frequently, but not everytime when trying to run xhpl on a new
> machine i'm bumping into this. it happens with a single node or
> multiple nodes
>
> node1 selected pml ob1, but peer on node1 selected pml ucx
>
> if i rerun the exact same command a few minutes later, it works fine.
> the machine is new and i'm the only one using it so there are no user
> conflicts
>
> the software stack is
>
> slurm 21.8.2.1
> ompi 4.1.1
> pmix 3.2.3
> ucx 1.9.0
>
> the hardware is HPE w/ mellanox edr cards (but i doubt that matters)
>
> any thoughts?