I opened an issue, and it looks like a fix has already gone into the 4.1.2 
release branch. I tested the patch on my 4.1.1 release tarball, and the error 
no longer occurs.


Here is the link to the issue:


https://github.com/open-mpi/ompi/issues/9617


Thanks,

David


________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Michael Di Domenico 
via users <users@lists.open-mpi.org>
Sent: Wednesday, November 3, 2021 8:58 AM
Cc: Michael Di Domenico; Open MPI Users
Subject: Re: [OMPI users] [EXTERNAL] strange pml error

this seemed to help me as well, so far at least.  still have a lot
more testing to do
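
For anyone wanting to try the workaround quoted below, here is a minimal
sketch of one way to apply it from a launcher script. It only sets the
OMPI_MCA_pml environment variable named in the thread before starting the
job. The helper names (env_with_pml, run_with_pml) and the example launch
line are hypothetical, not something from the Open MPI tree or from anyone's
actual setup.

```python
import os
import subprocess


def env_with_pml(pml="ucx", base_env=None):
    """Return a copy of the environment with OMPI_MCA_pml set to `pml`.

    The original mapping is left untouched so the caller's environment
    is not modified as a side effect.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_MCA_pml"] = pml
    return env


def run_with_pml(command, pml="ucx"):
    """Run `command` (a list of arguments) with the pml forced via the environment."""
    return subprocess.run(command, env=env_with_pml(pml), check=False)


# Hypothetical usage; replace the launch line with your own job command:
#   result = run_with_pml(["mpirun", "-np", "4", "./xhpl"])
#   print("exit code:", result.returncode)
```

Forcing ucx this way means a rank that cannot initialize ucx should fail
outright rather than quietly fall back to ob1, which at least makes the
failure easier to spot than a mismatch between peers.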

On Tue, Nov 2, 2021 at 4:15 PM Shrader, David Lee <dshra...@lanl.gov> wrote:
>
> As a workaround for now, I have found that setting OMPI_MCA_pml=ucx seems to 
> get around this issue. I'm not sure why this works, but perhaps 
> initialization happens differently, so that the offending device search 
> problem doesn't occur?
>
>
> Thanks,
>
> David
>
>
>
> ________________________________
> From: Shrader, David Lee
> Sent: Tuesday, November 2, 2021 2:09 PM
> To: Open MPI Users
> Cc: Michael Di Domenico
> Subject: Re: [EXTERNAL] [OMPI users] strange pml error
>
>
> I too have been getting this using 4.1.1, but not with the master nightly 
> tarballs from mid-October. I still have it on my to-do list to open a GitHub 
> issue. The problem seems to come from device detection in the ucx pml: on 
> some ranks, it fails to find a device and thus the ucx pml disqualifies 
> itself, which then leaves only the ob1 pml on those ranks.
>
>
> Thanks,
>
> David
>
>
>
> ________________________________
> From: users <users-boun...@lists.open-mpi.org> on behalf of Michael Di 
> Domenico via users <users@lists.open-mpi.org>
> Sent: Tuesday, November 2, 2021 1:35 PM
> To: Open MPI Users
> Cc: Michael Di Domenico
> Subject: [EXTERNAL] [OMPI users] strange pml error
>
> fairly frequently, but not every time, when trying to run xhpl on a new
> machine i'm bumping into this.  it happens with a single node or
> multiple nodes
>
> node1 selected pml ob1, but peer on node1 selected pml ucx
>
> if i rerun the exact same command a few minutes later, it works fine.
> the machine is new and i'm the only one using it so there are no user
> conflicts
>
> the software stack is
>
> slurm 21.8.2.1
> ompi 4.1.1
> pmix 3.2.3
> ucx 1.9.0
>
> the hardware is HPE w/ mellanox edr cards (but i doubt that matters)
>
> any thoughts?
