lem seems to come from device detection in the ucx pml: on
> some ranks, it fails to find a device and thus the ucx pml disqualifies
> itself, which then leaves only the ob1 pml.
>
>
> Thanks,
>
> David
>
>
>
>
> From: users on beh
fairly frequently, but not every time, when trying to run xhpl on a new
machine i'm bumping into this. it happens with a single node or
multiple nodes:
node1 selected pml ob1, but peer on node1 selected pml ucx
if i rerun the exact same command a few minutes later, it works fine.
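
since the failure is intermittent, it can help to scan saved job output for the mismatch message and see which hosts and pmls are involved. a minimal sketch, assuming the message always has the same wording as the line quoted above (other Open MPI versions may word it differently); the function name and regex are hypothetical helpers, not part of Open MPI:

```python
import re

# Pattern modelled on the single message quoted in this thread; the exact
# wording in other Open MPI versions is an assumption.
PML_MISMATCH = re.compile(
    r"(?P<local>\S+) selected pml (?P<local_pml>\w+), "
    r"but peer on (?P<peer>\S+) selected pml (?P<peer_pml>\w+)"
)


def find_pml_mismatches(log_text):
    """Return (local_host, local_pml, peer_host, peer_pml) for each match."""
    return [
        (m["local"], m["local_pml"], m["peer"], m["peer_pml"])
        for m in PML_MISMATCH.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = "node1 selected pml ob1, but peer on node1 selected pml ucx"
    for local, local_pml, peer, peer_pml in find_pml_mismatches(sample):
        print(f"{local} chose {local_pml}; peer on {peer} chose {peer_pml}")
```

running it over the output of a failed and a successful run shows whether the same ranks fall back to ob1 each time.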
the machine is
On Mon, Mar 22, 2021 at 11:13 AM Pritchard Jr., Howard wrote:
> https://github.com/Sandia-OpenSHMEM/SOS
> if you want to use OpenSHMEM over OPA.
> If you have lots of cycles for development work, you could write an OFI SPML
> for the OSHMEM component of Open MPI.
thanks, i am aware of the
i can build and run openmpi on an opa network just fine, but it turns
out building openshmem fails. the message is "(no spml) found".
looking at the config log, it looks like it tries to build the ikrit
and ucx spml components, which fail. i turn ucx off because it doesn't
support opa and isn't needed.
so this
port_lid: 99
> > port_lmc: 0x00
> > link_layer: InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.
> >
>
for whatever it's worth, running the test program on my OPA cluster
seems to work. well, it keeps spitting out [INFO MEMORY] lines; not
sure if it's supposed to stop at some point.
i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, without-{psm,ucx,verbs}
On Tue, Jan 26, 2021 at 3:44 PM
i haven't compiled openmpi in a while, but i'm in the process of
upgrading our cluster.
the last time i did this there were specific versions of mpi/pmix/ucx
that were all tested and supposed to work together. my understanding
was that this was because pmi/ucx were under rapid development and the APIs