[OMPI users] Help running OpenMPI in prrte

2023-01-11 Thread Jonathon Anderson via users
I am getting an error and crash when trying to use PRRTE to run a
containerized instance of the OSU Micro-Benchmarks built against
Open MPI. The same container works using PMI2 support in Slurm. Full
details are available at https://github.com/openpmix/prrte/issues/1635,
but the PRRTE developers suggested I reach out to the Open MPI community.
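
(For reference, the working Slurm launch is essentially the following;
the job options here are simplified rather than copied from my actual
submission:)

$ srun --mpi=pmi2 -N 2 --ntasks-per-node=1 ./osu-micro-benchmarks.sif osu_init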

Error output follows. Can anyone point me in the right direction to
understand what I'm doing wrong?

$ prterun -n 2 --map-by=ppr:1:node \
    --hostfile ~/janderson/workflows/util/prrte/hostfile.txt \
    ./osu-micro-benchmarks.sif osu_init
--
Open MPI's OFI driver detected multiple equidistant NICs from the
current process,
but had insufficient information to ensure MPI processes fairly pick a
NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
1851085648 or higher.
--
c5.190935map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b) size
262144 failed: Resource temporarily unavailable
c5.190935osu_init: An unrecoverable error occurred while communicating
with the driver
[c5:190935] *** Process received signal ***
[c5:190935] Signal: Aborted (6)
[c5:190935] Signal code:  (-6)
[c5:190935] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f8c6ec62cf0]
[c5:190935] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f8c6e8d9acf]
[c5:190935] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f8c6e8acea5]
[c5:190935] [ 3]
/opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7f8c6c5af804]
[c5:190935] [ 4]
/opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7f8c6c575e3e]
[c5:190935] [ 5]
/opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7f8c6c576cdb]
[c5:190935] [ 6]
/opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7f8c6c579353]
[c5:190935] [ 7]
/opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7f8c6c57aa49]
[c5:190935] [ 8]
/opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7f8c6dfdfb14]
[c5:190935] [ 9]
/opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7f8c6dfe92be]
[c5:190935] [10]
/opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7f8c6e2d0d2d]
[c5:190935] [11]
/opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7f8c6e2c0b83]
[c5:190935] [12]
/opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7f8c6ef47f42]
[c5:190935] [13]
/opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f8c6ef46084]
[c5:190935] [14]
/opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7f8c6f1105cc]
[c5:190935] [15]
/opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f8c6ef1fa4e]
[c5:190935] [16]
/opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c5:190935] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f8c6e8c5d85]
[c5:190935] [18]
/opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c5:190935] *** End of error message ***
--
Open MPI's OFI driver detected multiple equidistant NICs from the
current process,
but had insufficient information to ensure MPI processes fairly pick a
NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
-1891646640 or higher.
--
c6.191679map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b) size
262144 failed: Resource temporarily unavailable
c6.191679osu_init: An unrecoverable error occurred while communicating
with the driver
[c6:191679] *** Process received signal ***
[c6:191679] Signal: Aborted (6)
[c6:191679] Signal code:  (-6)
[c6:191679] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f518fb09cf0]
[c6:191679] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f518f780acf]
[c6:191679] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f518f753ea5]
[c6:191679] [ 3]
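
Since the only difference between the working and failing runs is the
launcher, one thing I plan to compare (purely a guess on my part) is the
resource limits and environment that each launcher hands to the
processes, along these lines:

$ prterun -n 2 --map-by=ppr:1:node \
    --hostfile ~/janderson/workflows/util/prrte/hostfile.txt \
    bash -c 'ulimit -l; env | sort'
$ srun --mpi=pmi2 -N 2 --ntasks-per-node=1 bash -c 'ulimit -l; env | sort'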

Re: [OMPI users] ucx configuration

2023-01-11 Thread Gilles Gouaillardet via users
You can pick one test, make it standalone, and open an issue on GitHub.
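
For example, something along these lines (the file name is a
placeholder, not one of your tests):

# repro.c: the failing test reduced to a self-contained program
$ mpicc -o repro repro.c
$ mpirun -n 2 ./repro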

How does (vanilla) Open MPI compare to your vendor's Open MPI-based library?

Cheers,

Gilles

On Wed, Jan 11, 2023 at 10:20 PM Dave Love via users <
users@lists.open-mpi.org> wrote:

> Gilles Gouaillardet via users writes:
>
> > Dave,
> >
> > If there is a bug you would like to report, please open an issue at
> > https://github.com/open-mpi/ompi/issues and provide all the required
> > information
> > (in this case, it should also include the UCX library you are using and
> how
> > it was obtained or built).
>
> There are hundreds of failures I was interested in resolving with the
> latest versions, though I think somewhat fewer than with previous UCX
> versions.
>
> I'd like to know the recommended way to build so that I'm starting from
> the right place for any investigation.  The possible interplay between
> OMPI and UCX options seems worth understanding specifically, and more
> generally I think it's reasonable to ask how to configure the two to
> work together, given how many options there are with so little
> explanation.
>
> I have tried raising issues previously without much luck but, given the
> number of failures, something is fundamentally wrong, and I doubt you
> want the output from the whole set.
>
> Perhaps the MPICH test set in a "portable" configuration is expected to
> fail with OMPI for some reason, and someone can comment on that.
> However, it's the only comprehensive set I know is available, and
> originally even IMB crashed, so I'm not inclined to blame the tests
> initially, and wonder how this stuff is tested.


Re: [OMPI users] ucx configuration

2023-01-11 Thread Dave Love via users
Gilles Gouaillardet via users writes:

> Dave,
>
> If there is a bug you would like to report, please open an issue at
> https://github.com/open-mpi/ompi/issues and provide all the required
> information
> (in this case, it should also include the UCX library you are using and how
> it was obtained or built).

There are hundreds of failures I was interested in resolving with the
latest versions, though I think somewhat fewer than with previous UCX
versions.

I'd like to know the recommended way to build so that I'm starting from
the right place for any investigation.  The possible interplay between
OMPI and UCX options seems worth understanding specifically, and more
generally I think it's reasonable to ask how to configure the two to
work together, given how many options there are with so little
explanation.
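
To make the question concrete: is a plain build along these lines the
intended starting point, or are extra options expected on either side?
(The prefixes below are placeholders.)

# in the UCX source tree:
$ ./contrib/configure-release --prefix=$UCX_PREFIX && make -j install
# in the Open MPI source tree:
$ ./configure --with-ucx=$UCX_PREFIX --prefix=$OMPI_PREFIX && make -j install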

I have tried raising issues previously without much luck but, given the
number of failures, something is fundamentally wrong, and I doubt you
want the output from the whole set.

Perhaps the MPICH test set in a "portable" configuration is expected to
fail with OMPI for some reason, and someone can comment on that.
However, it's the only comprehensive set I know is available, and
originally even IMB crashed, so I'm not inclined to blame the tests
initially, and wonder how this stuff is tested.