[OMPI users] Help running OpenMPI in prrte
I am getting an error and crash when trying to use PRRTE to run a containerized instance of OSU Micro-Benchmarks built against Open MPI. The same container works using PMI2 support in Slurm. Full details are available at https://github.com/openpmix/prrte/issues/1635, but they suggested I reach out to OMPI. Error output follows. Can anyone point me in the right direction to understand what I'm doing wrong?

$ prterun -n 2 --map-by=ppr:1:node --hostfile ~/janderson/workflows/util/prrte/hostfile.txt ./osu-micro-benchmarks.sif osu_init
--
Open MPI's OFI driver detected multiple equidistant NICs from the current process, but had insufficient information to ensure MPI processes fairly pick a NIC for use. This may negatively impact performance. A more modern PMIx server is necessary to resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is 1851085648 or higher.
--
c5.190935map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b) size 262144 failed: Resource temporarily unavailable
c5.190935osu_init: An unrecoverable error occurred while communicating with the driver
[c5:190935] *** Process received signal ***
[c5:190935] Signal: Aborted (6)
[c5:190935] Signal code: (-6)
[c5:190935] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f8c6ec62cf0]
[c5:190935] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f8c6e8d9acf]
[c5:190935] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f8c6e8acea5]
[c5:190935] [ 3] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7f8c6c5af804]
[c5:190935] [ 4] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7f8c6c575e3e]
[c5:190935] [ 5] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7f8c6c576cdb]
[c5:190935] [ 6] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7f8c6c579353]
[c5:190935] [ 7] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7f8c6c57aa49]
[c5:190935] [ 8] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7f8c6dfdfb14]
[c5:190935] [ 9] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7f8c6dfe92be]
[c5:190935] [10] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7f8c6e2d0d2d]
[c5:190935] [11] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7f8c6e2c0b83]
[c5:190935] [12] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7f8c6ef47f42]
[c5:190935] [13] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f8c6ef46084]
[c5:190935] [14] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7f8c6f1105cc]
[c5:190935] [15] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f8c6ef1fa4e]
[c5:190935] [16] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c5:190935] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f8c6e8c5d85]
[c5:190935] [18] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c5:190935] *** End of error message ***
--
Open MPI's OFI driver detected multiple equidistant NICs from the current process, but had insufficient information to ensure MPI processes fairly pick a NIC for use. This may negatively impact performance. A more modern PMIx server is necessary to resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is -1891646640 or higher.
--
c6.191679map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b) size 262144 failed: Resource temporarily unavailable
c6.191679osu_init: An unrecoverable error occurred while communicating with the driver
[c6:191679] *** Process received signal ***
[c6:191679] Signal: Aborted (6)
[c6:191679] Signal code: (-6)
[c6:191679] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f518fb09cf0]
[c6:191679] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f518f780acf]
[c6:191679] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f518f753ea5]
[c6:191679] [ 3]
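[Editor's note: a hedged triage sketch, not a fix from this thread. The backtrace shows MPI_Init aborting inside the PSM2 provider (psm2_ep_open) while the OFI BTL probes devices, so one common isolation step is to check which libfabric providers are visible and to steer Open MPI away from the OFI/PSM2 path. The hostfile path and benchmark image name below are placeholders; `fi_info`, `FI_PROVIDER`, and the `pml`/`btl` MCA parameters are standard libfabric/Open MPI knobs.]

# List the libfabric providers visible inside the container; the crash
# happens while the psm2 provider opens the HFI device.
fi_info -l

# Confirm the failure is provider-specific by forcing the plain TCP path
# (slow, but MPI_Init should succeed if only PSM2 is broken):
prterun -n 2 --map-by=ppr:1:node --hostfile hostfile.txt \
    --mca pml ob1 --mca btl tcp,self \
    ./osu-micro-benchmarks.sif osu_init

# Alternatively, restrict libfabric to a single non-PSM2 provider via the
# environment to isolate the psm2_ep_open failure:
FI_PROVIDER=sockets prterun -n 2 --map-by=ppr:1:node --hostfile hostfile.txt \
    ./osu-micro-benchmarks.sif osu_init

If the TCP run works under prterun, that points at the PSM2 device setup inside the container (the map_hfi_mem mmap failure) rather than at PRRTE itself.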
Re: [OMPI users] ucx configuration
You can pick one test, make it standalone, and open an issue on GitHub. How does (vanilla) Open MPI compare to your vendor's Open MPI-based library?

Cheers,

Gilles

On Wed, Jan 11, 2023 at 10:20 PM Dave Love via users <users@lists.open-mpi.org> wrote:
> Gilles Gouaillardet via users writes:
>
> > Dave,
> >
> > If there is a bug you would like to report, please open an issue at
> > https://github.com/open-mpi/ompi/issues and provide all the required
> > information (in this case, it should also include the UCX library you
> > are using and how it was obtained or built).
>
> There are hundreds of failures I was interested in resolving with the
> latest versions, though I think somewhat fewer than with previous UCX
> versions.
>
> I'd like to know how it's recommended I should build to ensure I'm
> starting from the right place for any investigation. Possible interplay
> between OMPI and UCX options seems worth understanding specifically, and
> I think it's reasonable to ask how to configure things to work together
> generally, when there are so many options without much explanation.
>
> I have tried raising issues previously without much luck but, given the
> number of failures, something is fundamentally wrong, and I doubt you
> want the output from the whole set.
>
> Perhaps the MPICH test set in a "portable" configuration is expected to
> fail with OMPI for some reason, and someone can comment on that.
> However, it's the only comprehensive set I know is available, and
> originally even IMB crashed, so I'm not inclined to blame the tests
> initially, and wonder how this stuff is tested.
Re: [OMPI users] ucx configuration
Gilles Gouaillardet via users writes:

> Dave,
>
> If there is a bug you would like to report, please open an issue at
> https://github.com/open-mpi/ompi/issues and provide all the required
> information (in this case, it should also include the UCX library you
> are using and how it was obtained or built).

There are hundreds of failures I was interested in resolving with the latest versions, though I think somewhat fewer than with previous UCX versions.

I'd like to know how it's recommended I should build to ensure I'm starting from the right place for any investigation. Possible interplay between OMPI and UCX options seems worth understanding specifically, and I think it's reasonable to ask how to configure things to work together generally, when there are so many options without much explanation.

I have tried raising issues previously without much luck but, given the number of failures, something is fundamentally wrong, and I doubt you want the output from the whole set.

Perhaps the MPICH test set in a "portable" configuration is expected to fail with OMPI for some reason, and someone can comment on that. However, it's the only comprehensive set I know is available, and originally even IMB crashed, so I'm not inclined to blame the tests initially, and wonder how this stuff is tested.
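[Editor's note: a minimal build sketch for the "where do I start" question above. Paths and the job count are placeholders; `--with-ucx` is the standard Open MPI configure flag, and `ucx_info`/`ompi_info` are the standard introspection tools shipped with UCX and Open MPI.]

# Check how the installed UCX was configured before building against it:
ucx_info -v

# Build Open MPI against that UCX explicitly, rather than letting
# configure pick up whatever happens to be on the default search path:
./configure --prefix=$HOME/ompi --with-ucx=/path/to/ucx
make -j8 && make install

# After installing, confirm the UCX components were actually built in:
ompi_info | grep -i ucx

Pinning the UCX prefix at configure time rules out one common source of mismatches: an Open MPI built against one UCX and run against another.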