Never seen anything like that before. Am I reading those errors correctly, i.e. 
that it cannot find the "write" function symbol in libc? Frankly, if that's 
true, then something in the system is borked.
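
If you want to sanity-check that on the affected node, something like this
should show whether the dynamic symbol is actually exported (the libc path is
the usual Debian/Ubuntu one and may differ on your system):

  # list libc's dynamic symbol table and look for "write"
  nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -w write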


> On Jan 25, 2022, at 8:26 AM, Matthias Leopold via users 
> <users@lists.open-mpi.org> wrote:
> 
> Just in case anyone wants to do more debugging: I ran "srun --mpi=pmix" with 
> "LD_DEBUG=all" now; the lines preceding the error are:
> 
>   1263345:    symbol=write;  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
>   1263345:    binding file /msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/lib/libopen-pal.so.40 [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `write' [GLIBC_2.2.5]
> 
> [foo:1263345] OPAL ERROR: Error in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
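> 
> (For anyone reproducing this: LD_DEBUG floods stderr, so it may be easier to
> write the trace to per-process files. A minimal sketch - the nccl-tests binary
> name is just a placeholder:
> 
>   # glibc appends the PID, giving one /tmp/ld-trace.<pid> file per process
>   LD_DEBUG=all LD_DEBUG_OUTPUT=/tmp/ld-trace srun --mpi=pmix ./all_reduce_perf
> )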
> 
> 
> Again: the PMIx library version used by SLURM is 3.2.3.
> 
> thx
> Matthias
> 
> Am 25.01.22 um 11:04 schrieb Gilles Gouaillardet:
>> Matthias,
>> 
>> Thanks for the clarifications. Unfortunately, I cannot connect the dots, so I
>> must be missing something. If I recap correctly:
>> 
>>  - SLURM has builtin PMIx support
>>  - Open MPI has builtin PMIx support
>>  - srun explicitly requires PMIx (srun --mpi=pmix_v3 ...)
>>  - and yet Open MPI issues an error message stating that PMI support (aka
>>    SLURM-provided PMI-1/PMI-2) is missing
>> 
>> So it seems the Open MPI builtin PMIx client is unable to find/communicate
>> with the SLURM PMIx server. PMIx has cross-version compatibility (e.g. client
>> and server can run different versions), but with some restrictions. Could
>> this be the root cause? What is the PMIx library version used by SLURM?
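>> (One quick way to check, assuming the SLURM PMIx plugin is installed under
>> /usr/lib/slurm - the plugin path varies by distro and build:
>> 
>>   srun --mpi=list                                 # PMI flavors SLURM offers
>>   ldd /usr/lib/slurm/mpi_pmix_v3.so | grep pmix   # PMIx library the plugin links
>> )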
>> Ralph, do you see any reason why Open MPI and SLURM cannot communicate via
>> PMIx?
>> Cheers,
>> Gilles
>> On Tue, Jan 25, 2022 at 5:47 PM Matthias Leopold 
>> <matthias.leop...@meduniwien.ac.at> wrote:
>>    Hi Gilles,
>>    I'm indeed using srun; I haven't had luck using mpirun yet.
>>    Are options 2 and 3 of your list really different things? As I understand
>>    it now, I need "Open MPI with PMI support", and THEN I can use srun with
>>    PMIx. Right now "srun --mpi=pmix(_v3)" gives the error mentioned below.
>>    Best,
>>    Matthias
>>    Am 25.01.22 um 07:17 schrieb Gilles Gouaillardet via users:
>>     > Matthias,
>>     >
>>     > do you run the MPI application with mpirun or srun?
>>     >
>>     > The error log suggests you are using srun, and SLURM provides only
>>     > PMI support.
>>     > If this is the case, then you have three options:
>>     >   - use mpirun
>>     >   - rebuild Open MPI with PMI support as Ralph previously explained
>>     >   - use SLURM PMIx:
>>     >       srun --mpi=list
>>     >     will list the PMI flavors provided by SLURM (illustrative output below)
>>     >     a) if PMIx is not supported, contact your sysadmin and ask for it
>>     >     b) if PMIx is supported but is not the default, ask for it, for
>>     >        example with
>>     >       srun --mpi=pmix_v3 ...
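>>     >
>>     > (Illustrative output on a SLURM built with PMIx - the exact list
>>     > depends on how SLURM was configured:
>>     >
>>     >     srun: MPI types are...
>>     >     srun: none
>>     >     srun: pmi2
>>     >     srun: pmix
>>     >     srun: pmix_v3
>>     > )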
>>     >
>>     > Cheers,
>>     >
>>     > Gilles
>>     >
>>     > On Tue, Jan 25, 2022 at 12:30 AM Ralph Castain via users
>>     > <users@lists.open-mpi.org> wrote:
>>     >
>>     >     You should probably ask them - I see in the top one that they used a
>>     >     platform file, which likely had the missing option in it. The bottom
>>     >     one does not use that platform file, so it was probably missed.
>>     >
>>     >
>>     >      > On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users
>>     >      > <users@lists.open-mpi.org> wrote:
>>     >      >
>>     >      > To be sure: both packages were provided by NVIDIA (I didn't
>>     >      > compile them)
>>     >      >
>>     >      > Am 24.01.22 um 16:13 schrieb Matthias Leopold:
>>     >      >> Thx, but I don't see this option in either of the two versions:
>>     >      >>
>>     >      >> /usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
>>     >      >>   Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
>>     >      >>   '--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
>>     >      >>   '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
>>     >      >>   '--disable-silent-rules' '--libexecdir=${prefix}/lib/openmpi'
>>     >      >>   '--disable-maintainer-mode' '--disable-dependency-tracking'
>>     >      >>   '--prefix=/usr/mpi/gcc/openmpi-4.1.2a1'
>>     >      >>   '--with-platform=contrib/platform/mellanox/optimized'
>>     >      >>
>>     >      >> lmod ompi (doesn't work with slurm):
>>     >      >>   Configure command line:
>>     >      >>   '--prefix=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
>>     >      >>   'CC=nvc -nomp' 'CXX=nvc++ -nomp' 'FC=nvfortran -nomp'
>>     >      >>   'CFLAGS=-O1 -fPIC -c99 -tp p7-64' 'CXXFLAGS=-O1 -fPIC -tp p7-64'
>>     >      >>   'FCFLAGS=-O1 -fPIC -tp p7-64' 'LD=ld' '--enable-shared' '--enable-static'
>>     >      >>   '--without-tm' '--enable-mpi-cxx' '--disable-wrapper-runpath'
>>     >      >>   '--enable-mpirun-prefix-by-default' '--with-libevent=internal'
>>     >      >>   '--with-slurm' '--without-libnl' '--enable-mpi1-compatibility'
>>     >      >>   '--enable-mca-no-build=btl-uct' '--without-verbs'
>>     >      >>   '--with-cuda=/proj/cuda/11.0/Linux_x86_64'
>>     >      >>   '--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
>>     >      >>
>>     >      >> Matthias
>>     >      >> Am 24.01.22 um 15:59 schrieb Ralph Castain via users:
>>     >      >>> If you look at your configure line, you forgot to include
>>     >      >>> --with-pmi=<path-to-slurm-pmi-lib>. We don't build the Slurm PMI
>>     >      >>> support by default due to the GPL licensing issues - you have to
>>     >      >>> point at it.
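>>     >      >>> (A hypothetical example - the Slurm prefix /usr is an assumption,
>>     >      >>> point --with-pmi at wherever pmi.h and libpmi live on your system:
>>     >      >>>
>>     >      >>>   ./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/usr ...
>>     >      >>> )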
>>     >      >>>
>>     >      >>>
>>     >      >>>> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users
>>     >      >>>> <users@lists.open-mpi.org> wrote:
>>     >      >>>>
>>     >      >>>> Hi,
>>     >      >>>>
>>     >      >>>> we have 2 DGX A100 machines and I'm trying to run nccl-tests
>>     >      >>>> (https://github.com/NVIDIA/nccl-tests) in various ways to
>>     >      >>>> understand how things work.
>>     >      >>>>
>>     >      >>>> I can successfully run nccl-tests on both nodes with Slurm
>>     >      >>>> (via srun) when built directly on a compute node against
>>     >      >>>> Open MPI 4.1.2 coming from an NVIDIA deb package.
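>>     >      >>>>
>>     >      >>>> (The kind of invocation meant here - node/GPU counts and the
>>     >      >>>> nccl-tests flags are illustrative, not the exact command:
>>     >      >>>>
>>     >      >>>>   srun --mpi=pmix -N 2 --ntasks-per-node=8 --gpus-per-node=8 \
>>     >      >>>>       ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
>>     >      >>>> )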
>>     >      >>>>
>>     >      >>>> I can also build nccl-tests in an lmod environment with
>>     >      >>>> NVIDIA HPC SDK 21.09 and Open MPI 4.0.5. When I run this
>>     >      >>>> with Slurm (via srun) I get the following message:
>>     >      >>>>
>>     >      >>>> [foo:1140698] OPAL ERROR: Error in file
>>     >      >>>> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
>>     >      >>>>
>>     >      >>>> --------------------------------------------------------------------------
>>     >      >>>> The application appears to have been direct launched using "srun",
>>     >      >>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>     >      >>>> execute. There are several options for building PMI support under
>>     >      >>>> SLURM, depending upon the SLURM version you are using:
>>     >      >>>>
>>     >      >>>>   version 16.05 or later: you can use SLURM's PMIx support. This
>>     >      >>>>   requires that you configure and build SLURM --with-pmix.
>>     >      >>>>
>>     >      >>>>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>     >      >>>>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>     >      >>>>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>>     >      >>>>   to the SLURM PMI library location.
>>     >      >>>>
>>     >      >>>> Please configure as appropriate and try again.
>>     >      >>>> --------------------------------------------------------------------------
>>     >      >>>>
>>     >      >>>> *** An error occurred in MPI_Init
>>     >      >>>> *** on a NULL communicator
>>     >      >>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>     >      >>>> ***    and potentially your MPI job)
>>     >      >>>>
>>     >      >>>> When I look at PMI support in both Open MPI packages I don't
>>     >      >>>> see a lot of difference:
>>     >      >>>>
>>     >      >>>> "/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi":
>>     >      >>>>
>>     >      >>>> mca:pmix:isolated:version:"mca:2.1.0"
>>     >      >>>> mca:pmix:isolated:version:"api:2.0.0"
>>     >      >>>> mca:pmix:isolated:version:"component:4.1.2"
>>     >      >>>> mca:pmix:flux:version:"mca:2.1.0"
>>     >      >>>> mca:pmix:flux:version:"api:2.0.0"
>>     >      >>>> mca:pmix:flux:version:"component:4.1.2"
>>     >      >>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>>     >      >>>> mca:pmix:pmix3x:version:"api:2.0.0"
>>     >      >>>> mca:pmix:pmix3x:version:"component:4.1.2"
>>     >      >>>> mca:ess:pmi:version:"mca:2.1.0"
>>     >      >>>> mca:ess:pmi:version:"api:3.0.0"
>>     >      >>>> mca:ess:pmi:version:"component:4.1.2"
>>     >      >>>>
>>     >      >>>>
>>     >      >>>> "/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable | grep -i pmi":
>>     >      >>>>
>>     >      >>>> mca:pmix:isolated:version:"mca:2.1.0"
>>     >      >>>> mca:pmix:isolated:version:"api:2.0.0"
>>     >      >>>> mca:pmix:isolated:version:"component:4.0.5"
>>     >      >>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>>     >      >>>> mca:pmix:pmix3x:version:"api:2.0.0"
>>     >      >>>> mca:pmix:pmix3x:version:"component:4.0.5"
>>     >      >>>> mca:ess:pmi:version:"mca:2.1.0"
>>     >      >>>> mca:ess:pmi:version:"api:3.0.0"
>>     >      >>>> mca:ess:pmi:version:"component:4.0.5"
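>>     >      >>>>
>>     >      >>>> (Side note, as far as I understand it: Slurm's PMI-1/PMI-2
>>     >      >>>> support - the part enabled by --with-pmi - would show up here
>>     >      >>>> as additional "s1"/"s2" components, e.g.:
>>     >      >>>>
>>     >      >>>>   ompi_info --parsable | grep -E 'mca:pmix:(s1|s2)'
>>     >      >>>>
>>     >      >>>> and neither build lists any.)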
>>     >      >>>>
>>     >      >>>> I don't know if that's the right place to be looking, but to
>>     >      >>>> me it seems to be an Open MPI topic, which is why I'm posting
>>     >      >>>> here. Please explain what's missing in my case.
>>     >      >>>>
>>     >      >>>> Slurm is 21.08.5. "MpiDefault" in slurm.conf is "pmix".
>>     >      >>>> Both Open MPI versions have Slurm support.
>>     >      >>>>
>>     >      >>>> thx
>>     >      >>>> Matthias
>>     >      >>>
>>     >      >>>
>>     >      >
>>     >      > --
>>     >      > Matthias Leopold
>>     >      > IT Systems & Communications
>>     >      > Medizinische Universität Wien
>>     >      > Spitalgasse 23 / BT 88 / Ebene 00
>>     >      > A-1090 Wien
>>     >      > Tel: +43 1 40160-21241
>>     >      > Fax: +43 1 40160-921200
>>     >
>>    -- 
>>    Matthias Leopold
>>    IT Systems & Communications
>>    Medizinische Universität Wien
>>    Spitalgasse 23 / BT 88 / Ebene 00
>>    A-1090 Wien
>>    Tel: +43 1 40160-21241
>>    Fax: +43 1 40160-921200
> 
> -- 
> Matthias Leopold
> IT Systems & Communications
> Medizinische Universität Wien
> Spitalgasse 23 / BT 88 / Ebene 00
> A-1090 Wien
> Tel: +43 1 40160-21241
> Fax: +43 1 40160-921200

