I've never seen anything like that before - am I reading those errors correctly, that it cannot find the "write" function symbol in libc? Frankly, if that's true, then it sounds like something is borked in the system.
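One way to double-check that reading is to scan the LD_DEBUG output for symbols that were looked up but never bound. The following is only a minimal sketch: the function name is hypothetical, and the two regular expressions are assumed from nothing more than the two LD_DEBUG lines quoted below, so they may not cover every line shape the dynamic loader can emit.

```python
import re

# Patterns assumed from the two LD_DEBUG lines quoted in this thread.
LOOKUP_RE = re.compile(r"^\s*(\d+):\s+symbol=([^;\s]+);\s+lookup in file=(\S+)")
BINDING_RE = re.compile(
    r"^\s*(\d+):\s+binding file (\S+) \[\d+\] to (\S+) \[\d+\]: \w+ symbol `([^']+)'"
)


def unbound_symbols(log_text):
    """Return the symbols that were looked up but never bound in the given text."""
    looked_up = set()
    bound = set()
    for line in log_text.splitlines():
        m = LOOKUP_RE.match(line)
        if m:
            looked_up.add(m.group(2))
            continue
        m = BINDING_RE.match(line)
        if m:
            bound.add(m.group(4))
    return looked_up - bound


# The two loader lines reported in the thread.
sample = (
    "1263345: symbol=write; lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]\n"
    "1263345: binding file /msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/lib/libopen-pal.so.40 [0] "
    "to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `write' [GLIBC_2.2.5]\n"
)

print(unbound_symbols(sample))
```

For the two quoted lines the result is an empty set: the lookup of `write` is followed by a "binding file ... normal symbol" entry for the same name.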
> On Jan 25, 2022, at 8:26 AM, Matthias Leopold via users
> <users@lists.open-mpi.org> wrote:
>
> Just in case anyone wants to do more debugging: I have now run "srun --mpi=pmix"
> with "LD_DEBUG=all". The lines preceding the error are:
>
> 1263345: symbol=write; lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
> 1263345: binding file /msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/lib/libopen-pal.so.40 [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `write' [GLIBC_2.2.5]
> [foo:1263345] OPAL ERROR: Error in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
>
> Again: the PMIx library version used by SLURM is 3.2.3.
>
> thx
> Matthias
>
> On 25.01.22 at 11:04, Gilles Gouaillardet wrote:
>> Matthias,
>>
>> Thanks for the clarifications.
>> Unfortunately, I cannot connect the dots and I must be missing something.
>>
>> If I recap correctly:
>> - SLURM has builtin PMIx support
>> - Open MPI has builtin PMIx support
>> - srun explicitly requests PMIx (srun --mpi=pmix_v3 ...)
>> - and yet Open MPI issues an error message stating missing support for PMI
>>   (aka SLURM-provided PMI1/PMI2)
>>
>> So it seems Open MPI's builtin PMIx client is unable to find or communicate with
>> the SLURM PMIx server.
>>
>> PMIx has cross-version compatibility (e.g. client and server can have
>> different versions), but with some restrictions.
>> Could this be the root cause?
>> What is the PMIx library version used by SLURM?
>>
>> Ralph, do you see anything that would explain why Open MPI and SLURM cannot
>> communicate via PMIx?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tue, Jan 25, 2022 at 5:47 PM Matthias Leopold
>> <matthias.leop...@meduniwien.ac.at> wrote:
>>
>> Hi Gilles,
>>
>> I'm indeed using srun; I haven't had any luck with mpirun yet.
>> Are options 2 and 3 of your list really different things? As far as I
>> understand now, I need "Open MPI with PMI support", THEN I can use srun
>> with PMIx.
>> Right now, using "srun --mpi=pmix(_v3)" gives the error
>> mentioned below.
>>
>> Best,
>> Matthias
>>
>> On 25.01.22 at 07:17, Gilles Gouaillardet via users wrote:
>> > Matthias,
>> >
>> > do you run the MPI application with mpirun or srun?
>> >
>> > The error log suggests you are using srun, and SLURM only provides
>> > PMI support.
>> > If this is the case, then you have three options:
>> > - use mpirun
>> > - rebuild Open MPI with PMI support as Ralph previously explained
>> > - use SLURM PMIx:
>> >     srun --mpi=list
>> >   will list the PMI flavors provided by SLURM
>> >   a) if PMIx is not supported, contact your sysadmin and ask for it
>> >   b) if PMIx is supported but is not the default, ask for it, for
>> >      example with
>> >        srun --mpi=pmix_v3 ...
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > On Tue, Jan 25, 2022 at 12:30 AM Ralph Castain via users
>> > <users@lists.open-mpi.org> wrote:
>> >
>> > You should probably ask them - I see in the top one that they used a
>> > platform file, which likely had the missing option in it. The bottom
>> > one does not use that platform file, so it was probably missed.
>> >
>> > > On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users
>> > > <users@lists.open-mpi.org> wrote:
>> > >
>> > > To be sure: both packages were provided by NVIDIA (I didn't
>> > > compile them)
>> > >
>> > > On 24.01.22 at 16:13, Matthias Leopold wrote:
>> > >> Thx, but I don't see this option in either of the two versions:
>> > >>
>> > >> /usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
>> > >> Configure command line: '--build=x86_64-linux-gnu'
>> > >> '--prefix=/usr' '--includedir=${prefix}/include'
>> > >> '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info'
>> > >> '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules'
>> > >> '--libexecdir=${prefix}/lib/openmpi' '--disable-maintainer-mode'
>> > >> '--disable-dependency-tracking'
>> > >> '--prefix=/usr/mpi/gcc/openmpi-4.1.2a1'
>> > >> '--with-platform=contrib/platform/mellanox/optimized'
>> > >>
>> > >> lmod ompi (doesn't work with slurm):
>> > >> Configure command line:
>> > >> '--prefix=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
>> > >> 'CC=nvc -nomp' 'CXX=nvc++ -nomp' 'FC=nvfortran -nomp' 'CFLAGS=-O1
>> > >> -fPIC -c99 -tp p7-64' 'CXXFLAGS=-O1 -fPIC -tp p7-64' 'FCFLAGS=-O1
>> > >> -fPIC -tp p7-64' 'LD=ld' '--enable-shared' '--enable-static'
>> > >> '--without-tm' '--enable-mpi-cxx' '--disable-wrapper-runpath'
>> > >> '--enable-mpirun-prefix-by-default' '--with-libevent=internal'
>> > >> '--with-slurm' '--without-libnl' '--enable-mpi1-compatibility'
>> > >> '--enable-mca-no-build=btl-uct' '--without-verbs'
>> > >> '--with-cuda=/proj/cuda/11.0/Linux_x86_64'
>> > >> '--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
>> > >>
>> > >> Matthias
>> > >>
>> > >> On 24.01.22 at 15:59, Ralph Castain via users wrote:
>> > >>> If you look at your configure line, you forgot to include
>> > >>> --with-pmi=<path-to-slurm-pmi-lib>.
>> > >>> We don't build the Slurm PMI support by default due to the GPL
>> > >>> licensing issues - you have to point at it.
>> > >>>
>> > >>>> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users
>> > >>>> <users@lists.open-mpi.org> wrote:
>> > >>>>
>> > >>>> Hi,
>> > >>>>
>> > >>>> we have 2 DGX A100 machines and I'm trying to run nccl-tests
>> > >>>> (https://github.com/NVIDIA/nccl-tests) in various ways to
>> > >>>> understand how things work.
>> > >>>>
>> > >>>> I can successfully run nccl-tests on both nodes with Slurm
>> > >>>> (via srun) when built directly on a compute node against Open MPI
>> > >>>> 4.1.2 coming from an NVIDIA deb package.
>> > >>>>
>> > >>>> I can also build nccl-tests in an lmod environment with NVIDIA
>> > >>>> HPC SDK 21.09 with Open MPI 4.0.5. When I run this with Slurm (via
>> > >>>> srun) I get the following message:
>> > >>>>
>> > >>>> [foo:1140698] OPAL ERROR: Error in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
>> > >>>> --------------------------------------------------------------------------
>> > >>>> The application appears to have been direct launched using "srun",
>> > >>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>> > >>>> execute. There are several options for building PMI support under
>> > >>>> SLURM, depending upon the SLURM version you are using:
>> > >>>>
>> > >>>> version 16.05 or later: you can use SLURM's PMIx support. This
>> > >>>> requires that you configure and build SLURM --with-pmix.
>> > >>>>
>> > >>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>> > >>>> PMI-2 support.
>> > >>>> SLURM builds PMI-1 by default, or you can manually
>> > >>>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>> > >>>> to the SLURM PMI library location.
>> > >>>>
>> > >>>> Please configure as appropriate and try again.
>> > >>>> --------------------------------------------------------------------------
>> > >>>> *** An error occurred in MPI_Init
>> > >>>> *** on a NULL communicator
>> > >>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> > >>>> *** and potentially your MPI job)
>> > >>>>
>> > >>>> When I look at PMI support in both Open MPI packages I don't
>> > >>>> see a lot of difference:
>> > >>>>
>> > >>>> "/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi":
>> > >>>>
>> > >>>> mca:pmix:isolated:version:"mca:2.1.0"
>> > >>>> mca:pmix:isolated:version:"api:2.0.0"
>> > >>>> mca:pmix:isolated:version:"component:4.1.2"
>> > >>>> mca:pmix:flux:version:"mca:2.1.0"
>> > >>>> mca:pmix:flux:version:"api:2.0.0"
>> > >>>> mca:pmix:flux:version:"component:4.1.2"
>> > >>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>> > >>>> mca:pmix:pmix3x:version:"api:2.0.0"
>> > >>>> mca:pmix:pmix3x:version:"component:4.1.2"
>> > >>>> mca:ess:pmi:version:"mca:2.1.0"
>> > >>>> mca:ess:pmi:version:"api:3.0.0"
>> > >>>> mca:ess:pmi:version:"component:4.1.2"
>> > >>>>
>> > >>>> "/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable | grep -i pmi":
>> > >>>>
>> > >>>> mca:pmix:isolated:version:"mca:2.1.0"
>> > >>>> mca:pmix:isolated:version:"api:2.0.0"
>> > >>>> mca:pmix:isolated:version:"component:4.0.5"
>> > >>>> mca:pmix:pmix3x:version:"mca:2.1.0"
>> > >>>> mca:pmix:pmix3x:version:"api:2.0.0"
>> > >>>> mca:pmix:pmix3x:version:"component:4.0.5"
>> > >>>> mca:ess:pmi:version:"mca:2.1.0"
>> > >>>> mca:ess:pmi:version:"api:3.0.0"
>> > >>>> mca:ess:pmi:version:"component:4.0.5"
>> > >>>>
>> > >>>> I don't know if I'm looking in the right place, but to
>> > >>>> me it seems to be an Open MPI topic, which is why I'm posting here.
>> > >>>> Please explain what's missing in my case.
>> > >>>>
>> > >>>> Slurm is 21.08.5. "MpiDefault" in slurm.conf is "pmix".
>> > >>>> Both Open MPI versions have Slurm support.
>> > >>>>
>> > >>>> thx
>> > >>>> Matthias
>> > >>>
>> > >
>> > > --
>> > > Matthias Leopold
>> > > IT Systems & Communications
>> > > Medizinische Universität Wien
>> > > Spitalgasse 23 / BT 88 / Ebene 00
>> > > A-1090 Wien
>> > > Tel: +43 1 40160-21241
>> > > Fax: +43 1 40160-921200
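
The two "ompi_info --parsable | grep -i pmi" listings quoted in the original post can also be compared mechanically rather than by eye. This is a rough sketch only: the helper names are hypothetical, the line shape is assumed from the excerpts in this thread, and the sample strings keep just the "component" lines from each listing. It reports which components differ; it does not by itself explain the srun error.

```python
import re

# Line shape assumed from the `ompi_info --parsable` excerpts in this thread:
#   mca:<framework>:<component>:version:"<kind>:<version>"
LINE_RE = re.compile(r'^mca:([^:]+):([^:]+):version:"([^:"]+):([^"]+)"$')


def parse_components(parsable_text):
    """Map (framework, component) -> {kind: version} for the matching lines."""
    found = {}
    for raw in parsable_text.splitlines():
        # Normalise typographic quotes that a mail client may have substituted.
        line = raw.strip().replace("\u201c", '"').replace("\u201d", '"')
        m = LINE_RE.match(line)
        if m:
            framework, component, kind, version = m.groups()
            found.setdefault((framework, component), {})[kind] = version
    return found


def only_in_first(first_text, second_text):
    """Return the (framework, component) pairs present only in the first listing."""
    return sorted(set(parse_components(first_text)) - set(parse_components(second_text)))


# "component" lines copied from the two listings in the original post.
deb_build = """\
mca:pmix:isolated:version:"component:4.1.2"
mca:pmix:flux:version:"component:4.1.2"
mca:pmix:pmix3x:version:"component:4.1.2"
mca:ess:pmi:version:"component:4.1.2"
"""

hpc_sdk_build = """\
mca:pmix:isolated:version:"component:4.0.5"
mca:pmix:pmix3x:version:"component:4.0.5"
mca:ess:pmi:version:"component:4.0.5"
"""

print(only_in_first(deb_build, hpc_sdk_build))  # prints [('pmix', 'flux')]
```

In practice the two strings would be replaced with the captured output of each ompi_info binary.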