Leandro,

First, you must make sure SLURM has been built with PMIx (preferably
PMIx 3.1.5) and that its pmix plugin was built.
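
For example (just a sketch, assuming PMIx was installed under /usr and the
default CentOS plugin directory), you could point SLURM's configure at your
PMIx install and then check that the plugin was actually produced:

# build SLURM against an existing PMIx install (the path is an assumption)
./configure --with-pmix=/usr && make && make install
# the mpi_pmix* plugins should then show up in SLURM's plugin directory
ls /usr/lib64/slurm/mpi_pmix*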

From the Open MPI point of view, you do not need the
--with-ompi-pmix-rte option.
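
For reference, here is a minimal sketch of the Open MPI configure options that
are usually enough for this setup (the /usr paths are assumptions; the point is
to build against the same external PMIx and libevent that SLURM was built with):

# configure Open MPI with SLURM support and the external PMIx/libevent
./configure --with-slurm --with-pmix=/usr --with-pmix-libdir=/usr/lib64 \
            --with-libevent=/usr --with-libevent-libdir=/usr/lib64 \
            --with-hwloc=/usr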

If you want to use srun, just make sure it uses PMIx. You can run

srun --mpi=list

to list the available plugins and the default one, then you would typically do

srun --mpi=pmix_v3 a.out

If you want to use mpirun under SLURM, make sure the nodes in your
machine file have been allocated by SLURM.
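
For example (a sketch, assuming a 4-node allocation and that a.out is your MPI
program), you can let SLURM allocate the nodes and run mpirun inside that
allocation:

# allocate 4 nodes through SLURM and launch with mpirun inside the allocation
salloc -N 4 mpirun ./a.out
# or, from within an existing sbatch/salloc session, simply:
mpirun ./a.out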

Cheers,

Gilles

On Tue, May 12, 2020 at 11:03 AM Leandro via users
<users@lists.open-mpi.org> wrote:
>
> Well, I have to build these packages because of specific compilers and some
> options the developers need. I have been doing this for years and it has
> always worked fine. But these days I am using Torque and Maui on the
> clusters, and these are giving me problems now.
>
> The Open MPI we use now, with Torque, is built with a custom-built UCX, a
> compatible version of hwloc (the one in the CentOS or EPEL repos does not
> work), CUDA, and the Mellanox drivers. All of this works and is in production.
>
> All I want is to do the same, but working with Slurm. It is new to me, but
> the documentation says it needs PMIx to work. I don't know what I am
> missing here.
>
> I've been at this for days and don't know what to do anymore.
>
> Any help would be appreciated.
>
> ---
> Leandro
>
>
> On Mon, May 11, 2020 at 8:28 PM Ralph Castain via users 
> <users@lists.open-mpi.org> wrote:
>>
>> I'm not sure I understand why you are trying to build CentOS rpms for PMIx, 
>> Slurm, or OMPI - all three are readily available online. Is there some 
>> particular reason you are trying to do this yourself? I ask because it is 
>> non-trivial to do and requires significant familiarity with both the 
>> intricacies of rpm building and the packages involved.
>>
>>
>> On May 11, 2020, at 6:23 AM, Leandro via users <users@lists.open-mpi.org> 
>> wrote:
>>
>> Hi,
>>
>> I'm trying to start using Slurm, and I followed all the instructions to
>> build PMIx and Slurm with PMIx, but I can't get Open MPI to work.
>>
>> According to the PMIx documentation, I should compile Open MPI using
>> "--with-ompi-pmix-rte", but when I tried, it failed. I need to build this
>> as CentOS RPMs.
>>
>> Thanks in advance for your help. I pasted some info below.
>>
>> libtool: link: 
>> /tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
>>  -std=gnu99 -std=gnu99 -DOPAL_CONFIGURE_USER=\"root\" 
>> -DOPAL_CONFIGURE_HOST=\"gr10b17n05\" "-DOPAL_CONFIGURE_DATE=\"Fri May  8 
>> 13:35:51 -03 2020\"" -DOMPI_BUILD_USER=\"root\" 
>> -DOMPI_BUILD_HOST=\"gr10b17n05\" "-DOMPI_BUILD_DATE=\"Fri May  8 13:47:32 
>> -03 2020\"" "-DOMPI_BUILD_CFLAGS=\"-DNDEBUG -O3 -finline-functions 
>> -fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types 
>> -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../.. -I../../../orte/include    
>> \"" "-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG -O3 -finline-functions -pthread\"" 
>> "-DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \"" "-DOMPI_BUILD_FFLAGS=\"-O2 -g 
>> -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
>> --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic 
>> -I/usr/lib64/gfortran/modules\"" -DOMPI_BUILD_FCFLAGS=\"-O3\" 
>> "-DOMPI_BUILD_LDFLAGS=\"-Wc,-static-intel -static-intel    -L/usr/lib64\"" 
>> "-DOMPI_BUILD_LIBS=\"-lrt -lutil  -lz  -lhwloc  -levent -levent_pthreads\"" 
>> -DOPAL_CC_ABSOLUTE=\"\" -DOMPI_CXX_ABSOLUTE=\"none\" -DNDEBUG -O3 
>> -finline-functions -fno-strict-aliasing -restrict 
>> -Qoption,cpp,--extended_float_types -pthread -static-intel -static-intel -o 
>> .libs/ompi_info ompi_info.o param.o  -L/usr/lib64 
>> ../../../ompi/.libs/libmpi.so -L/usr/lib -llustreapi 
>> /root/rpmbuild/BUILD/openmpi-4.0.2/opal/.libs/libopen-pal.so 
>> ../../../opal/.libs/libopen-pal.so -lfabric -lucp -lucm -lucs -luct -lrdmacm 
>> -libverbs /usr/lib64/libpmix.so -lmunge -lrt -lutil -lz 
>> /usr/lib64/libhwloc.so -lm -ludev -lltdl -levent -levent_pthreads -pthread 
>> -Wl,-rpath -Wl,/usr/lib64
>> icc: warning #10237: -lcilkrts linked in dynamically, static library not 
>> available
>> ../../../ompi/.libs/libmpi.so: undefined reference to `orte_process_info'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `orte_show_help'
>> make[2]: *** [ompi_info] Error 1
>> make[2]: Leaving directory 
>> `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi/tools/ompi_info'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi'
>> make: *** [all-recursive] Error 1
>> error: Bad exit status from /var/tmp/rpm-tmp.RyklCR (%build)
>>
>> The ORTE libraries are missing. When I don't use "--with-ompi-pmix-rte"
>> it builds, but neither mpirun nor srun works:
>>
>> c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > cat machine_file
>> gr10b17n05
>> gr10b17n06
>> gr10b17n07
>> gr10b17n08
>> c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun -machinefile 
>> machine_file ./mpihello
>> [gr10b17n07:115065] [[21391,0],2] ORTE_ERROR_LOG: Not found in file 
>> base/ess_base_std_orted.c at line 362
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_pmix_base_select failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to use.
>>
>> *  compilation of the orted with dynamic libraries when static are required
>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>   one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>   lack of common network interfaces and/or no route found between
>>   them. Please check network connectivity (including firewalls
>>   and network routing requirements).
>> --------------------------------------------------------------------------
>> [gr10b17n08:142030] [[21391,0],3] ORTE_ERROR_LOG: Not found in file 
>> base/ess_base_std_orted.c at line 362
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_pmix_base_select failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> ORTE does not know how to route a message to the specified daemon
>> located on the indicated node:
>>
>>   my node:   gr10b17n05
>>   target node:  gr10b17n06
>>
>> This is usually an internal programming error that should be
>> reported to the developers. In the meantime, a workaround may
>> be to set the MCA param routed=direct on the command line or
>> in your environment. We apologize for the problem.
>> --------------------------------------------------------------------------
>> [gr10b17n05:171586] 1 more process has sent help message 
>> help-errmgr-base.txt / no-path
>> [gr10b17n05:171586] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>> all help / error messages
>> c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes >
>>
>> --------------------------
>> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np 1 
>> --machinefile machine_file mpihello
>> [gr10pbs2:242828] [[60566,0],0] ORTE_ERROR_LOG: Not found in file 
>> ess_hnp_module.c at line 320
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_pmix_base_select failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np 1 
>> --machinefile machine_file mpihello
>> [gr10pbs2:237314] [[50968,0],0] ORTE_ERROR_LOG: Not found in file 
>> ess_hnp_module.c at line 320
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_pmix_base_select failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes >
>>
>> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun -N4 
>> /bw1nfs1/Projetos1/c315/Meus_testes/mpihello
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [gr10b17n05:172693] Local abort before MPI_INIT completed completed 
>> successfully, but am not able to aggregate error messages, and not able to 
>> guarantee that all other processes were killed!
>> srun: error: gr10b17n05: task 0: Exited with exit code 1
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   getting job size failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   getting job size failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [gr10b17n07:116175] Local abort before MPI_INIT completed completed 
>> successfully, but am not able to aggregate error messages, and not able to 
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [gr10b17n06:142082] Local abort before MPI_INIT completed completed 
>> successfully, but am not able to aggregate error messages, and not able to 
>> guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   getting job size failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [gr10b17n08:143134] Local abort before MPI_INIT completed completed 
>> successfully, but am not able to aggregate error messages, and not able to 
>> guarantee that all other processes were killed!
>> srun: error: gr10b17n07: task 2: Exited with exit code 1
>> srun: error: gr10b17n06: task 1: Exited with exit code 1
>> srun: error: gr10b17n08: task 3: Exited with exit code 1
>> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes >
>>
>> Slurm information:
>> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun --mpi=list
>> srun: MPI types are...
>> srun: pmix_v3
>> srun: none
>> srun: pmi2
>> srun: pmix
>>
>> The rpmbuild commands used for PMIx and Open MPI:
>>
>> MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define 
>> "configure_options --enable-shared --enable-static --with-jansson=/usr 
>> --with-libevent=/usr --with-libevent-libdir=/usr/lib64 --with-hwloc=/usr 
>> --with-curl=/usr --without-opamgt --with-munge=/usr --with-lustre=/usr 
>> --enable-pmix-timing --enable-pmi-backward-compatibility 
>> --enable-pmix-binaries --with-devel-headers --with-tests-examples 
>> --disable-mca-dso --disable-weak-symbols 
>> AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar
>>  
>> LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild
>>  
>> CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
>>  
>> FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
>>  
>> F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
>>  
>> F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
>>  
>> CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc
>>  LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3  FCFLAGS=-O3 
>> F77FLAGS=-O3  F90FLAGS=-O3  CXXFLAGS=-O3 MFLAGS='-j24 V99'" pmix-3.1.5.spec
>>
>> MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define 
>> "configure_options --enable-shared --enable-static --with-libevent=/usr 
>> --with-libevent-libdir=/usr/lib64 --with-pmix=/usr 
>> --with-pmix-libdir=/usr/lib64 --enable-install-libpmix --with-ompi-pmix-rte 
>> --without-orte --with-slurm --with-ucx=/usr --with-cuda=/usr/local/cuda 
>> --with-gdrcopy=/usr --with-hwloc --enable-mpi-cxx --disable-mca-dso 
>> --enable-mpi-fortran --disable-weak-symbols --enable-mpi-thread-multiple 
>> --enable-contrib-no-build=vt --enable-mpirun-prefix-by-default 
>> --enable-orterun-prefix-by-default --with-cuda=/usr/local/cuda 
>> AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar
>>  
>> LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild
>>  
>> CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
>>  
>> FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
>>  
>> F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
>>  
>> F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
>>  
>> CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc
>>  LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3  FCFLAGS=-O3 
>> F77FLAGS=-O3  F90FLAGS=-O3  CXXFLAGS=-O3 MFLAGS='-j24 V99'" 
>> openmpi-4.0.2.spec 2>&1 | tee /root/openmpi-2.log
>>
>> ---
>> Leandro
>>
>>
