Well, I have to build these packages myself because of the specific compilers
and some options our developers need. I have been doing this for years and it
has always worked fine. But nowadays I am running Torque and Maui on the
clusters, and these are starting to give me problems.

The Open MPI we use now with Torque is built against a custom-built UCX, a
compatible version of hwloc (the one in the CentOS/EPEL repos does not work),
CUDA, and the Mellanox drivers. All of this works and is in production.
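
Roughly, that Torque build is configured along these lines (the paths here
are only illustrative, not the exact ones on our systems):

./configure --with-tm=/opt/torque \
            --with-ucx=/opt/ucx \
            --with-hwloc=/opt/hwloc \
            --with-cuda=/usr/local/cuda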

All I want is to do the same thing, but with Slurm. Slurm is new to me, but
according to the documentation it needs PMIx to work. I don't know what I am
missing here.
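
Stripped of the rpm spec machinery, the Open MPI configure options I am
trying for the Slurm build boil down to roughly this (the full rpmbuild
lines are quoted at the end of this message):

./configure --with-slurm \
            --with-pmix=/usr --with-pmix-libdir=/usr/lib64 \
            --with-ompi-pmix-rte \
            --with-libevent=/usr --with-libevent-libdir=/usr/lib64 \
            --with-hwloc --with-ucx=/usr --with-cuda=/usr/local/cuda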

I have been at this for days and don't know what else to try.

Any help would be appreciated.

---
*Leandro*


On Mon, May 11, 2020 at 8:28 PM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> I'm not sure I understand why you are trying to build CentOS rpms for
> PMIx, Slurm, or OMPI - all three are readily available online. Is there
> some particular reason you are trying to do this yourself? I ask because it
> is non-trivial to do and requires significant familiarity with both the
> intricacies of rpm building and the packages involved.
>
>
> On May 11, 2020, at 6:23 AM, Leandro via users <users@lists.open-mpi.org>
> wrote:
>
> Hi,
>
> I'm trying to start using Slurm, and I followed all the instructions to
> build PMIx and Slurm with PMIx support, but I can't get Open MPI to work.
>
> According to the PMIx documentation, I should compile Open MPI using
> "--with-ompi-pmix-rte", but when I tried, it failed. I need to build this
> as CentOS RPMs.
>
> Thanks in advance for your help. I pasted some info below.
>
> libtool: link:
> /tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
> -std=gnu99 -std=gnu99 -DOPAL_CONFIGURE_USER=\"root\"
> -DOPAL_CONFIGURE_HOST=\"gr10b17n05\" "-DOPAL_CONFIGURE_DATE=\"Fri May  8
> 13:35:51 -03 2020\"" -DOMPI_BUILD_USER=\"root\"
> -DOMPI_BUILD_HOST=\"gr10b17n05\" "-DOMPI_BUILD_DATE=\"Fri May  8 13:47:32
> -03 2020\"" "-DOMPI_BUILD_CFLAGS=\"-DNDEBUG -O3 -finline-functions
> -fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types
> -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../.. -I../../../orte/include
>  \"" "-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG -O3 -finline-functions -pthread\""
> "-DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \"" "-DOMPI_BUILD_FFLAGS=\"-O2 -g
> -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
> --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic
> -I/usr/lib64/gfortran/modules\"" -DOMPI_BUILD_FCFLAGS=\"-O3\"
> "-DOMPI_BUILD_LDFLAGS=\"-Wc,-static-intel -static-intel    -L/usr/lib64\""
> "-DOMPI_BUILD_LIBS=\"-lrt -lutil  -lz  -lhwloc  -levent -levent_pthreads\""
> -DOPAL_CC_ABSOLUTE=\"\" -DOMPI_CXX_ABSOLUTE=\"none\" -DNDEBUG -O3
> -finline-functions -fno-strict-aliasing -restrict
> -Qoption,cpp,--extended_float_types -pthread -static-intel -static-intel -o
> .libs/ompi_info ompi_info.o param.o  -L/usr/lib64
> ../../../ompi/.libs/libmpi.so -L/usr/lib -llustreapi
> /root/rpmbuild/BUILD/openmpi-4.0.2/opal/.libs/libopen-pal.so
> ../../../opal/.libs/libopen-pal.so -lfabric -lucp -lucm -lucs -luct
> -lrdmacm -libverbs /usr/lib64/libpmix.so -lmunge -lrt -lutil -lz
> /usr/lib64/libhwloc.so -lm -ludev -lltdl -levent -levent_pthreads -pthread
> -Wl,-rpath -Wl,/usr/lib64
> icc: warning #10237: -lcilkrts linked in dynamically, static library not
> available
> ../../../ompi/.libs/libmpi.so: undefined reference to `orte_process_info'
> ../../../ompi/.libs/libmpi.so: undefined reference to `orte_show_help'
> make[2]: *** [ompi_info] Error 1
> make[2]: Leaving directory
> `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi/tools/ompi_info'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi'
> make: *** [all-recursive] Error 1
> error: Bad exit status from /var/tmp/rpm-tmp.RyklCR (%build)
>
> The ORTE libraries are missing. When I don't use "--with-ompi-pmix-rte" it
> builds, but neither mpirun nor srun works:
>
> c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > cat machine_file
> gr10b17n05
> gr10b17n06
> gr10b17n07
> gr10b17n08
> c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun -machinefile
> machine_file ./mpihello
> [gr10b17n07:115065] [[21391,0],2] ORTE_ERROR_LOG: Not found in file
> base/ess_base_std_orted.c at line 362
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_pmix_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp
> (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to
> use.
>
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> [gr10b17n08:142030] [[21391,0],3] ORTE_ERROR_LOG: Not found in file
> base/ess_base_std_orted.c at line 362
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_pmix_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
>
>   my node:   gr10b17n05
>   target node:  gr10b17n06
>
> This is usually an internal programming error that should be
> reported to the developers. In the meantime, a workaround may
> be to set the MCA param routed=direct on the command line or
> in your environment. We apologize for the problem.
> --------------------------------------------------------------------------
> [gr10b17n05:171586] 1 more process has sent help message
> help-errmgr-base.txt / no-path
> [gr10b17n05:171586] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes >
>
> --------------------------
> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np
> 1 --machinefile machine_file mpihello
> [gr10pbs2:242828] [[60566,0],0] ORTE_ERROR_LOG: Not found in file
> ess_hnp_module.c at line 320
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_pmix_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np
> 1 --machinefile machine_file mpihello
> [gr10pbs2:237314] [[50968,0],0] ORTE_ERROR_LOG: Not found in file
> ess_hnp_module.c at line 320
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_pmix_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes >
>
> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun -N4
> /bw1nfs1/Projetos1/c315/Meus_testes/mpihello
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [gr10b17n05:172693] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> srun: error: gr10b17n05: task 0: Exited with exit code 1
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   getting job size failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   getting job size failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_init failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_init failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> [gr10b17n07:116175] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [gr10b17n06:142082] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   getting job size failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_init failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [gr10b17n08:143134] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> srun: error: gr10b17n07: task 2: Exited with exit code 1
> srun: error: gr10b17n06: task 1: Exited with exit code 1
> srun: error: gr10b17n08: task 3: Exited with exit code 1
> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes >
>
> Slurm information:
> c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun --mpi=list
> srun: MPI types are...
> srun: pmix_v3
> srun: none
> srun: pmi2
> srun: pmix
>
> The build commands used for PMIx and Open MPI:
>
> MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define
> "configure_options --enable-shared --enable-static --with-jansson=/usr
> --with-libevent=/usr --with-libevent-libdir=/usr/lib64 --with-hwloc=/usr
> --with-curl=/usr --without-opamgt --with-munge=/usr --with-lustre=/usr
> --enable-pmix-timing --enable-pmi-backward-compatibility
> --enable-pmix-binaries --with-devel-headers --with-tests-examples
> --disable-mca-dso --disable-weak-symbols
> AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar
> LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild
> CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
> FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
> F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
> F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
> CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc
> LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3  FCFLAGS=-O3
> F77FLAGS=-O3  F90FLAGS=-O3  CXXFLAGS=-O3 MFLAGS='-j24 V99'" pmix-3.1.5.spec
>
> MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define
> "configure_options --enable-shared --enable-static --with-libevent=/usr
> --with-libevent-libdir=/usr/lib64 --with-pmix=/usr
> --with-pmix-libdir=/usr/lib64 --enable-install-libpmix --with-ompi-pmix-rte
> --without-orte --with-slurm --with-ucx=/usr --with-cuda=/usr/local/cuda
> --with-gdrcopy=/usr --with-hwloc --enable-mpi-cxx --disable-mca-dso
> --enable-mpi-fortran --disable-weak-symbols --enable-mpi-thread-multiple
> --enable-contrib-no-build=vt --enable-mpirun-prefix-by-default
> --enable-orterun-prefix-by-default --with-cuda=/usr/local/cuda
> AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar
> LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild
> CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
> FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
> F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
> F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
> CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc
> LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3  FCFLAGS=-O3
> F77FLAGS=-O3  F90FLAGS=-O3  CXXFLAGS=-O3 MFLAGS='-j24 V99'"
> openmpi-4.0.2.spec 2>&1 | tee /root/openmpi-2.log
>
> ---
> Leandro
>
>
>
