I have openmpi-3.0.1, pmix-1.2.4, and slurm-17.11.5 working well on a few clusters. For things like:
bill@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1
c7-18 c7-19
size= 1, 16384 hops, 2 nodes in 0.03 sec ( 2.00 us/hop) 1953 KB/sec

I've been having a tougher time getting openmpi-3.1, (external) pmix-2.1.1, and slurm-17.11.5 working together. Does anyone have a similar combination working?

I compiled the two with:

  ./configure --prefix=/share/apps/openmpi-3.1.0/gcc7 --with-pmix=/share/apps/pmix-2.1.1/gcc7 --with-libevent=external --disable-io-romio --disable-io-ompio

  ./configure --prefix=/share/apps/slurm-17.11.5/gcc7 --with-pmix=/share/apps/pmix-2.1.1/gcc7

Both config.logs look promising: no PMIx-related errors, and the expected variables are set, including the discovered PMIx flags. I did notice that the working Open MPI config had:

  #define OPAL_PMIX_V1 1

while the non-working Open MPI config had:

  #define OPAL_PMIX_V1 0

although that's not too surprising, since I'm compiling and linking against pmix-2.1.1. The other relevant variables set by configure:

  OPAL_CONFIGURE_CLI=' \'\''--prefix=/share/apps/openmpi-3.1.0/gcc7\'\'' \'\''--with-pmix=/share/apps/pmix-2.1.1/gcc7\'\'' \'\''--with-libevent=external\'\'' \'\''--disable-io-romio\'\'' \'\''--disable-io-ompio\'\'''
  opal_pmix_ext1x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
  opal_pmix_ext1x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'
  opal_pmix_ext1x_LIBS='-lpmix'
  opal_pmix_ext2x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
  opal_pmix_ext2x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'

Any hints on how to debug this? When I try to run:

bill@demon:~/relay$ mpicc -O3 relay.c -o relay
bill@demon:~/relay$ srun -N 2 -n 2 ./relay 1
[c2-50:01318] OPAL ERROR: Not initialized in file ext2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[c2-50:01318] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
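For what it's worth, the first things I plan to check are on the Slurm side: whether the PMIx plugin was actually built and registered, and whether forcing it explicitly makes any difference. A rough sketch of those checks (the pmix/pmix_v2 plugin names are my assumption of what should show up in the list):

  # list the MPI plugin types this Slurm install knows about
  srun --mpi=list

  # force the PMIx plugin explicitly instead of relying on MpiDefault
  srun --mpi=pmix_v2 -N 2 -n 2 ./relay 1

  # see what MpiDefault is set to on this cluster
  scontrol show config | grep -i MpiDefault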
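On the Open MPI side, I'll also verify that the external ext2x component really got built and that it links against my pmix-2.1.1 install rather than something else. A sketch of that check; the mca_pmix_ext2x.so filename and the lib/openmpi component directory are my assumptions about where the component lands:

  # see which pmix components this Open MPI build reports
  /share/apps/openmpi-3.1.0/gcc7/bin/ompi_info | grep -i pmix

  # confirm the ext2x component resolves to the external libpmix
  ldd /share/apps/openmpi-3.1.0/gcc7/lib/openmpi/mca_pmix_ext2x.so | grep -i pmix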
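If those both look right, I'll rerun with more verbosity from the OPAL pmix framework, using the OMPI_MCA_ environment-variable convention since this is a direct srun launch rather than mpirun (pmix_base_verbose is my guess at the relevant knob):

  OMPI_MCA_pmix_base_verbose=10 srun -N 2 -n 2 ./relay 1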