I'm bringing up EasyBuild on our new AMD "Genoa" platform, which has one
AMD EPYC 9124 16-core processor with 2 threads/core, 384 GB RAM, an
Omni-Path (OPA) fabric, and AlmaLinux 8.8 as the OS.
I wiped our existing EB modules so as to start with a clean slate. The
goal is to build the foss-2023a toolchain as a starting point for further
modules.
I previously hit the same error as shown below with
OpenMPI-4.1.4-GCC-12.2.0.eb, and Kenneth suggested that the lack of
InfiniBand hardware might be the problem. I had an Omni-Path (OPA) fabric
adapter lying around, so I installed it in the system and verified that
IPoIB is working as expected.
Unfortunately, the build of OpenMPI-4.1.5-GCC-12.3.0.eb still fails with
the same "PML cm cannot be selected" error as before:
== 2023-10-03 09:36:16,437 build_log.py:171 ERROR EasyBuild crashed with
an error (at easybuild/base/exceptions.py:126 in __init__): Sanity check
failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n
8 /dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_c exited with code 1
(output: [e000.nifl.fysik.dtu.dk:2392967] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:2392963] PML cm cannot be selected
)
sanity check command mpirun -n 1
/dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_c exited with code 1
(output: [e000.nifl.fysik.dtu.dk:2392988] PML cm cannot be selected
)
sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8
/dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_mpifh exited with code 1
(output: [e000.nifl.fysik.dtu.dk:2393008] PML cm cannot be selected
)
sanity check command mpirun -n 1
/dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_mpifh exited with code 1
(output: [e000.nifl.fysik.dtu.dk:2393029] PML cm cannot be selected
)
sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8
/dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_usempi exited with code 1
(output: [e000.nifl.fysik.dtu.dk:2393042] PML cm cannot be selected
)
sanity check command mpirun -n 1
/dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_usempi exited with code 1
(output: [e000.nifl.fysik.dtu.dk:2393070] PML cm cannot be selected
) (at easybuild/framework/easyblock.py:3655 in _sanity_check_step)
== 2023-10-03 09:36:16,437 build_log.py:267 INFO ... (took 5 secs)
== 2023-10-03 09:36:16,437 filetools.py:2012 INFO Removing lock
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.5-GCC-12.3.0.lock...
== 2023-10-03 09:36:16,438 filetools.py:383 INFO Path
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.5-GCC-12.3.0.lock
successfully removed.
== 2023-10-03 09:36:16,438 filetools.py:2016 INFO Lock removed:
/home/modules/software/.locks/_home_modules_software_OpenMPI_4.1.5-GCC-12.3.0.lock
== 2023-10-03 09:36:16,438 easyblock.py:4277 WARNING build failed (first
300 chars): Sanity check failed: sanity check command
OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 8
/dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_c exited with code 1
(output: [e000.nifl.fysik.dtu.dk:2392967] PML cm cannot be selected
[e000.nifl.fysik.dtu.dk:2392963] PML cm cannot be selected
)
sanity chec
== 2023-10-03 09:36:16,438 easyblock.py:328 INFO Closing log for
application name OpenMPI version 4.1.5
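One way to narrow this down might be to ask Open MPI which PML and MTL
components were actually built, and why the "cm" PML (which drives MTL
components such as psm2 or ofi) is rejected at run time. The following is
only a minimal sketch: the function name diagnose_pml is hypothetical, and it
assumes that ompi_info and mpirun from the failed installation are on the
PATH and that the test binary is still present.

```shell
# Hypothetical helper (not part of EasyBuild or Open MPI).
# Usage: diagnose_pml /path/to/mpi_test_hello_c
diagnose_pml() {
    test_bin="$1"

    # Assumption: the Open MPI tools from the failed build are on PATH.
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info not found; put the OpenMPI installation on PATH first"
        return 1
    fi

    # List the PML and MTL components that were compiled in.
    ompi_info | grep -E 'MCA (pml|mtl):'

    # Rerun one rank with verbose component selection to see why
    # the cm PML cannot be selected.
    mpirun --mca pml_base_verbose 10 --mca mtl_base_verbose 10 \
        -n 1 "$test_bin"

    # Check whether the hello-world test runs when the ob1 PML is forced.
    mpirun --mca pml ob1 -n 1 "$test_bin"
}
```

For example: diagnose_pml /dev/shm/OpenMPI/4.1.5/GCC-12.3.0/mpi_test_hello_c
If the test runs with ob1 but not with cm, that would point at the MTL layer
(the OPA/PSM2 side) rather than at the CPU.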
Since we have now used the latest GCC 12.3.0 and have installed an OPA
fabric adapter, the problem would seem to be related to the AMD "Genoa"
hardware itself.
Does anyone have suggestions for building OpenMPI successfully on this
platform?
Thanks,
Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark