We are getting errors on our system indicating that we should export OMPI_MCA_btl_vader_single_copy_mechanism=none.
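For reference, there are (as far as I know) two equivalent ways to apply that workaround, either as an environment variable or as an mpirun option; ./your_app below is just a placeholder for the actual program:

$ export OMPI_MCA_btl_vader_single_copy_mechanism=none
$ mpirun -n 2 ./your_app

or, per invocation:

$ mpirun --mca btl_vader_single_copy_mechanism none -n 2 ./your_app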
Our user originally reported:

> This occurs for both GCC and PGI. The errors we get if we do not set this
> indicate something is going wrong in our communication which uses RMA,
> specifically a call to MPI_Get().

Kernel version:

$ uname -r
3.10.0-957.10.1.el7.x86_64

$ ompi_info | grep vader
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.2)

Our config.log file begins,

It was created by Open MPI configure 4.0.2, which was generated by
GNU Autoconf 2.69. Invocation command line was

$ ./configure --prefix=/sw/arcts/centos7/stacks/gcc/8.2.0/openmpi/4.0.2 \
    --with-pmix=/opt/pmix/2.1.3 --with-libevent=external --with-hwloc=/usr \
    --with-slurm --without-verbs --enable-shared --with-ucx CC=gcc FC=gfortran

and that resulted in this summary at the conclusion of configuration:

Open MPI configuration:
-----------------------
Version: 4.0.2
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: no
HWLOC support: external
Libevent support: external
PMIx support: External (2x)

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: yes
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
-----------------------
Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no

It seems that the CMA single-copy mechanism should be able to work, but it does not. Our system is running Slurm, and we have configured Slurm to use cgroups. I do not know whether this problem arises only within a job or also on a login node.

Anyone know what else I might need to do to enable it?

Thanks, in advance,

-- bennet
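P.S. In case it helps anyone reproduce this, below is a minimal sketch of the kind of RMA communication our user described (illustrative only, not our production code; the file name get_test.c is just a placeholder). Run it with two ranks on a single node so the vader shared-memory BTL is used:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank exposes a single int through an RMA window. */
    int local = 100 + rank;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int remote = -1;
    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        /* Read rank 1's value; this is the kind of MPI_Get() call
         * that reportedly fails without the single-copy workaround. */
        MPI_Get(&remote, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("rank 0 read %d from rank 1\n", remote);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

$ mpicc get_test.c -o get_test
$ mpirun -n 2 ./get_test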