I do not know much about vader, but one of my pull requests was recently
merged concerning exactly this:

https://github.com/open-mpi/ompi/pull/6844
https://github.com/open-mpi/ompi/pull/6997

The changes in these pull requests detect whether different Open MPI
processes are running in different user namespaces (for example, in
containers); if they are, the single-copy mechanism automatically falls
back to 'none' instead of 'cma'.
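To make the idea concrete, here is a minimal standalone sketch (my own
illustration, not the actual code from those PRs): CMA's
process_vm_readv()/process_vm_writev() are typically refused between
processes in different user namespaces, and a process's user namespace
can be identified by the inode of /proc/<pid>/ns/user:

    /* Sketch only -- not the Open MPI implementation.  Compares the
     * user namespace of this process with that of a peer pid. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Return the user-namespace id (inode) of a process, 0 on error. */
    static ino_t user_ns_id(pid_t pid)
    {
        char path[64];
        struct stat st;

        snprintf(path, sizeof(path), "/proc/%d/ns/user", (int)pid);
        return (stat(path, &st) == 0) ? st.st_ino : 0;
    }

    int main(int argc, char *argv[])
    {
        /* Peer pid from the command line; defaults to our own pid. */
        pid_t peer = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
        ino_t mine = user_ns_id(getpid());

        if (mine != 0 && mine == user_ns_id(peer)) {
            puts("same user namespace: CMA should be usable");
        } else {
            puts("different (or unknown) user namespaces: "
                 "fall back to single_copy_mechanism=none");
        }
        return 0;
    }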
To tell whether the error you are seeing is related to my change, it
would be good to see the complete error message your users are getting,
and to know whether containers are being used.

		Adrian

On Mon, Feb 24, 2020 at 01:41:48PM -0500, Bennet Fauber via users wrote:
> We are getting errors on our system that indicate that we should
>
> export OMPI_MCA_btl_vader_single_copy_mechanism=none
>
> Our user originally reported
>
> > This occurs for both GCC and PGI. The errors we get if we do not set this
> > indicate something is going wrong in our communication which uses RMA,
> > specifically a call to MPI_Get().
>
> Kernel version
>
> $ uname -r
> 3.10.0-957.10.1.el7.x86_64
>
> $ ompi_info | grep vader
> MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.2)
>
> Our config.log file begins,
>
> It was created by Open MPI configure 4.0.2, which was
> generated by GNU Autoconf 2.69. Invocation command line was
>
> $ ./configure --prefix=/sw/arcts/centos7/stacks/gcc/8.2.0/openmpi/4.0.2 \
>   --with-pmix=/opt/pmix/2.1.3 --with-libevent=external --with-hwloc=/usr \
>   --with-slurm --without-verbs --enable-shared --with-ucx CC=gcc FC=gfortran
>
> and that resulted in this summary at the conclusion of configuration.
>
> Open MPI configuration:
> -----------------------
> Version: 4.0.2
> Build MPI C bindings: yes
> Build MPI C++ bindings (deprecated): no
> Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
> MPI Build Java bindings (experimental): no
> Build Open SHMEM support: yes
> Debug build: no
> Platform file: (none)
>
> Miscellaneous
> -----------------------
> CUDA support: no
> HWLOC support: external
> Libevent support: external
> PMIx support: External (2x)
>
> Transports
> -----------------------
> Cisco usNIC: no
> Cray uGNI (Gemini/Aries): no
> Intel Omnipath (PSM2): no
> Intel TrueScale (PSM): no
> Mellanox MXM: no
> Open UCX: yes
> OpenFabrics OFI Libfabric: no
> OpenFabrics Verbs: no
> Portals4: no
> Shared memory/copy in+copy out: yes
> Shared memory/Linux CMA: yes
> Shared memory/Linux KNEM: no
> Shared memory/XPMEM: no
> TCP: yes
>
> Resource Managers
> -----------------------
> Cray Alps: no
> Grid Engine: no
> LSF: no
> Moab: no
> Slurm: yes
> ssh/rsh: yes
> Torque: no
>
> OMPIO File Systems
> -----------------------
> Generic Unix FS: yes
> Lustre: no
> PVFS2/OrangeFS: no
>
> It seems that the MCA mechanism should be able to work, but it does not.
>
> Our system is running Slurm, and we have configured Slurm to use
> cgroups. I do not know whether this problem arises only within a job
> or also on a login node.
>
> Anyone know what else I might need to do to enable it?
>
> Thanks, in advance,
>
> -- bennet
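P.S. A minimal reproducer would also help pin this down. Something
along these lines (a hypothetical sketch of my own, not code from the
report) should exercise the same on-node MPI_Get path through vader
and, by default, CMA when both ranks share a node:

    /* Hypothetical reproducer: rank 1 reads rank 0's buffer with
     * MPI_Get over an RMA window.  Build and run on a single node,
     * e.g.:  mpicc rma_get.c -o rma_get && mpirun -n 2 ./rma_get  */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, src = 42, dst = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank exposes its 'src' integer in the window. */
        MPI_Win_create(&src, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 1) {
            /* One int from rank 0's window -- the kind of call that
             * reportedly fails when CMA is blocked. */
            MPI_Get(&dst, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1) {
            printf("rank 1 got %d from rank 0\n", dst);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

If that fails without OMPI_MCA_btl_vader_single_copy_mechanism=none and
succeeds with it, the full error text from the failing run is exactly
what would tell us whether the namespace fallback is involved.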