I do not know much about vader, but a couple of my pull requests
concerning exactly this were recently merged:

https://github.com/open-mpi/ompi/pull/6844
https://github.com/open-mpi/ompi/pull/6997

The changes in these pull requests detect whether different Open MPI
processes are running in different user namespaces (for example in
containers); if they are, Open MPI automatically falls back to 'none'
instead of 'cma'.
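
For completeness, the fallback uses the same MCA parameter your users
are already exporting; something like the following (untested here,
exact ompi_info output varies between versions) forces it for a single
run or shows the current default:

    # force copy-in/copy-out instead of CMA for one run
    $ mpirun --mca btl_vader_single_copy_mechanism none ./a.out

    # show the available mechanisms and the current default
    $ ompi_info --param btl vader --level 9 | grep single_copy_mechanism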

To know whether the error you are seeing is related to my change, it
would be good to see the complete error message your users are getting
and to know whether containers are being used.
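
To check the container part yourself, comparing the user namespace of
the processes is enough; a rough sketch (assuming a Linux /proc with
namespace links) is below. Different namespace IDs between processes
mean the fallback from the pull requests above should kick in:

    # prints something like user:[4026531837]; different numbers mean
    # different user namespaces
    $ readlink /proc/self/ns/user

    # or compared across the actual MPI processes
    $ mpirun -np 2 sh -c 'echo "$(hostname) $(readlink /proc/self/ns/user)"'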

                Adrian

On Mon, Feb 24, 2020 at 01:41:48PM -0500, Bennet Fauber via users wrote:
> We are getting errors on our system that indicate that we should
> 
>     export OMPI_MCA_btl_vader_single_copy_mechanism=none
> 
> Our user originally reported
> 
> > This occurs for both GCC and PGI.  The errors we get if we do not set this
> > indicate something is going wrong in our communication which uses RMA,
> > specifically a call to MPI_Get().
> 
> Kernel version
> 
> $ uname -r
> 3.10.0-957.10.1.el7.x86_64
> 
> $ ompi_info | grep vader
>                  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.2)
> 
> Our config.log file begins,
> 
> It was created by Open MPI configure 4.0.2, which was
> generated by GNU Autoconf 2.69.  Invocation command line was
> 
>   $ ./configure --prefix=/sw/arcts/centos7/stacks/gcc/8.2.0/openmpi/4.0.2 \
>     --with-pmix=/opt/pmix/2.1.3 --with-libevent=external --with-hwloc=/usr \
>     --with-slurm --without-verbs --enable-shared --with-ucx CC=gcc FC=gfortran
> 
> and that resulted in this summary at the conclusion of configuration.
> 
> Open MPI configuration:
> -----------------------
> Version: 4.0.2
> Build MPI C bindings: yes
> Build MPI C++ bindings (deprecated): no
> Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
> Build MPI Java bindings (experimental): no
> Build Open SHMEM support: yes
> Debug build: no
> Platform file: (none)
> 
> Miscellaneous
> -----------------------
> CUDA support: no
> HWLOC support: external
> Libevent support: external
> PMIx support: External (2x)
> 
> Transports
> -----------------------
> Cisco usNIC: no
> Cray uGNI (Gemini/Aries): no
> Intel Omnipath (PSM2): no
> Intel TrueScale (PSM): no
> Mellanox MXM: no
> Open UCX: yes
> OpenFabrics OFI Libfabric: no
> OpenFabrics Verbs: no
> Portals4: no
> Shared memory/copy in+copy out: yes
> Shared memory/Linux CMA: yes
> Shared memory/Linux KNEM: no
> Shared memory/XPMEM: no
> TCP: yes
> 
> Resource Managers
> -----------------------
> Cray Alps: no
> Grid Engine: no
> LSF: no
> Moab: no
> Slurm: yes
> ssh/rsh: yes
> Torque: no
> 
> OMPIO File Systems
> -----------------------
> Generic Unix FS: yes
> Lustre: no
> PVFS2/OrangeFS: no
> 
> It seems that the CMA single-copy mechanism should be able to work,
> but it does not.
> 
> Our system is running Slurm, and we have configured Slurm to use
> cgroups.  I do not know whether this problem arises only within a job
> or also on a login node.
> 
> Anyone know what else I might need to do to enable it?
> 
> Thanks in advance,    -- bennet
