We are getting errors on our system that indicate that we should

    export OMPI_MCA_btl_vader_single_copy_mechanism=none
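
I believe the equivalent run-time form, for anyone who wants to try it directly on the command line rather than via the environment, is the following (./a.out is only a placeholder for the actual binary):

$ mpirun --mca btl_vader_single_copy_mechanism none -np 2 ./a.out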

Our user originally reported

> This occurs for both GCC and PGI.  The errors we get if we do not set this
> indicate something is going wrong in our communication which uses RMA,
> specifically a call to MPI_Get().

Kernel version

$ uname -r
3.10.0-957.10.1.el7.x86_64

$ ompi_info | grep vader
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.2)
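
In case it helps, I assume the mechanism that vader actually selects on this build can be inspected with something like the following; the parameter we are after should be btl_vader_single_copy_mechanism:

$ ompi_info --param btl vader --level 9 | grep single_copy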

Our config.log file begins,

It was created by Open MPI configure 4.0.2, which was
generated by GNU Autoconf 2.69.  Invocation command line was

  $ ./configure --prefix=/sw/arcts/centos7/stacks/gcc/8.2.0/openmpi/4.0.2 \
    --with-pmix=/opt/pmix/2.1.3 --with-libevent=external --with-hwloc=/usr \
    --with-slurm --without-verbs --enable-shared --with-ucx CC=gcc FC=gfortran

and that produced the following summary at the end of configuration.

Open MPI configuration:
-----------------------
Version: 4.0.2
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: no
HWLOC support: external
Libevent support: external
PMIx support: External (2x)

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: yes
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
-----------------------
Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no

It seems that the single-copy mechanism (CMA, per the summary above) should be able to work, but it does not.

Our system is running Slurm, and we have configured Slurm to use
cgroups.  I do not know whether this problem arises only within a job
or also on a login node.
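
If it would help to narrow that down, the comparison I have in mind is roughly the following, with rma_test standing in only as a placeholder name for a small RMA test program:

# within a Slurm allocation, where our cgroup settings apply
$ salloc --ntasks=2
$ mpirun -np 2 ./rma_test

# versus the same run on a login node, outside Slurm
$ mpirun -np 2 ./rma_test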

Does anyone know what else I might need to do to get the single-copy mechanism working?

Thanks in advance,    -- bennet
