Re: [OMPI users] Error initialising an OpenFabrics device.
I’ve begun getting this annoyingly generic warning, too. It appears to be coming from the openib component. If you disable it with --mca btl ^openib (openib is a BTL component, not an MTL, so the `-mtl` spelling won't match it) the warning goes away; inter-node traffic then falls back to TCP unless UCX is available.

Sent from my iPad

> On Mar 13, 2021, at 3:28 PM, Bob Beattie via users wrote:
>
> [quoted text trimmed]
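For reference, the usual command-line knobs look roughly like this. This is a sketch, not verbatim from the thread: `my_app` is a placeholder application, and the flag spellings are those of the Open MPI 4.0.x series (the reply above is usually written `--mca btl ^openib`, since openib is a BTL rather than an MTL).

```shell
# Exclude the openib BTL entirely; the warning disappears and
# inter-node traffic falls back to TCP (or UCX, if built in).
mpirun --mca btl ^openib -np 20 ./my_app

# Or keep openib but explicitly allow InfiniBand, which the 4.0.x
# openib BTL may leave disabled by default in favour of UCX.
mpirun --mca btl_openib_allow_ib true -np 20 ./my_app

# To see every copy of the help message instead of the aggregate
# (this is the parameter the warning itself suggests):
mpirun --mca orte_base_help_aggregate 0 -np 20 ./my_app
```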
Re: [OMPI users] Allreduce with Op
Thanks George,
Pierre

> On 13 Mar 2021, at 22:24, George Bosilca wrote:
>
> [quoted text trimmed]
Re: [OMPI users] Allreduce with Op
Hi Pierre,

MPI is allowed to pipeline the collective communications. This explains why the MPI_Op takes the len of the buffers as an argument. Because your MPI_Op ignores this length, it alters data outside the temporary buffer we use for the segment. Other versions of the MPI_Allreduce implementation might choose not to pipeline, in which case applying the MPI_Op to the entire length of the buffer (as you manually did in your code) is correct.

George.

On Sat, Mar 13, 2021 at 4:47 AM Pierre Jolivet via users <users@lists.open-mpi.org> wrote:

> Hello,
> The following piece of code generates Valgrind errors with OpenMPI 4.1.0, while it is Valgrind-clean with MPICH and OpenMPI 4.0.5. I don’t think I’m doing anything illegal, so could this be a regression introduced in 4.1.0?
>
> Thanks,
> Pierre
>
> $ /opt/openmpi-4.1.0/bin/mpicxx ompi.cxx -g -O0 -std=c++11
> $ /opt/openmpi-4.1.0/bin/mpirun -n 4 valgrind --log-file=dump.%p.log ./a.out
>
> [Valgrind output trimmed]
[OMPI users] Error initialising an OpenFabrics device.
Hi everyone,

To be honest, as an MPI / IB noob, I don't know if this falls under OpenMPI or Mellanox.

Am running a small cluster of HP DL380 G6/G7 machines. Each runs Ubuntu Server 20.04 and has a Mellanox ConnectX-3 card, connected by an IS dumb switch. When I begin my MPI program (snappyHexMesh for OpenFOAM) I get an error reported. The error doesn't stop my programs or appear to cause any problems, so this request for help is more about delving into the why.

OMPI is compiled from source using v4.0.3, which is the default version for Ubuntu 20.04. This compiles and works. I did this because I wanted to understand the compilation process whilst using a known working OMPI version.

The InfiniBand part is the Mellanox MLNXOFED installer v4.9-0.1.7.0, and I install that with --dkms --without-fw-update --hpc --with-nfsrdma

The actual error reported is:

Warning: There was an error initialising an OpenFabrics device.
  Local host:   of1
  Local device: mlx4_0

Then shortly after:

[of1:1015399] 19 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[of1:1015399] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Adding this MCA parameter to the mpirun line simply gives me 20 or so copies of the first warning.

Any ideas anyone?
Cheers,
Bob.
[OMPI users] Allreduce with Op
Hello,
The following piece of code generates Valgrind errors with OpenMPI 4.1.0, while it is Valgrind-clean with MPICH and OpenMPI 4.0.5. I don’t think I’m doing anything illegal, so could this be a regression introduced in 4.1.0?

Thanks,
Pierre

$ /opt/openmpi-4.1.0/bin/mpicxx ompi.cxx -g -O0 -std=c++11
$ /opt/openmpi-4.1.0/bin/mpirun -n 4 valgrind --log-file=dump.%p.log ./a.out

[attachment: ompi.cxx]

==528== Invalid read of size 2
==528==    at 0x4011EB: main::{lambda(void*, void*, int*, ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**) const (ompi.cxx:15)
==528==    by 0x40127B: main::{lambda(void*, void*, int*, ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**) (ompi.cxx:19)
==528==    by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==    by 0x48AAF00: PMPI_Allreduce (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x401317: main (ompi.cxx:21)
==528== Address 0x7139f74 is 0 bytes after a block of size 4 alloc'd
==528==    at 0x4839809: malloc (vg_replace_malloc.c:307)
==528==    by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==    by 0x48AAF00: PMPI_Allreduce (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x401317: main (ompi.cxx:21)
==528==
==528== Invalid read of size 2
==528==    at 0x40120E: main::{lambda(void*, void*, int*, ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**) const (ompi.cxx:16)
==528==    by 0x40127B: main::{lambda(void*, void*, int*, ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**) (ompi.cxx:19)
==528==    by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==    by 0x48AAF00: PMPI_Allreduce (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x401317: main (ompi.cxx:21)
==528== Address 0x7139f76 is 2 bytes after a block of size 4 alloc'd
==528==    at 0x4839809: malloc (vg_replace_malloc.c:307)
==528==    by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==    by 0x48AAF00: PMPI_Allreduce (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x401317: main (ompi.cxx:21)
==528==
==528== Invalid read of size 2
==528==    at 0x401231: main::{lambda(void*, void*, int*, ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**) const (ompi.cxx:18)
==528==    by 0x40127B: main::{lambda(void*, void*, int*, ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**) (ompi.cxx:19)
==528==    by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==    by 0x48AAF00: PMPI_Allreduce (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x401317: main (ompi.cxx:21)
==528== Address 0x7139f78 is 4 bytes after a block of size 4 alloc'd
==528==    at 0x4839809: malloc (vg_replace_malloc.c:307)
==528==    by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in /opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==    by 0x48AAF00: PMPI_Allreduce (in /opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==    by 0x401317: main (ompi.cxx:21)