Re: [OMPI users] Error initialising an OpenFabrics device.

2021-03-13 Thread Heinz, Michael William via users
I’ve begun getting this annoyingly generic warning, too. It appears to be 
coming from the openib BTL component. If you disable that component with 
--mca btl ^openib, the warning goes away.
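
For reference, the exclusion goes on the mpirun line like this (a sketch: 
openib lives in the btl framework, so it is excluded through --mca btl; 
the rank count and binary name are placeholders, not from the original 
report):

```shell
# Exclude the openib BTL component so Open MPI never tries to
# initialise it (the ^ prefix means "everything except these").
mpirun --mca btl ^openib -np 4 ./a.out
```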


> On Mar 13, 2021, at 3:28 PM, Bob Beattie via users  
> wrote:
> 
> Hi everyone,
> 
> [...]
> The actual error reported is:
> Warning: There was an error initialising an OpenFabrics device.
>   Local host: of1
>   Local device: mlx4_0
> [...]
> Any ideas, anyone?
> Cheers,
> Bob.


Re: [OMPI users] Allreduce with Op

2021-03-13 Thread Pierre Jolivet via users
Thanks George,
Pierre

> On 13 Mar 2021, at 22:24, George Bosilca  wrote:
> 
> Hi Pierre,
> 
> MPI is allowed to pipeline the collective communications. This explains why 
> the MPI_Op takes the len of the buffers as an argument. Because your MPI_Op 
> ignores this length, it alters data outside the temporary buffer we use for 
> the segment. Other versions of the MPI_Allreduce implementation might choose 
> not to pipeline, in which case applying the MPI_Op on the entire length of 
> the buffer (as you manually did in your code) is correct.
> 
>   George.
> 
> [...]


Re: [OMPI users] Allreduce with Op

2021-03-13 Thread George Bosilca via users
Hi Pierre,

MPI is allowed to pipeline the collective communications. This explains why
the MPI_Op takes the len of the buffers as an argument. Because your MPI_Op
ignores this length, it alters data outside the temporary buffer we use for
the segment. Other versions of the MPI_Allreduce implementation might
choose not to pipeline, in which case applying the MPI_Op on the entire
length of the buffer (as you manually did in your code) is correct.

  George.


On Sat, Mar 13, 2021 at 4:47 AM Pierre Jolivet via users <
users@lists.open-mpi.org> wrote:

> Hello,
> The following piece of code generates Valgrind errors with OpenMPI 4.1.0,
> while it is Valgrind-clean with MPICH and OpenMPI 4.0.5.
> I don’t think I’m doing anything illegal, so could this be a regression
> introduced in 4.1.0?
>
> Thanks,
> Pierre
>
> $ /opt/openmpi-4.1.0/bin/mpicxx ompi.cxx -g -O0 -std=c++11
> $ /opt/openmpi-4.1.0/bin/mpirun -n 4 valgrind --log-file=dump.%p.log
> ./a.out
>
> [...]


[OMPI users] Error initialising an OpenFabrics device.

2021-03-13 Thread Bob Beattie via users

Hi everyone,

To be honest, as an MPI / IB noob, I don't know if this falls under 
OpenMPI or Mellanox.

I'm running a small cluster of HP DL380 G6/G7 machines.
Each runs Ubuntu server 20.04 and has a Mellanox ConnectX-3 card, 
connected by an IS dumb switch.
When I begin my MPI program (snappyHexMesh for OpenFOAM) I get an error 
reported.
The error doesn't stop my programs or appear to cause any problems, so 
this request for help is more about delving into the why.


OMPI is compiled from source using v4.0.3, which is the default version 
for Ubuntu 20.04.
This compiles and works.  I did this because I wanted to understand the 
compilation process whilst using a known working OMPI version.

The InfiniBand part is the Mellanox MLNX_OFED installer v4.9-0.1.7.0, and 
I install that with --dkms --without-fw-update --hpc --with-nfsrdma


The actual error reported is:
Warning: There was an error initialising an OpenFabrics device.
  Local host: of1
  Local device: mlx4_0

Then shortly after:
[of1:1015399] 19 more processes have sent help message 
help-mpi-btl-openib.txt / error in device init
[of1:1015399] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages


Adding this MCA parameter to the mpirun line simply gives me 20 or so 
copies of the first warning.
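
For the record, this is roughly the invocation meant above (a sketch: the 
rank count and the snappyHexMesh arguments are hypothetical placeholders, 
not taken from the actual job):

```shell
# Turn off help-message aggregation so every process reports
# its own copy of the warning instead of a summary line.
mpirun --mca orte_base_help_aggregate 0 -np 20 snappyHexMesh -parallel
```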


Any ideas, anyone?
Cheers,
Bob.


[OMPI users] Allreduce with Op

2021-03-13 Thread Pierre Jolivet via users
Hello,
The following piece of code generates Valgrind errors with OpenMPI 4.1.0, while 
it is Valgrind-clean with MPICH and OpenMPI 4.0.5.
I don’t think I’m doing anything illegal, so could this be a regression 
introduced in 4.1.0?

Thanks,
Pierre

$ /opt/openmpi-4.1.0/bin/mpicxx ompi.cxx -g -O0 -std=c++11
$ /opt/openmpi-4.1.0/bin/mpirun -n 4 valgrind --log-file=dump.%p.log ./a.out



[Attachment: ompi.cxx]


==528== Invalid read of size 2
==528==at 0x4011EB: main::{lambda(void*, void*, int*, 
ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**) const 
(ompi.cxx:15)
==528==by 0x40127B: main::{lambda(void*, void*, int*, 
ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**) (ompi.cxx:19)
==528==by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in 
/opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==by 0x48AAF00: PMPI_Allreduce (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x401317: main (ompi.cxx:21)
==528==  Address 0x7139f74 is 0 bytes after a block of size 4 alloc'd
==528==at 0x4839809: malloc (vg_replace_malloc.c:307)
==528==by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in 
/opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==by 0x48AAF00: PMPI_Allreduce (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x401317: main (ompi.cxx:21)
==528==
==528== Invalid read of size 2
==528==at 0x40120E: main::{lambda(void*, void*, int*, 
ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**) const 
(ompi.cxx:16)
==528==by 0x40127B: main::{lambda(void*, void*, int*, 
ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**) (ompi.cxx:19)
==528==by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in 
/opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==by 0x48AAF00: PMPI_Allreduce (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x401317: main (ompi.cxx:21)
==528==  Address 0x7139f76 is 2 bytes after a block of size 4 alloc'd
==528==at 0x4839809: malloc (vg_replace_malloc.c:307)
==528==by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in 
/opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==by 0x48AAF00: PMPI_Allreduce (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x401317: main (ompi.cxx:21)
==528==
==528== Invalid read of size 2
==528==at 0x401231: main::{lambda(void*, void*, int*, 
ompi_datatype_t**)#1}::operator()(void*, void*, int*, ompi_datatype_t**) const 
(ompi.cxx:18)
==528==by 0x40127B: main::{lambda(void*, void*, int*, 
ompi_datatype_t**)#1}::_FUN(void*, void*, int*, ompi_datatype_t**) (ompi.cxx:19)
==528==by 0x48EFFED: ompi_coll_base_allreduce_intra_ring (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in 
/opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==by 0x48AAF00: PMPI_Allreduce (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x401317: main (ompi.cxx:21)
==528==  Address 0x7139f78 is 4 bytes after a block of size 4 alloc'd
==528==at 0x4839809: malloc (vg_replace_malloc.c:307)
==528==by 0x48EF940: ompi_coll_base_allreduce_intra_ring (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x77CD93A: ompi_coll_tuned_allreduce_intra_dec_fixed (in 
/opt/openmpi-4.1.0/lib/openmpi/mca_coll_tuned.so)
==528==by 0x48AAF00: PMPI_Allreduce (in 
/opt/openmpi-4.1.0/lib/libmpi.so.40.30.0)
==528==by 0x401317: main (ompi.cxx:21)