Re: [OMPI users] MPI_type_free question

2020-12-03 Thread Patrick Bégou via users
Hi George and Gilles,

Thanks George for your suggestion. Does it also apply to the 4.0.5 and 3.1
OpenMPI versions? I will have a look today at these tables. Maybe I will
write a small piece of code that just creates and frees subarray datatypes.
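
Something along the lines of this minimal sketch (written in C here for
brevity, with arbitrary sizes and iteration count, while my real code is
Fortran):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Arbitrary placeholder shapes; the real code builds the
         * subarrays from the domain decomposition. */
        int sizes[3]    = {64, 64, 64};
        int subsizes[3] = {32, 32, 32};
        int starts[3]   = {0, 0, 0};

        MPI_Init(&argc, &argv);

        for (int step = 0; step < 10000; step++) {
            MPI_Datatype subtype;
            MPI_Type_create_subarray(3, sizes, subsizes, starts,
                                     MPI_ORDER_FORTRAN, MPI_DOUBLE,
                                     &subtype);
            MPI_Type_commit(&subtype);
            /* ... the real code uses the type in MPI_Alltoallw here ... */
            MPI_Type_free(&subtype);

            if (step % 1000 == 0)
                printf("step %d done\n", step);  /* watch resident memory */
        }

        MPI_Finalize();
        return 0;
    }

Watching the resident memory of the ranks while this loop runs (and
combining it with the index check Gilles describes below) should tell
whether the leak comes from the datatypes alone.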

Thanks Gilles for suggesting disabling the interconnect. It is a good,
fast test and yes, *with "mpirun --mca pml ob1 --mca btl tcp,self" I
have no memory leak*. So this explains the difference between my laptop
and the cluster.
Is the implementation of type management really so different from 1.7.3?

A PhD student told me he also has some trouble with this code on an
Omni-Path-based cluster. I will have to investigate that too, but I am
not sure it is the same problem.

Patrick

On 04/12/2020 at 01:34, Gilles Gouaillardet via users wrote:
> Patrick,
>
>
> based on George's idea, a simpler check is to retrieve the Fortran
> index via the (standard) MPI_Type_c2f() function
>
> after you create a derived datatype.
>
>
> If the index keeps growing forever even after you MPI_Type_free(),
> then this clearly indicates a leak.
>
> Unfortunately, this simple test cannot be used to definitively rule out
> any memory leak.
>
>
> Note you can also
>
> mpirun --mca pml ob1 --mca btl tcp,self ...
>
> in order to force communications over TCP/IP and hence rule out any
> memory leak that could be triggered by your fast interconnect.
>
>
>
> In any case, a reproducer will greatly help us debug this issue.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>> Patrick,
>>
>> I'm afraid there is no simple way to check this. The main reason is
>> that OMPI uses handles for MPI objects, and these handles are not
>> tracked by the library; they are supposed to be provided by the user
>> with each call. In your case, as you have already called MPI_Type_free
>> on the datatype, you cannot produce a valid handle.
>>
>> There might be a trick. If the datatype is manipulated with any
>> Fortran MPI function, then we convert the handle (which in fact is a
>> pointer) to an index into a pointer array structure. Thus, the index
>> remains in use and can be converted back into a valid datatype pointer
>> until OMPI completely releases the datatype. Look into the
>> ompi_datatype_f_to_c_table to see the datatypes that exist and get
>> their pointers, and then use these pointers as arguments to
>> ompi_datatype_dump() to see whether any of these existing datatypes
>> are the ones you defined.
>>
>> George.
>>
>>
>>
>>
>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>> <users@lists.open-mpi.org> wrote:
>>
>>     Hi,
>>
>>     I'm trying to solve a memory leak that appeared with my new
>>     implementation of communications based on MPI_Alltoallw and
>>     MPI_Type_create_subarray calls. Arrays of subarray types are
>>     created/destroyed at each time step and used for communications.
>>
>>     On my laptop the code runs fine (running for 15000 temporal
>>     iterations on 32 processes with oversubscription), but on our
>>     cluster the memory used by the code increases until the OOM killer
>>     stops the job. On the cluster we use IB QDR for communications.
>>
>>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>     OpenMPI (3.1 or 4.0.5 tested), same sources of the fortran code on
>>     the laptop and on the cluster.
>>
>>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
>>     show the problem (resident memory does not increase, and we ran
>>     10 temporal iterations).
>>
>>     The MPI_Type_free manual says that it "/Marks the datatype object
>>     associated with datatype for deallocation/". But how can I check
>>     that the deallocation has really been done?
>>
>>     Thanks for any suggestions.
>>
>>     Patrick
>>



Re: [OMPI users] MPI_type_free question

2020-12-03 Thread Gilles Gouaillardet via users

Patrick,


based on George's idea, a simpler check is to retrieve the Fortran index 
via the (standard) MPI_Type_c2f() function


after you create a derived datatype.


If the index keeps growing forever even after you MPI_Type_free(), then 
this clearly indicates a leak.
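
For example (a minimal, self-contained sketch with placeholder sizes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int sizes[2] = {64, 64}, subsizes[2] = {32, 32}, starts[2] = {0, 0};

        MPI_Init(&argc, &argv);
        for (int step = 0; step < 1000; step++) {
            MPI_Datatype t;
            MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                     MPI_ORDER_FORTRAN, MPI_DOUBLE, &t);
            MPI_Type_commit(&t);
            /* MPI_Type_c2f() returns the Fortran index of the handle.
             * If it keeps growing from one step to the next even though
             * the type is freed below, something is not being released. */
            if (step % 100 == 0)
                printf("step %d: Fortran index %d\n",
                       step, (int)MPI_Type_c2f(t));
            MPI_Type_free(&t);
        }
        MPI_Finalize();
        return 0;
    }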


Unfortunately, this simple test cannot be used to definitively rule out
any memory leak.



Note you can also

mpirun --mca pml ob1 --mca btl tcp,self ...

in order to force communications over TCP/IP and hence rule out any 
memory leak that could be triggered by your fast interconnect.




In any case, a reproducer will greatly help us debug this issue.


Cheers,


Gilles



On 12/4/2020 7:20 AM, George Bosilca via users wrote:

Patrick,

I'm afraid there is no simple way to check this. The main reason is that
OMPI uses handles for MPI objects, and these handles are not tracked by
the library; they are supposed to be provided by the user with each call.
In your case, as you have already called MPI_Type_free on the datatype,
you cannot produce a valid handle.


There might be a trick. If the datatype is manipulated with any
Fortran MPI function, then we convert the handle (which in fact is a
pointer) to an index into a pointer array structure. Thus, the index
remains in use and can be converted back into a valid datatype pointer
until OMPI completely releases the datatype. Look into the
ompi_datatype_f_to_c_table to see the datatypes that exist and get their
pointers, and then use these pointers as arguments to
ompi_datatype_dump() to see whether any of these existing datatypes are
the ones you defined.


George.




On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users 
<users@lists.open-mpi.org> wrote:


Hi,

I'm trying to solve a memory leak that appeared with my new
implementation of communications based on MPI_Alltoallw and
MPI_Type_create_subarray calls. Arrays of subarray types are
created/destroyed at each time step and used for communications.

On my laptop the code runs fine (running for 15000 temporal
iterations on 32 processes with oversubscription), but on our
cluster the memory used by the code increases until the OOM killer
stops the job. On the cluster we use IB QDR for communications.

Same Gcc/Gfortran 7.3 (built from sources), same sources of
OpenMPI (3.1 or 4.0.5 tested), same sources of the fortran code on
the laptop and on the cluster.

Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
show the problem (resident memory does not increase, and we ran
10 temporal iterations).

The MPI_Type_free manual says that it "/Marks the datatype object
associated with datatype for deallocation/". But how can I check
that the deallocation has really been done?

Thanks for any suggestions.

Patrick



Re: [OMPI users] MPI_type_free question

2020-12-03 Thread George Bosilca via users
Patrick,

I'm afraid there is no simple way to check this. The main reason is that
OMPI uses handles for MPI objects, and these handles are not tracked by the
library; they are supposed to be provided by the user with each call. In
your case, as you have already called MPI_Type_free on the datatype, you
cannot produce a valid handle.

There might be a trick. If the datatype is manipulated with any Fortran MPI
function, then we convert the handle (which in fact is a pointer) to an
index into a pointer array structure. Thus, the index remains in use and can
be converted back into a valid datatype pointer until OMPI completely
releases the datatype. Look into the ompi_datatype_f_to_c_table to see the
datatypes that exist and get their pointers, and then use these pointers as
arguments to ompi_datatype_dump() to see whether any of these existing
datatypes are the ones you defined.
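
As a quick illustration (relying on this Open MPI implementation detail and
not on anything guaranteed by the standard), one can also probe that table
through the standard MPI_Type_f2c(), which in Open MPI is a lookup into
ompi_datatype_f_to_c_table and returns NULL for slots that have been fully
released:

    #include <mpi.h>

    /* Count the Fortran-index slots that still hold a datatype (the
     * predefined datatypes are included in the count).  This is an
     * Open MPI-specific diagnostic hack: other MPI implementations do
     * not necessarily return NULL for empty or out-of-range slots. */
    int count_live_datatypes(int max_index)
    {
        int n = 0;
        for (int i = 0; i < max_index; i++)
            if (MPI_Type_f2c((MPI_Fint)i) != NULL)
                n++;
        return n;
    }

If that count stays roughly constant across time steps, the subarray types
really are released; if it keeps growing, they are not.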

George.




On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> I'm trying to solve a memory leak that appeared with my new implementation
> of communications based on MPI_Alltoallw and MPI_Type_create_subarray calls.
> Arrays of subarray types are created/destroyed at each time step and used
> for communications.
>
> On my laptop the code runs fine (running for 15000 temporal iterations on
> 32 processes with oversubscription), but on our cluster the memory used by
> the code increases until the OOM killer stops the job. On the cluster we
> use IB QDR for communications.
>
> Same Gcc/Gfortran 7.3 (built from sources), same sources of OpenMPI (3.1
> or 4.0.5 tested), same sources of the fortran code on the laptop and on the
> cluster.
>
> Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not show the
> problem (resident memory does not increase, and we ran 10 temporal
> iterations).
>
> The MPI_Type_free manual says that it "*Marks the datatype object associated
> with datatype for deallocation*". But how can I check that the deallocation
> has really been done?
>
> Thanks for any suggestions.
>
> Patrick
>


[OMPI users] MPI_type_free question

2020-12-03 Thread Patrick Bégou via users
Hi,

I'm trying to solve a memory leak that appeared with my new implementation
of communications based on MPI_Alltoallw and MPI_Type_create_subarray
calls. Arrays of subarray types are created/destroyed at each time step
and used for communications.

On my laptop the code runs fine (running for 15000 temporal iterations
on 32 processes with oversubscription), but on our cluster the memory used
by the code increases until the OOM killer stops the job. On the cluster
we use IB QDR for communications.

Same Gcc/Gfortran 7.3 (built from sources), same sources of OpenMPI (3.1
or 4.0.5 tested), same sources of the fortran code on the laptop and on
the cluster.

Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not show the
problem (resident memory does not increase, and we ran 10 temporal
iterations).

The MPI_Type_free manual says that it "/Marks the datatype object associated
with datatype for deallocation/". But how can I check that the deallocation
has really been done?

Thanks for any suggestions.

Patrick



Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Patrick Bégou via users
Thanks for all these suggestions. I'll try to create a small test
reproducing this behavior and try the different parameters.
I do not use MPI I/O directly but parallel HDF5, which relies on MPI I/O.
NFS is the easiest way to share storage between nodes on a small cluster;
BeeGFS or Lustre would require a bigger (additional) infrastructure.
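
For the small test, something like this minimal collective write through
parallel HDF5 (a sketch only: the file name, dataset size and flat 1-D
layout are placeholders), built with the parallel HDF5 wrapper (h5pcc) and
run with and without "--mca io ^ompio", should be enough to compare ompio
and romio on the NFS mount:

    #include <mpi.h>
    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const hsize_t chunk = 1 << 20;               /* 1 Mi doubles per rank */
        double *buf = malloc(chunk * sizeof(double));
        for (hsize_t i = 0; i < chunk; i++) buf[i] = (double)rank;

        /* Open the file collectively through the MPI-IO driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("test_pario.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, fapl);

        /* One contiguous slab per rank in a 1-D dataset. */
        hsize_t dims = chunk * (hsize_t)nprocs;
        hid_t filespace = H5Screate_simple(1, &dims, NULL);
        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hsize_t start = chunk * (hsize_t)rank;
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL,
                            &chunk, NULL);
        hid_t memspace = H5Screate_simple(1, &chunk, NULL);

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
        H5Dclose(dset);
        H5Fclose(file);          /* include the flush at close in the timing */
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("collective write of %.0f MiB took %.3f s\n",
                   (double)(dims * sizeof(double)) / (1024.0 * 1024.0),
                   t1 - t0);

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace); H5Pclose(fapl);
        free(buf);
        MPI_Finalize();
        return 0;
    }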

Patrick

On 03/12/2020 at 15:38, Gabriel, Edgar via users wrote:
> the reasons for potential performance issues on NFS are very different from
> Lustre. Basically, depending on your use case and the NFS configuration, you
> have to enforce a different locking policy to ensure correct output files.
> The default value chosen for ompio is the most conservative setting, since
> this was the only setting that we found that would result in a correct
> output file for all of our tests. You can change the setting to see whether
> other options work for you.
>
> The parameter that you need to work with is fs_ufs_lock_algorithm. Setting
> it to 1 completely disables locking (and most likely leads to the best
> performance); setting it to 3 is a middle ground (locking specific ranges),
> similar to what ROMIO does. So e.g.
>
> mpiexec -n 16 --mca fs_ufs_lock_algorithm 1 ./mytests
>
> That being said, if you google NFS + MPI I/O, you will find a ton of
> documents and reasons for potential problems, so using MPI I/O on top of
> NFS (whether OMPIO or ROMIO) is always at your own risk.
> Thanks
>
> Edgar
>
> -Original Message-
> From: users  On Behalf Of Gilles 
> Gouaillardet via users
> Sent: Thursday, December 3, 2020 4:46 AM
> To: Open MPI Users 
> Cc: Gilles Gouaillardet 
> Subject: Re: [OMPI users] Parallel HDF5 low performance
>
> Patrick,
>
> glad to hear you will upgrade Open MPI thanks to this workaround!
>
> ompio has known performance issues on Lustre (this is why ROMIO is still the
> default on this filesystem), but I do not remember such performance issues
> having been reported on an NFS filesystem.
>
> Sharing a reproducer will be very much appreciated in order to improve ompio
>
> Cheers,
>
> Gilles
>
> On Thu, Dec 3, 2020 at 6:05 PM Patrick Bégou via users 
>  wrote:
>> Thanks Gilles,
>>
>> this is the solution.
>> I will set OMPI_MCA_io=^ompio automatically when loading the parallel
>> hdf5 module on the cluster.
>>
>> I was tracking this problem for several weeks but not looking in the 
>> right direction (testing NFS server I/O, network bandwidth.)
>>
>> I think we will now move definitively to modern OpenMPI implementations.
>>
>> Patrick
>>
>> On 03/12/2020 at 09:06, Gilles Gouaillardet via users wrote:
>>> Patrick,
>>>
>>>
>>> In recent Open MPI releases, the default component for MPI-IO is 
>>> ompio (and no more romio)
>>>
>>> unless the file is on a Lustre filesystem.
>>>
>>>
>>> You can force romio with
>>>
>>> mpirun --mca io ^ompio ...
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>> On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
 Hi,

 I'm using an old (but required by the codes) version of hdf5 
 (1.8.12) in parallel mode in 2 fortran applications. It relies on 
 MPI/IO. The storage is NFS mounted on the nodes of a small cluster.

 With OpenMPI 1.7 it runs fine, but with modern OpenMPI 3.1 or 4.0.5
 the I/Os are 10x to 100x slower. Are there fundamental changes in
 MPI I/O in these new releases of OpenMPI, and is there a way to get
 back to the previous I/O performance with this parallel HDF5 release?

 Thanks for your advice.

 Patrick




Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Gabriel, Edgar via users
The reasons for potential performance issues on NFS are very different from
Lustre. Basically, depending on your use case and the NFS configuration, you
have to enforce a different locking policy to ensure correct output files. The
default value chosen for ompio is the most conservative setting, since this
was the only setting that we found that would result in a correct output
file for all of our tests. You can change the setting to see whether other
options work for you.

The parameter that you need to work with is fs_ufs_lock_algorithm. Setting it
to 1 completely disables locking (and most likely leads to the best
performance); setting it to 3 is a middle ground (locking specific ranges),
similar to what ROMIO does. So e.g.

mpiexec -n 16 --mca fs_ufs_lock_algorithm 1 ./mytests
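
The same parameter can also be exported through the environment with the
usual OMPI_MCA_ prefix (same effect as the command-line option above), e.g.

export OMPI_MCA_fs_ufs_lock_algorithm=1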

That being said, if you google NFS + MPI I/O, you will find a ton of documents
and reasons for potential problems, so using MPI I/O on top of NFS (whether
OMPIO or ROMIO) is always at your own risk.
Thanks

Edgar

-Original Message-
From: users  On Behalf Of Gilles Gouaillardet 
via users
Sent: Thursday, December 3, 2020 4:46 AM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] Parallel HDF5 low performance

Patrick,

glad to hear you will upgrade Open MPI thanks to this workaround!

ompio has known performance issues on Lustre (this is why ROMIO is still the
default on this filesystem), but I do not remember such performance issues
having been reported on an NFS filesystem.

Sharing a reproducer will be very much appreciated in order to improve ompio

Cheers,

Gilles

On Thu, Dec 3, 2020 at 6:05 PM Patrick Bégou via users 
 wrote:
>
> Thanks Gilles,
>
> this is the solution.
> I will set OMPI_MCA_io=^ompio automatically when loading the parallel
> hdf5 module on the cluster.
>
> I was tracking this problem for several weeks but not looking in the 
> right direction (testing NFS server I/O, network bandwidth.)
>
> I think we will now move definitively to modern OpenMPI implementations.
>
> Patrick
>
> On 03/12/2020 at 09:06, Gilles Gouaillardet via users wrote:
> > Patrick,
> >
> >
> > In recent Open MPI releases, the default component for MPI-IO is 
> > ompio (and no more romio)
> >
> > unless the file is on a Lustre filesystem.
> >
> >
> > You can force romio with
> >
> > mpirun --mca io ^ompio ...
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> > On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
> >> Hi,
> >>
> >> I'm using an old (but required by the codes) version of hdf5 
> >> (1.8.12) in parallel mode in 2 fortran applications. It relies on 
> >> MPI/IO. The storage is NFS mounted on the nodes of a small cluster.
> >>
> >> With OpenMPI 1.7 it runs fine, but with modern OpenMPI 3.1 or 4.0.5
> >> the I/Os are 10x to 100x slower. Are there fundamental changes in
> >> MPI I/O in these new releases of OpenMPI, and is there a way to get
> >> back to the previous I/O performance with this parallel HDF5 release?
> >>
> >> Thanks for your advice.
> >>
> >> Patrick
> >>
>


Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Gilles Gouaillardet via users
Patrick,

glad to hear you will upgrade Open MPI thanks to this workaround!

ompio has known performance issues on Lustre (this is why ROMIO is
still the default on this filesystem), but I do not remember such
performance issues having been reported on an NFS filesystem.

Sharing a reproducer will be very much appreciated in order to improve ompio

Cheers,

Gilles

On Thu, Dec 3, 2020 at 6:05 PM Patrick Bégou via users
 wrote:
>
> Thanks Gilles,
>
> this is the solution.
> I will set OMPI_MCA_io=^ompio automatically when loading the parallel
> hdf5 module on the cluster.
>
> I was tracking this problem for several weeks but not looking in the
> right direction (testing NFS server I/O, network bandwidth.)
>
> I think we will now move definitively to modern OpenMPI implementations.
>
> Patrick
>
> On 03/12/2020 at 09:06, Gilles Gouaillardet via users wrote:
> > Patrick,
> >
> >
> > In recent Open MPI releases, the default component for MPI-IO is ompio
> > (and no more romio)
> >
> > unless the file is on a Lustre filesystem.
> >
> >
> > You can force romio with
> >
> > mpirun --mca io ^ompio ...
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> > On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
> >> Hi,
> >>
> >> I'm using an old (but required by the codes) version of hdf5 (1.8.12) in
> >> parallel mode in 2 fortran applications. It relies on MPI/IO. The
> >> storage is NFS mounted on the nodes of a small cluster.
> >>
> >> With OpenMPI 1.7 it runs fine, but with modern OpenMPI 3.1 or 4.0.5 the
> >> I/Os are 10x to 100x slower. Are there fundamental changes in MPI I/O
> >> in these new releases of OpenMPI, and is there a way to get back to the
> >> previous I/O performance with this parallel HDF5 release?
> >>
> >> Thanks for your advice.
> >>
> >> Patrick
> >>
>


Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Patrick Bégou via users
Thanks Gilles,

this is the solution.
I will set OMPI_MCA_io=^ompio automatically when loading the parallel
hdf5 module on the cluster.
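
(i.e. the module will simply do the equivalent of

    export OMPI_MCA_io=^ompio

in the user's environment, which is the same as passing "--mca io ^ompio"
on the mpirun command line.)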

I had been tracking this problem for several weeks but was not looking in
the right direction (testing NFS server I/O, network bandwidth, ...).

I think we will now move definitively to modern OpenMPI implementations.

Patrick

On 03/12/2020 at 09:06, Gilles Gouaillardet via users wrote:
> Patrick,
>
>
> In recent Open MPI releases, the default component for MPI-IO is ompio
> (and no more romio)
>
> unless the file is on a Lustre filesystem.
>
>
> You can force romio with
>
> mpirun --mca io ^ompio ...
>
>
> Cheers,
>
>
> Gilles
>
> On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
>> Hi,
>>
>> I'm using an old (but required by the codes) version of hdf5 (1.8.12) in
>> parallel mode in 2 fortran applications. It relies on MPI/IO. The
>> storage is NFS mounted on the nodes of a small cluster.
>>
>> With OpenMPI 1.7 it runs fine, but with modern OpenMPI 3.1 or 4.0.5 the
>> I/Os are 10x to 100x slower. Are there fundamental changes in MPI I/O
>> in these new releases of OpenMPI, and is there a way to get back to the
>> previous I/O performance with this parallel HDF5 release?
>>
>> Thanks for your advice.
>>
>> Patrick
>>



Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Gilles Gouaillardet via users

Patrick,


In recent Open MPI releases, the default component for MPI-IO is ompio
(and no longer romio)


unless the file is on a Lustre filesystem.


You can force romio with

mpirun --mca io ^ompio ...


Cheers,


Gilles

On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:

Hi,

I'm using an old (but required by the codes) version of hdf5 (1.8.12) in
parallel mode in 2 fortran applications. It relies on MPI/IO. The
storage is NFS mounted on the nodes of a small cluster.

With OpenMPI 1.7 it runs fine, but with modern OpenMPI 3.1 or 4.0.5 the
I/Os are 10x to 100x slower. Are there fundamental changes in MPI I/O
in these new releases of OpenMPI, and is there a way to get back to the
previous I/O performance with this parallel HDF5 release?

Thanks for your advice.

Patrick