Hi George and Gilles,

Thanks George for your suggestion. Does it apply to OpenMPI versions 4.0.5 and 3.1? I will have a look at these tables today, maybe by writing a small piece of code that just creates and frees subarray datatypes.
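Something like this, perhaps (an untested sketch; the 3D sizes and the iteration count are arbitrary and would have to match the real domain decomposition):

program subarray_leak_check
   use mpi
   implicit none
   integer :: ierr, newtype, step
   integer, dimension(3) :: sizes, subsizes, starts

   call MPI_Init(ierr)

   sizes    = (/ 64, 64, 64 /)   ! global array shape (arbitrary values)
   subsizes = (/ 32, 32, 32 /)   ! local block shape  (arbitrary values)
   starts   = (/  0,  0,  0 /)

   ! Mimic the time loop: build and release a fresh subarray datatype
   ! at every step, as the real communication code does.
   do step = 1, 100000
      call MPI_Type_create_subarray(3, sizes, subsizes, starts,        &
                                    MPI_ORDER_FORTRAN,                 &
                                    MPI_DOUBLE_PRECISION, newtype, ierr)
      call MPI_Type_commit(newtype, ierr)
      ! ... the real code would use newtype in MPI_Alltoallw here ...
      call MPI_Type_free(newtype, ierr)
   end do

   call MPI_Finalize(ierr)
end program subarray_leak_check

Watching the resident memory of the processes while this runs on the cluster should show whether the create/commit/free cycle alone already leaks, or whether the MPI_Alltoallw calls over IB are needed to trigger it.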
Thanks Gilles for suggesting disabling the interconnect. It is a good, quick test and yes, *with "mpirun --mca pml ob1 --mca btl tcp,self" I have no memory leak*. So this explains the difference between my laptop and the cluster. Is the implementation of datatype management so different from 1.7.3?

A PhD student tells me he also has some trouble with this code on an Omni-Path based cluster. I will have to investigate that too, but I am not sure it is the same problem.

Patrick

On 12/4/2020 1:34 AM, Gilles Gouaillardet via users wrote:
> Patrick,
>
> based on George's idea, a simpler check is to retrieve the Fortran
> index via the (standard) MPI_Type_c2f() function
> after you create a derived datatype.
>
> If the index keeps growing forever even after you MPI_Type_free(),
> then this clearly indicates a leak.
> Unfortunately, this simple test cannot be used to definitively rule out
> any memory leak.
>
> Note you can also
> mpirun --mca pml ob1 --mca btl tcp,self ...
> in order to force communications over TCP/IP and hence rule out any
> memory leak that could be triggered by your fast interconnect.
>
> In any case, a reproducer will greatly help us debugging this issue.
>
> Cheers,
>
> Gilles
>
> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>> Patrick,
>>
>> I'm afraid there is no simple way to check this. The main reason
>> being that OMPI uses handles for MPI objects, and these handles are
>> not tracked by the library; they are supposed to be provided by the
>> user for each call. In your case, as you already called MPI_Type_free
>> on the datatype, you cannot produce a valid handle.
>>
>> There might be a trick. If the datatype is manipulated with any
>> Fortran MPI functions, then we convert the handle (which in fact is a
>> pointer) to an index into a pointer array structure. Thus, the index
>> will remain in use, and can therefore be used to convert back into a
>> valid datatype pointer, until OMPI completely releases the datatype.
>> Look into the ompi_datatype_f_to_c_table table to see the datatypes
>> that exist and get their pointers, and then use these pointers as
>> arguments to ompi_datatype_dump() to see if any of these existing
>> datatypes are the ones you define.
>>
>> George.
>>
>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>> <users@lists.open-mpi.org> wrote:
>>
>> Hi,
>>
>> I'm trying to solve a memory leak that appeared with my new
>> implementation of communications based on MPI_Alltoallw and
>> MPI_Type_create_subarray calls. Arrays of subarray types are
>> created/destroyed at each time step and used for communications.
>>
>> On my laptop the code runs fine (running for 15000 time iterations
>> on 32 processes with oversubscription), but on our cluster the
>> memory used by the code increases until the OOM killer stops the
>> job. On the cluster we use QDR InfiniBand for communications.
>>
>> Same GCC/gfortran 7.3 (built from source), same OpenMPI sources
>> (3.1 and 4.0.5 tested), same Fortran source code on the laptop and
>> on the cluster.
>>
>> Using GCC/gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
>> show the problem (resident memory does not increase and we ran
>> 100000 time iterations).
>>
>> The MPI_Type_free manual says that it /marks the datatype object
>> associated with datatype for deallocation/. But how can I check
>> that the deallocation is really done?
>>
>> Thanks for any suggestions.
>>
>> Patrick
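PS: if I understand George's explanation correctly, with "use mpi" the INTEGER handle returned to Fortran is already the index into that table, so Gilles' check can be done by simply printing the handle from time to time. In the reproducer sketch above, the time loop would become (untested, print interval arbitrary):

   do step = 1, 100000
      call MPI_Type_create_subarray(3, sizes, subsizes, starts,        &
                                    MPI_ORDER_FORTRAN,                 &
                                    MPI_DOUBLE_PRECISION, newtype, ierr)
      call MPI_Type_commit(newtype, ierr)
      ! Print the Fortran handle occasionally: if it keeps growing even
      ! though MPI_Type_free() is called below, the old datatypes are
      ! not really being released.
      if (mod(step, 1000) == 0) print *, 'step', step, ' handle =', newtype
      call MPI_Type_free(newtype, ierr)
   end do

If the printed value keeps increasing even though MPI_Type_free() is called at every step, that would confirm the leak on the datatype side.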