Hi OpenMPI developers,

It looks difficult for me to track this memory problem in the OpenMPI 3.x and 4.x implementations... Should I open an issue about this? Or is openib definitively a legacy component that will no longer evolve (so its bugs will go untracked)?
Thanks

Patrick

On 07/12/2020 at 10:15, Patrick Bégou via users wrote:
> Hi,
>
> I've written a small piece of code to show the problem. It is based on my
> application, but 2D and using integer arrays for testing.
> The figure below shows the max RSS size of the rank 0 process over 20000
> iterations on 8 and 16 cores, with the openib and tcp drivers.
> The more processes I have, the larger the memory leak. I use the same
> binaries for the 4 runs and OpenMPI 3.1 (same behavior with 4.0.5).
> The code is attached. I'll try to check type deallocation as soon
> as possible.
>
> Patrick
>
> On 04/12/2020 at 01:34, Gilles Gouaillardet via users wrote:
>> Patrick,
>>
>> based on George's idea, a simpler check is to retrieve the Fortran
>> index via the (standard) MPI_Type_c2f() function
>> after you create a derived datatype.
>>
>> If the index keeps growing forever even after you MPI_Type_free(),
>> then this clearly indicates a leak.
>> Unfortunately, this simple test cannot be used to definitely rule out
>> any memory leak.
>>
>> Note you can also
>>
>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>
>> in order to force communications over TCP/IP and hence rule out any
>> memory leak that could be triggered by your fast interconnect.
>>
>> In any case, a reproducer will greatly help us debug this issue.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>> Patrick,
>>>
>>> I'm afraid there is no simple way to check this. The main reason is
>>> that OMPI uses handles for MPI objects, and these handles are not
>>> tracked by the library; they are supposed to be provided by the
>>> user for each call. In your case, as you have already called
>>> MPI_Type_free on the datatype, you cannot produce a valid handle.
>>>
>>> There might be a trick. If the datatype is manipulated with any
>>> Fortran MPI functions, then we convert the handle (which in fact is
>>> a pointer) to an index into a pointer array structure. Thus, the
>>> index will remain used, and can therefore be used to convert back
>>> into a valid datatype pointer, until OMPI completely releases the
>>> datatype. Look into the ompi_datatype_f_to_c_table table to see the
>>> datatypes that exist and get their pointers, and then use these
>>> pointers as arguments to ompi_datatype_dump() to see if any of these
>>> existing datatypes are the ones you defined.
>>>
>>> George.
>>>
>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to solve a memory leak that has appeared since my new
>>> implementation of communications based on MPI_Alltoallw and
>>> MPI_Type_create_subarray calls. Arrays of subarray types are
>>> created/destroyed at each time step and used for communications.
>>>
>>> On my laptop the code runs fine (running 15000 temporal iterations
>>> on 32 processes with oversubscription), but on our cluster the
>>> memory used by the code increases until the OOM killer stops the
>>> job. On the cluster we use IB QDR for communications.
>>>
>>> Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>> OpenMPI (3.1 or 4.0.5 tested), same sources of the Fortran code on
>>> the laptop and on the cluster.
>>>
>>> Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
>>> show the problem (resident memory does not increase, and we ran
>>> 100000 temporal iterations).
>>>
>>> The MPI_Type_free man page says that it "/Marks the datatype object
>>> associated with datatype for deallocation/". But how can I check
>>> that the deallocation is really done?
>>>
>>> Thanks for any suggestions.
>>>
>>> Patrick
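For readers following the thread, here is a minimal C sketch (not the actual application code) of the per-time-step pattern described in the original question: build an array of subarray datatypes, use them in MPI_Alltoallw, then free them. The array sizes, the 2D decomposition and the helper name exchange_step are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

static void exchange_step(int *sendbuf, int *recvbuf, int nprocs, MPI_Comm comm)
{
    /* 2-D integer array split along its second dimension
       (assumes nprocs divides 64) */
    int sizes[2]    = {64, 64};
    int subsizes[2] = {64, 64 / nprocs};
    int starts[2]   = {0, 0};

    MPI_Datatype *types  = malloc(nprocs * sizeof(MPI_Datatype));
    int          *counts = malloc(nprocs * sizeof(int));
    int          *displs = calloc(nprocs, sizeof(int)); /* byte displacements stay 0:
                                                           the offsets are encoded in the types */

    for (int p = 0; p < nprocs; p++) {
        counts[p] = 1;
        starts[1] = p * (64 / nprocs);
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_FORTRAN, MPI_INT, &types[p]);
        MPI_Type_commit(&types[p]);
    }

    MPI_Alltoallw(sendbuf, counts, displs, types,
                  recvbuf, counts, displs, types, comm);

    /* MPI_Type_free() only *marks* the datatypes for deallocation; the
     * question in this thread is whether the memory is actually released. */
    for (int p = 0; p < nprocs; p++) {
        MPI_Type_free(&types[p]);
    }
    free(types);
    free(counts);
    free(displs);
}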
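A self-contained sketch of the check Gilles suggests: repeatedly create and free a derived datatype and print the Fortran index returned by the standard MPI_Type_c2f(). If the printed index keeps growing across iterations even though the type is freed each time, the freed datatypes are not being released. The sizes and iteration count below are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int sizes[2]    = {64, 64};
    int subsizes[2] = {32, 32};
    int starts[2]   = {0, 0};

    for (int step = 0; step < 10; step++) {
        MPI_Datatype sub;
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_FORTRAN, MPI_INT, &sub);
        MPI_Type_commit(&sub);

        /* The Fortran handle is an index into Open MPI's internal table;
         * if freed datatypes are really released, this index gets reused. */
        printf("step %d: Fortran index = %d\n", step, (int) MPI_Type_c2f(sub));

        MPI_Type_free(&sub);
    }

    MPI_Finalize();
    return 0;
}

Running this together with the "--mca pml ob1 --mca btl tcp,self" option mentioned above helps separate a datatype leak from anything triggered by the fast interconnect.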
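And a rough sketch of the trick George describes: walk ompi_datatype_f_to_c_table and dump the datatypes that are still alive. These are Open MPI internals, not a public API; the header paths and the opal_pointer_array accessors used here are assumptions, and the code must be built against the Open MPI source tree.

/* Open MPI internals -- not a public API; header names are assumptions. */
#include "ompi/datatype/ompi_datatype.h"
#include "opal/class/opal_pointer_array.h"

static void dump_live_datatypes(void)
{
    int size = opal_pointer_array_get_size(&ompi_datatype_f_to_c_table);
    for (int i = 0; i < size; i++) {
        ompi_datatype_t *dt = (ompi_datatype_t *)
            opal_pointer_array_get_item(&ompi_datatype_f_to_c_table, i);
        if (NULL != dt) {
            ompi_datatype_dump(dt); /* print a description of each datatype still registered */
        }
    }
}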