Hi OpenMPI developers,

It looks difficult for me to track down this memory problem in the
OpenMPI 3.x and 4.x implementations... Should I open an issue about it?
Or is openib definitively a legacy component that will no longer evolve
(with its bugs left untracked)?

Thanks

Patrick



On 07/12/2020 at 10:15, Patrick Bégou via users wrote:
> Hi,
>
> I've written a small piece of code that reproduces the problem. It is
> based on my application, but reduced to 2D and using integer arrays for
> testing. The figure below shows the max RSS of the rank 0 process over
> 20000 iterations on 8 and 16 cores, with the openib and tcp drivers.
> The more processes I have, the larger the memory leak. I used the same
> binaries for all 4 runs, with OpenMPI 3.1 (same behavior with 4.0.5).
> The code is attached. I'll try to check datatype deallocation as soon
> as possible.
>
> Patrick
>
>
>
>
> On 04/12/2020 at 01:34, Gilles Gouaillardet via users wrote:
>> Patrick,
>>
>>
>> based on George's idea, a simpler check is to retrieve the Fortran
>> index via the (standard) MPI_Type_c2f() function after you create a
>> derived datatype.
>>
>>
>> If the index keeps growing forever even after you call
>> MPI_Type_free(), then this clearly indicates a leak.
>>
>> Unfortunately, this simple test cannot be used to definitively rule
>> out all memory leaks.
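>>
>> As a minimal sketch in C (the sizes and the iteration count below are
>> invented for illustration), the check could look like this:
>>
>>   #include <mpi.h>
>>   #include <stdio.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>       int sizes[2] = {64, 64}, subsizes[2] = {16, 16};
>>       int starts[2] = {0, 0};
>>       MPI_Init(&argc, &argv);
>>       for (int i = 0; i < 10; i++) {
>>           MPI_Datatype t;
>>           MPI_Type_create_subarray(2, sizes, subsizes, starts,
>>                                    MPI_ORDER_FORTRAN, MPI_INT, &t);
>>           MPI_Type_commit(&t);
>>           /* print the Fortran index of the freshly created datatype */
>>           printf("iteration %d: index %d\n", i, (int)MPI_Type_c2f(t));
>>           MPI_Type_free(&t);
>>       }
>>       MPI_Finalize();
>>       return 0;
>>   }
>>
>> If the printed index stabilizes after a few iterations, the handles
>> are being recycled; if it keeps increasing, the datatypes are never
>> fully released.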
>>
>>
>> Note you can also
>>
>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>
>> in order to force communications over TCP/IP and hence rule out any
>> memory leak that could be triggered by your fast interconnect.
>>
>>
>>
>> In any case, a reproducer will greatly help us debug this issue.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>> Patrick,
>>>
>>> I'm afraid there is no simple way to check this. The main reason is
>>> that OMPI uses handles for MPI objects, and these handles are not
>>> tracked by the library; they are supposed to be provided by the user
>>> for each call. In your case, as you have already called MPI_Type_free
>>> on the datatype, you cannot produce a valid handle.
>>>
>>> There might be a trick. If the datatype is manipulated with any
>>> Fortran MPI function, then we convert the handle (which is in fact a
>>> pointer) to an index into a pointer-array structure. That index
>>> remains in use, and can therefore be converted back into a valid
>>> datatype pointer, until OMPI completely releases the datatype. Look
>>> into the ompi_datatype_f_to_c_table table to see the datatypes that
>>> still exist and get their pointers, then pass those pointers to
>>> ompi_datatype_dump() to see whether any of the existing datatypes are
>>> the ones you defined.
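>>>
>>> As a rough sketch (this relies on Open MPI *internal* symbols, so it
>>> has to be compiled within the OMPI source tree; I am assuming here
>>> that the table is an opal_pointer_array_t, so check the sources of
>>> your version), the walk could look like:
>>>
>>>   #include "ompi/datatype/ompi_datatype.h"
>>>   #include "opal/class/opal_pointer_array.h"
>>>
>>>   /* assumed declaration; normally provided by the OMPI headers */
>>>   extern opal_pointer_array_t ompi_datatype_f_to_c_table;
>>>
>>>   static void dump_live_datatypes(void)
>>>   {
>>>       int n = opal_pointer_array_get_size(&ompi_datatype_f_to_c_table);
>>>       for (int i = 0; i < n; i++) {
>>>           ompi_datatype_t *dt = (ompi_datatype_t *)
>>>               opal_pointer_array_get_item(&ompi_datatype_f_to_c_table, i);
>>>           if (NULL != dt) {
>>>               /* print the description of a still-existing datatype */
>>>               ompi_datatype_dump(dt);
>>>           }
>>>       }
>>>   }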
>>>
>>> George.
>>>
>>>
>>>
>>>
>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>>     Hi,
>>>
>>>     I'm trying to track down a memory leak that appeared with my new
>>>     implementation of communications based on MPI_Alltoallw and
>>>     MPI_Type_create_subarray calls. Arrays of subarray types are
>>>     created/destroyed at each time step and used for the
>>>     communications.
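>>>
>>>     To make the pattern concrete, here is a hypothetical stripped-down
>>>     C version of one such exchange (all sizes and the decomposition
>>>     are invented; the real code is Fortran):
>>>
>>>       #include <mpi.h>
>>>       #include <stdlib.h>
>>>
>>>       int main(int argc, char **argv)
>>>       {
>>>           MPI_Init(&argc, &argv);
>>>           int np, rank;
>>>           MPI_Comm_size(MPI_COMM_WORLD, &np);
>>>           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>           int sizes[2] = {64, 64};        /* assume np divides 64 */
>>>           int *sbuf = calloc(64 * 64, sizeof(int));
>>>           int *rbuf = calloc(64 * 64, sizeof(int));
>>>           int *counts = malloc(np * sizeof(int));
>>>           int *displs = malloc(np * sizeof(int));
>>>           MPI_Datatype *types = malloc(np * sizeof(MPI_Datatype));
>>>
>>>           for (int step = 0; step < 1000; step++) {
>>>               for (int p = 0; p < np; p++) {
>>>                   int subsizes[2] = {64 / np, 64};      /* one slab */
>>>                   int starts[2]   = {p * (64 / np), 0}; /* per peer */
>>>                   MPI_Type_create_subarray(2, sizes, subsizes, starts,
>>>                                            MPI_ORDER_C, MPI_INT,
>>>                                            &types[p]);
>>>                   MPI_Type_commit(&types[p]);
>>>                   counts[p] = 1;
>>>                   displs[p] = 0;  /* offsets live in the datatypes */
>>>               }
>>>               MPI_Alltoallw(sbuf, counts, displs, types,
>>>                             rbuf, counts, displs, types,
>>>                             MPI_COMM_WORLD);
>>>               for (int p = 0; p < np; p++)
>>>                   MPI_Type_free(&types[p]);  /* freed every step */
>>>           }
>>>           MPI_Finalize();
>>>           return 0;
>>>       }
>>>
>>>     Every datatype is freed at the end of each step, yet the resident
>>>     memory keeps growing in my runs.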
>>>
>>>     On my laptop the code runs fine (15000 temporal iterations on 32
>>>     processes with oversubscription), but on our cluster the memory
>>>     used by the code increases until the OOM killer stops the job. On
>>>     the cluster we use QDR InfiniBand for communications.
>>>
>>>     Same GCC/gfortran 7.3 (built from sources), same OpenMPI sources
>>>     (3.1 and 4.0.5 tested), and same sources of the Fortran code on
>>>     the laptop and on the cluster.
>>>
>>>     Using GCC/gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
>>>     show the problem (resident memory does not increase, and we ran
>>>     100000 temporal iterations).
>>>
>>>     The MPI_Type_free man page says that it "marks the datatype
>>>     object associated with datatype for deallocation". But how can I
>>>     check that the deallocation is really done?
>>>
>>>     Thanks for any suggestions.
>>>
>>>     Patrick
>>>
>
