Re: [OMPI users] MPI_type_free question

2020-12-15 Thread Patrick Bégou via users
Issue #8290 reported.
Thanks all for your help and the workaround provided.

Patrick

Le 14/12/2020 à 17:40, Jeff Squyres (jsquyres) a écrit :
> Yes, opening an issue would be great -- thanks!
>
>
>> On Dec 14, 2020, at 11:32 AM, Patrick Bégou via users
>> <users@lists.open-mpi.org> wrote:
>>
>> OK, Thanks Gilles.
>> Does it still require that I open an issue for tracking ?
>>
>> Patrick
>>
>> Le 14/12/2020 à 14:56, Gilles Gouaillardet via users a écrit :
>>> Hi Patrick,
>>>
>>> Glad to hear you are now able to move forward.
>>>
>>> Please keep in mind this is not a fix but a temporary workaround.
>>> At first glance, I did not spot any issue in the current code.
>>> It turned out that the memory leak disappeared when doing things
>>> differently
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Mon, Dec 14, 2020 at 7:11 PM Patrick Bégou via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> Hi Gilles,
>>>
>>> you catch the bug! With this patch, on a single node, the memory
>>> leak disappear. The cluster is actualy overloaded, as soon as
>>> possible I will launch a multinode test.
>>> Below the memory used by rank 0 before (blue) and after (red)
>>> the patch.
>>>
>>> Thanks
>>>
>>> Patrick
>>>
>>> 
>>>
>>> Le 10/12/2020 à 10:15, Gilles Gouaillardet via users a écrit :
>>>> Patrick,
>>>>
>>>>
>>>> First, thank you very much for sharing the reproducer.
>>>>
>>>>
>>>> Yes, please open a github issue so we can track this.
>>>>
>>>>
>>>> I cannot fully understand where the leak is coming from, but so
>>>>     far
>>>>
>>>>  - the code fails on master built with --enable-debug (the data
>>>> engine reports an error) but not with the v3.1.x branch
>>>>
>>>>   (this suggests there could be an error in the latest Open MPI
>>>> ... or in the code)
>>>>
>>>>  - the attached patch seems to have a positive effect, can you
>>>> please give it a try?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>>
>>>>
>>>>
>>>> On 12/7/2020 6:15 PM, Patrick Bégou via users wrote:
>>>>> Hi,
>>>>>
>>>>> I've written a small piece of code to show the problem. Based
>>>>> on my application but 2D and using integers arrays for testing.
>>>>> The  figure below shows the max RSS size of rank 0 process on
>>>>> 2 iterations on 8 and 16 cores, with openib and tcp drivers.
>>>>> The more processes I have, the larger the memory leak.  I use
>>>>> the same binaries for the 4 runs and OpenMPI 3.1 (same
>>>>> behavior with 4.0.5).
>>>>> The code is in attachment. I'll try to check type deallocation
>>>>> as soon as possible.
>>>>>
>>>>> Patrick
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
>>>>>> Patrick,
>>>>>>
>>>>>>
>>>>>> based on George's idea, a simpler check is to retrieve the
>>>>>> Fortran index via the (standard) MPI_Type_c2f() function
>>>>>>
>>>>>> after you create a derived datatype.
>>>>>>
>>>>>>
>>>>>> If the index keeps growing forever even after you
>>>>>> MPI_Type_free(), then this clearly indicates a leak.
>>>>>>
>>>>>> Unfortunately, this simple test cannot be used to definitely
>>>>>> rule out any memory leak.
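
A minimal C sketch of the index check described above (the helper name
and the logging interval are illustrative, not taken from any code in
this thread):

#include <mpi.h>
#include <stdio.h>

/* Call right after each MPI_Type_commit() in the time loop.  If the
 * printed Fortran index keeps growing across steps even though
 * MPI_Type_free() is called, datatype handles are accumulating. */
static void log_type_index(MPI_Datatype newtype, int step)
{
    MPI_Fint f_index = MPI_Type_c2f(newtype);
    if (step % 100 == 0) {
        printf("step %d: newest datatype has Fortran index %d\n",
               step, (int)f_index);
    }
}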
>>>>>>
>>>>>>
>>>>>> Note you can also
>>>>>>
>>>>>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>>>>>
>>>>>> in order to force communications over TCP/IP and hence rule
>>>>>> out any memory leak that could be triggered by your fast
>>>>>> interconnect.
>>>>

Re: [OMPI users] MPI_type_free question

2020-12-14 Thread Patrick Bégou via users
OK, thanks Gilles.
Does it still require that I open an issue for tracking?

Patrick

Le 14/12/2020 à 14:56, Gilles Gouaillardet via users a écrit :
> Hi Patrick,
>
> Glad to hear you are now able to move forward.
>
> Please keep in mind this is not a fix but a temporary workaround.
> At first glance, I did not spot any issue in the current code.
> It turned out that the memory leak disappeared when doing things
> differently
>
> Cheers,
>
> Gilles
>
> On Mon, Dec 14, 2020 at 7:11 PM Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
> Hi Gilles,
>
> you catch the bug! With this patch, on a single node, the memory
> leak disappear. The cluster is actualy overloaded, as soon as
> possible I will launch a multinode test.
> Below the memory used by rank 0 before (blue) and after (red) the
> patch.
>
> Thanks
>
> Patrick
>
>
> Le 10/12/2020 à 10:15, Gilles Gouaillardet via users a écrit :
>> Patrick,
>>
>>
>> First, thank you very much for sharing the reproducer.
>>
>>
>> Yes, please open a github issue so we can track this.
>>
>>
>> I cannot fully understand where the leak is coming from, but so far
>>
>>  - the code fails on master built with --enable-debug (the data
>> engine reports an error) but not with the v3.1.x branch
>>
>>   (this suggests there could be an error in the latest Open MPI
>> ... or in the code)
>>
>>  - the attached patch seems to have a positive effect, can you
>> please give it a try?
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 12/7/2020 6:15 PM, Patrick Bégou via users wrote:
>>> Hi,
>>>
>>> I've written a small piece of code to show the problem. Based on
>>> my application but 2D and using integers arrays for testing.
>>> The  figure below shows the max RSS size of rank 0 process on
>>> 2 iterations on 8 and 16 cores, with openib and tcp drivers.
>>> The more processes I have, the larger the memory leak.  I use
>>> the same binaries for the 4 runs and OpenMPI 3.1 (same behavior
>>> with 4.0.5).
>>> The code is in attachment. I'll try to check type deallocation
>>> as soon as possible.
>>>
>>> Patrick
>>>
>>>
>>>
>>>
>>> Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
>>>> Patrick,
>>>>
>>>>
>>>> based on George's idea, a simpler check is to retrieve the
>>>> Fortran index via the (standard) MPI_Type_c2f() function
>>>>
>>>> after you create a derived datatype.
>>>>
>>>>
>>>> If the index keeps growing forever even after you
>>>> MPI_Type_free(), then this clearly indicates a leak.
>>>>
>>>> Unfortunately, this simple test cannot be used to definitely
>>>> rule out any memory leak.
>>>>
>>>>
>>>> Note you can also
>>>>
>>>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>>>
>>>> in order to force communications over TCP/IP and hence rule out
>>>> any memory leak that could be triggered by your fast interconnect.
>>>>
>>>>
>>>>
>>>> In any case, a reproducer will greatly help us debugging this
>>>> issue.
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>>
>>>>
>>>>
>>>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>>>> Patrick,
>>>>>
>>>>> I'm afraid there is no simple way to check this. The main
>>>>> reason being that OMPI use handles for MPI objects, and these
>>>>> handles are not tracked by the library, they are supposed to
>>>>> be provided by the user for each call. In your case, as you
>>>>> already called MPI_Type_free on the datatype, you cannot
>>>>> produce a valid handle.
>>>>>
>>>>> There might be a trick. If the datatype is manipulated with
>>>>> any Fortran MPI functions, then we convert the handle (which
>>>>> in fact is a pointer) to an index into a pointer array
>>>>> structure. Thus, the index will remain used, and 

Re: [OMPI users] MPI_type_free question

2020-12-14 Thread Patrick Bégou via users
Hi Gilles,

you caught the bug! With this patch, on a single node, the memory leak
disappears. The cluster is currently overloaded; as soon as possible I
will launch a multinode test.
Below is the memory used by rank 0 before (blue) and after (red) the patch.

Thanks

Patrick


Le 10/12/2020 à 10:15, Gilles Gouaillardet via users a écrit :
> Patrick,
>
>
> First, thank you very much for sharing the reproducer.
>
>
> Yes, please open a github issue so we can track this.
>
>
> I cannot fully understand where the leak is coming from, but so far
>
>  - the code fails on master built with --enable-debug (the data engine
> reports an error) but not with the v3.1.x branch
>
>   (this suggests there could be an error in the latest Open MPI ... or
> in the code)
>
>  - the attached patch seems to have a positive effect, can you please
> give it a try?
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/7/2020 6:15 PM, Patrick Bégou via users wrote:
>> Hi,
>>
>> I've written a small piece of code to show the problem. Based on my
>> application but 2D and using integers arrays for testing.
>> The  figure below shows the max RSS size of rank 0 process on 2
>> iterations on 8 and 16 cores, with openib and tcp drivers.
>> The more processes I have, the larger the memory leak.  I use the
>> same binaries for the 4 runs and OpenMPI 3.1 (same behavior with 4.0.5).
>> The code is in attachment. I'll try to check type deallocation as
>> soon as possible.
>>
>> Patrick
>>
>>
>>
>>
>> Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
>>> Patrick,
>>>
>>>
>>> based on George's idea, a simpler check is to retrieve the Fortran
>>> index via the (standard) MPI_Type_c2f() function
>>>
>>> after you create a derived datatype.
>>>
>>>
>>> If the index keeps growing forever even after you MPI_Type_free(),
>>> then this clearly indicates a leak.
>>>
>>> Unfortunately, this simple test cannot be used to definitely rule
>>> out any memory leak.
>>>
>>>
>>> Note you can also
>>>
>>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>>
>>> in order to force communications over TCP/IP and hence rule out any
>>> memory leak that could be triggered by your fast interconnect.
>>>
>>>
>>>
>>> In any case, a reproducer will greatly help us debugging this issue.
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>>
>>>
>>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>>> Patrick,
>>>>
>>>> I'm afraid there is no simple way to check this. The main reason
>>>> being that OMPI use handles for MPI objects, and these handles are
>>>> not tracked by the library, they are supposed to be provided by the
>>>> user for each call. In your case, as you already called
>>>> MPI_Type_free on the datatype, you cannot produce a valid handle.
>>>>
>>>> There might be a trick. If the datatype is manipulated with any
>>>> Fortran MPI functions, then we convert the handle (which in fact is
>>>> a pointer) to an index into a pointer array structure. Thus, the
>>>> index will remain used, and can therefore be used to convert back
>>>> into a valid datatype pointer, until OMPI completely releases the
>>>> datatype. Look into the ompi_datatype_f_to_c_table table to see the
>>>> datatypes that exist and get their pointers, and then use these
>>>> pointers as arguments to ompi_datatype_dump() to see if any of
>>>> these existing datatypes are the ones you define.
>>>>
>>>> George.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>>> <users@lists.open-mpi.org> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     I'm trying to solve a memory leak since my new implementation of
>>>>     communications based on MPI_AllToAllW and MPI_type_Create_SubArray
>>>>     calls.  Arrays of SubArray types are created/destroyed at each
>>>>     time step and used for communications.
>>>>
>>>>     On my laptop the code runs fine (running for 15000 temporal
>>>>     itérations on 32 processes with oversubscription) but on our
>>>>     cluster memory used by the code increase until the OOMkiller stop
>>>>     the job. On the cluster we use IB QDR for communications.
>>>>
>>>>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>>>     OpenMPI (3.1 or 4.0.5 tested), same sources of the fortran code on
>>>>     the laptop and on the cluster.
>>>>
>>>>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster do not
>>>>     show the problem (resident memory do not increase and we ran
>>>>     10 temporal iterations)
>>>>
>>>>     MPI_type_free manual says that it "/Marks the datatype object
>>>>     associated with datatype for deallocation/". But  how can I check
>>>>     that the deallocation is really done ?
>>>>
>>>>     Thanks for ant suggestions.
>>>>
>>>>     Patrick
>>>>
>>



Re: [OMPI users] MPI_type_free question

2020-12-10 Thread Patrick Bégou via users
Hi OpenMPI developers,

it looks difficult for me to track this memory problem in the OpenMPI 3.x
and 4.x implementations. Should I open an issue about this?
Or is openib definitively an old strategy that will no longer evolve (and
whose bugs go untracked)?

Thanks

Patrick



Le 07/12/2020 à 10:15, Patrick Bégou via users a écrit :
> Hi,
>
> I've written a small piece of code to show the problem. Based on my
> application but 2D and using integers arrays for testing.
> The  figure below shows the max RSS size of rank 0 process on 2
> iterations on 8 and 16 cores, with openib and tcp drivers.
> The more processes I have, the larger the memory leak.  I use the same
> binaries for the 4 runs and OpenMPI 3.1 (same behavior with 4.0.5).
> The code is in attachment. I'll try to check type deallocation as soon
> as possible.
>
> Patrick
>
>
>
>
> Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
>> Patrick,
>>
>>
>> based on George's idea, a simpler check is to retrieve the Fortran
>> index via the (standard) MPI_Type_c2f() function
>>
>> after you create a derived datatype.
>>
>>
>> If the index keeps growing forever even after you MPI_Type_free(),
>> then this clearly indicates a leak.
>>
>> Unfortunately, this simple test cannot be used to definitely rule out
>> any memory leak.
>>
>>
>> Note you can also
>>
>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>
>> in order to force communications over TCP/IP and hence rule out any
>> memory leak that could be triggered by your fast interconnect.
>>
>>
>>
>> In any case, a reproducer will greatly help us debugging this issue.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>> Patrick,
>>>
>>> I'm afraid there is no simple way to check this. The main reason
>>> being that OMPI use handles for MPI objects, and these handles are
>>> not tracked by the library, they are supposed to be provided by the
>>> user for each call. In your case, as you already called
>>> MPI_Type_free on the datatype, you cannot produce a valid handle.
>>>
>>> There might be a trick. If the datatype is manipulated with any
>>> Fortran MPI functions, then we convert the handle (which in fact is
>>> a pointer) to an index into a pointer array structure. Thus, the
>>> index will remain used, and can therefore be used to convert back
>>> into a valid datatype pointer, until OMPI completely releases the
>>> datatype. Look into the ompi_datatype_f_to_c_table table to see the
>>> datatypes that exist and get their pointers, and then use these
>>> pointers as arguments to ompi_datatype_dump() to see if any of these
>>> existing datatypes are the ones you define.
>>>
>>> George.
>>>
>>>
>>>
>>>
>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>>     Hi,
>>>
>>>     I'm trying to solve a memory leak since my new implementation of
>>>     communications based on MPI_AllToAllW and MPI_type_Create_SubArray
>>>     calls.  Arrays of SubArray types are created/destroyed at each
>>>     time step and used for communications.
>>>
>>>     On my laptop the code runs fine (running for 15000 temporal
>>>     itérations on 32 processes with oversubscription) but on our
>>>     cluster memory used by the code increase until the OOMkiller stop
>>>     the job. On the cluster we use IB QDR for communications.
>>>
>>>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>>     OpenMPI (3.1 or 4.0.5 tested), same sources of the fortran code on
>>>     the laptop and on the cluster.
>>>
>>>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster do not
>>>     show the problem (resident memory do not increase and we ran
>>>     10 temporal iterations)
>>>
>>>     MPI_type_free manual says that it "/Marks the datatype object
>>>     associated with datatype for deallocation/". But  how can I check
>>>     that the deallocation is really done ?
>>>
>>>     Thanks for ant suggestions.
>>>
>>>     Patrick
>>>
>



Re: [OMPI users] MPI_type_free question

2020-12-07 Thread Patrick Bégou via users
Hi George,

I've implemented a call to MPI_Type_f2c using Fortran C_BINDING and it
works. Data types are always reported as deallocated (I've checked the
reverse by commenting out the calls to MPI_Type_free(...) to be sure that
it reports "Not deallocated" in my code in that case).

Then I ran the code with the tcp and openib drivers, but keeping the
deallocation commented out, to see how the memory consumption evolves:

The global slope of the curves is quite similar for tcp and openib over
1000 iterations, even if they look different. So it really looks like a
subarray type deallocation problem, but deeper in the code I think.

Patrick




Le 04/12/2020 à 19:20, George Bosilca a écrit :
> On Fri, Dec 4, 2020 at 2:33 AM Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
> Hi George and Gilles,
>
> Thanks George for your suggestion. Is it valuable for 4.05 and 3.1
> OpenMPI Versions ? I will have a look today at these tables. May
> be writing a small piece of code juste creating and freeing
> subarray datatype.
>
>
> Patrick,
>
> Using Gilles' suggestion to go through the type_f2c function when
> listing the datatypes should give you a portable datatype iterator
> across all versions of OMPI. The call to dump a datatype content,
> ompi_datatype_dump, has been there for a very long time, so the
> combination of the two should work everywhere.
>
> Thinking a little more about this, you don't necessarily have to dump
> the content of the datatype, you only need to check if they are
> different from MPI_DATATYPE_NULL. Thus, you can have a solution using
> only the MPI API.
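
Following that remark, a check using only the MPI API can scan a range of
Fortran indices and count how many still map to live handles. A sketch
(the upper bound is arbitrary, and the predefined datatypes contribute a
constant baseline -- only the growth matters):

#include <mpi.h>

/* Count the Fortran datatype indices below 'limit' that still map to a
 * live (non-NULL) handle.  A count that keeps growing across time steps,
 * despite MPI_Type_free() being called, points to leaked handles. */
static int count_live_datatypes(int limit)
{
    int live = 0;
    for (int i = 0; i < limit; i++) {
        if (MPI_Type_f2c((MPI_Fint)i) != MPI_DATATYPE_NULL) {
            live++;
        }
    }
    return live;
}

Printing this on rank 0 every few hundred iterations gives the same
information as dumping the table, without touching Open MPI internals.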
>
>   George.
>  
>
>
> Thanks Gilles for suggesting disabling the interconnect. it is a
> good fast test and yes, *with "mpirun --mca pml ob1 --mca btl
> tcp,self" I have no memory leak*. So this explain the differences
> between my laptop and the cluster.
> The implementation of type management is so different from 1.7.3  ?
>
> A PhD student tells me he has also some trouble with this code on
> a cluster Omnipath based. I will have to investigate too but not
> sure it is the same problem.
>
> Patrick
>
> Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
>> Patrick,
>>
>>
>> based on George's idea, a simpler check is to retrieve the
>> Fortran index via the (standard) MPI_Type_c2f() function
>>
>> after you create a derived datatype.
>>
>>
>> If the index keeps growing forever even after you
>> MPI_Type_free(), then this clearly indicates a leak.
>>
>> Unfortunately, this simple test cannot be used to definitely rule
>> out any memory leak.
>>
>>
>> Note you can also
>>
>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>
>> in order to force communications over TCP/IP and hence rule out
>> any memory leak that could be triggered by your fast interconnect.
>>
>>
>>
>> In any case, a reproducer will greatly help us debugging this issue.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>> Patrick,
>>>
>>> I'm afraid there is no simple way to check this. The main reason
>>> being that OMPI use handles for MPI objects, and these handles
>>> are not tracked by the library, they are supposed to be provided
>>> by the user for each call. In your case, as you already called
>>> MPI_Type_free on the datatype, you cannot produce a valid handle.
>>>
>>> There might be a trick. If the datatype is manipulated with any
>>> Fortran MPI functions, then we convert the handle (which in fact
>>> is a pointer) to an index into a pointer array structure. Thus,
>>> the index will remain used, and can therefore be used to convert
>>> back into a valid datatype pointer, until OMPI completely
>>> releases the datatype. Look into the ompi_datatype_f_to_c_table
>>> table to see the datatypes that exist and get their pointers,
>>> and then use these pointers as arguments to ompi_datatype_dump()
>>> to see if any of these existing datatypes are the ones you define.
>>>
>>> George.
>>>
>>>
>>>
>>>
>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>> <users@lists.open-mpi.org> wrote:

Re: [OMPI users] MPI_type_free question

2020-12-07 Thread Patrick Bégou via users
Hi,

I've written a small piece of code to show the problem. It is based on my
application, but 2D and using integer arrays for testing.
The figure below shows the max RSS size of the rank 0 process on 2
iterations on 8 and 16 cores, with the openib and tcp drivers.
The more processes I have, the larger the memory leak. I use the same
binaries for the 4 runs and OpenMPI 3.1 (same behavior with 4.0.5).
The code is in attachment. I'll try to check type deallocation as soon
as possible.
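
For reference, the per-step numbers behind such an RSS curve can be
collected with a small helper like the following (a sketch; on Linux,
ru_maxrss is reported in kilobytes):

#include <stdio.h>
#include <sys/resource.h>

/* Print the maximum resident set size of the calling process (kB on
 * Linux), e.g. on rank 0 only, every N time steps. */
static void print_max_rss(int step)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("step %d: max RSS = %ld kB\n", step, ru.ru_maxrss);
    }
}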

Patrick




Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
> Patrick,
>
>
> based on George's idea, a simpler check is to retrieve the Fortran
> index via the (standard) MPI_Type_c2f() function
>
> after you create a derived datatype.
>
>
> If the index keeps growing forever even after you MPI_Type_free(),
> then this clearly indicates a leak.
>
> Unfortunately, this simple test cannot be used to definitely rule out
> any memory leak.
>
>
> Note you can also
>
> mpirun --mca pml ob1 --mca btl tcp,self ...
>
> in order to force communications over TCP/IP and hence rule out any
> memory leak that could be triggered by your fast interconnect.
>
>
>
> In any case, a reproducer will greatly help us debugging this issue.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>> Patrick,
>>
>> I'm afraid there is no simple way to check this. The main reason
>> being that OMPI use handles for MPI objects, and these handles are
>> not tracked by the library, they are supposed to be provided by the
>> user for each call. In your case, as you already called MPI_Type_free
>> on the datatype, you cannot produce a valid handle.
>>
>> There might be a trick. If the datatype is manipulated with any
>> Fortran MPI functions, then we convert the handle (which in fact is a
>> pointer) to an index into a pointer array structure. Thus, the index
>> will remain used, and can therefore be used to convert back into a
>> valid datatype pointer, until OMPI completely releases the datatype.
>> Look into the ompi_datatype_f_to_c_table table to see the datatypes
>> that exist and get their pointers, and then use these pointers as
>> arguments to ompi_datatype_dump() to see if any of these existing
>> datatypes are the ones you define.
>>
>> George.
>>
>>
>>
>>
>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>> <users@lists.open-mpi.org> wrote:
>>
>>     Hi,
>>
>>     I'm trying to solve a memory leak since my new implementation of
>>     communications based on MPI_AllToAllW and MPI_type_Create_SubArray
>>     calls.  Arrays of SubArray types are created/destroyed at each
>>     time step and used for communications.
>>
>>     On my laptop the code runs fine (running for 15000 temporal
>>     itérations on 32 processes with oversubscription) but on our
>>     cluster memory used by the code increase until the OOMkiller stop
>>     the job. On the cluster we use IB QDR for communications.
>>
>>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>     OpenMPI (3.1 or 4.0.5 tested), same sources of the fortran code on
>>     the laptop and on the cluster.
>>
>>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster do not
>>     show the problem (resident memory do not increase and we ran
>>     10 temporal iterations)
>>
>>     MPI_type_free manual says that it "/Marks the datatype object
>>     associated with datatype for deallocation/". But  how can I check
>>     that the deallocation is really done ?
>>
>>     Thanks for ant suggestions.
>>
>>     Patrick
>>



test_layout_array.tgz
Description: application/compressed-tar


Re: [OMPI users] MPI_type_free question

2020-12-04 Thread Patrick Bégou via users
Hi Gilles,

The interconnect is QLogic InfiniBand QDR. I was unable to compile the
latest UCX release on this CentOS6-based Rocks Cluster distribution when
deploying OpenMPI 4.0.5, but I did not search a lot: I'm currently
deploying a new cluster on CentOS8 and this old cluster will move to
CentOS8 too as soon as the new one is in production (using gcc10,
OpenMPI 4.0.5 and UCX).

Attached are the ompi_info output for OpenMPI 3.1 (the version in
production) and the requested dump, from the kareline cluster using 16
processes.

Patrick

Le 04/12/2020 à 08:57, Gilles Gouaillardet via users a écrit :
> Patrick,
>
>
> the test points to a leak in the way the interconnect component
> (pml/ucx ? pml/cm? mtl/psm2? btl/openib?) handles the datatype rather
> than the datatype engine itself.
>
>
> What interconnect is available on your cluster and which component(s)
> are used?
>
>
> mpirun --mca pml_base_verbose 10 --mca mtl_base_verbose 10 --mca
> btl_base_verbose 10 ...
>
> will point you to the component(s) used.
>
> The output is pretty verbose, so feel free to compress and post it if
> you cannot decipher it
>
>
> Cheers,
>
>
> Gilles
>
> On 12/4/2020 4:32 PM, Patrick Bégou via users wrote:
>> Hi George and Gilles,
>>
>> Thanks George for your suggestion. Is it valuable for 4.05 and 3.1
>> OpenMPI Versions ? I will have a look today at these tables. May be
>> writing a small piece of code juste creating and freeing subarray
>> datatype.
>>
>> Thanks Gilles for suggesting disabling the interconnect. it is a good
>> fast test and yes, *with "mpirun --mca pml ob1 --mca btl tcp,self" I
>> have no memory leak*. So this explain the differences between my
>> laptop and the cluster.
>> The implementation of type management is so different from 1.7.3  ?
>>
>> A PhD student tells me he has also some trouble with this code on a
>> cluster Omnipath based. I will have to investigate too but not sure
>> it is the same problem.
>>
>> Patrick
>>
>> Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
>>> Patrick,
>>>
>>>
>>> based on George's idea, a simpler check is to retrieve the Fortran
>>> index via the (standard) MPI_Type_c2f() function
>>>
>>> after you create a derived datatype.
>>>
>>>
>>> If the index keeps growing forever even after you MPI_Type_free(),
>>> then this clearly indicates a leak.
>>>
>>> Unfortunately, this simple test cannot be used to definitely rule
>>> out any memory leak.
>>>
>>>
>>> Note you can also
>>>
>>> mpirun --mca pml ob1 --mca btl tcp,self ...
>>>
>>> in order to force communications over TCP/IP and hence rule out any
>>> memory leak that could be triggered by your fast interconnect.
>>>
>>>
>>>
>>> In any case, a reproducer will greatly help us debugging this issue.
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>>
>>>
>>> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>>>> Patrick,
>>>>
>>>> I'm afraid there is no simple way to check this. The main reason
>>>> being that OMPI use handles for MPI objects, and these handles are
>>>> not tracked by the library, they are supposed to be provided by the
>>>> user for each call. In your case, as you already called
>>>> MPI_Type_free on the datatype, you cannot produce a valid handle.
>>>>
>>>> There might be a trick. If the datatype is manipulated with any
>>>> Fortran MPI functions, then we convert the handle (which in fact is
>>>> a pointer) to an index into a pointer array structure. Thus, the
>>>> index will remain used, and can therefore be used to convert back
>>>> into a valid datatype pointer, until OMPI completely releases the
>>>> datatype. Look into the ompi_datatype_f_to_c_table table to see the
>>>> datatypes that exist and get their pointers, and then use these
>>>> pointers as arguments to ompi_datatype_dump() to see if any of
>>>> these existing datatypes are the ones you define.
>>>>
>>>> George.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>>>> <users@lists.open-mpi.org> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     I'm trying to solve a memory leak since my new implementation of
>>>>     communications

Re: [OMPI users] MPI_type_free question

2020-12-03 Thread Patrick Bégou via users
Hi George and Gilles,

Thanks George for your suggestion. Is it valid for the 4.0.5 and 3.1
OpenMPI versions? I will have a look today at these tables. Maybe I will
write a small piece of code just creating and freeing subarray datatypes.
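
Such a stripped-down test (a hypothetical sketch with made-up sizes, not
the reproducer attached later in this thread) could just create, commit
and free a 2D subarray type in a loop and watch the handle index. Note
that the leak discussed here only showed up with the fast interconnect,
so a loop without any communication may well stay flat:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int sizes[2]    = {1024, 1024};   /* full local 2D array   */
    int subsizes[2] = {128, 128};     /* exchanged block       */
    int starts[2]   = {0, 0};

    for (int step = 0; step < 100000; step++) {
        MPI_Datatype sub;
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_FORTRAN, MPI_INT, &sub);
        MPI_Type_commit(&sub);

        /* a real test would use the type in MPI_Alltoallw here */

        MPI_Fint idx = MPI_Type_c2f(sub);   /* index before the free */
        MPI_Type_free(&sub);

        if (step % 10000 == 0) {
            printf("step %d: last datatype had Fortran index %d\n",
                   step, (int)idx);
        }
    }

    MPI_Finalize();
    return 0;
}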

Thanks Gilles for suggesting disabling the interconnect. It is a good
fast test and yes, *with "mpirun --mca pml ob1 --mca btl tcp,self" I
have no memory leak*. So this explains the differences between my laptop
and the cluster.
Is the implementation of type management so different from 1.7.3?

A PhD student tells me he also has some trouble with this code on an
Omnipath-based cluster. I will have to investigate that too, but I am not
sure it is the same problem.

Patrick

Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
> Patrick,
>
>
> based on George's idea, a simpler check is to retrieve the Fortran
> index via the (standard) MPI_Type_c2f() function
>
> after you create a derived datatype.
>
>
> If the index keeps growing forever even after you MPI_Type_free(),
> then this clearly indicates a leak.
>
> Unfortunately, this simple test cannot be used to definitely rule out
> any memory leak.
>
>
> Note you can also
>
> mpirun --mca pml ob1 --mca btl tcp,self ...
>
> in order to force communications over TCP/IP and hence rule out any
> memory leak that could be triggered by your fast interconnect.
>
>
>
> In any case, a reproducer will greatly help us debugging this issue.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>> Patrick,
>>
>> I'm afraid there is no simple way to check this. The main reason
>> being that OMPI use handles for MPI objects, and these handles are
>> not tracked by the library, they are supposed to be provided by the
>> user for each call. In your case, as you already called MPI_Type_free
>> on the datatype, you cannot produce a valid handle.
>>
>> There might be a trick. If the datatype is manipulated with any
>> Fortran MPI functions, then we convert the handle (which in fact is a
>> pointer) to an index into a pointer array structure. Thus, the index
>> will remain used, and can therefore be used to convert back into a
>> valid datatype pointer, until OMPI completely releases the datatype.
>> Look into the ompi_datatype_f_to_c_table table to see the datatypes
>> that exist and get their pointers, and then use these pointers as
>> arguments to ompi_datatype_dump() to see if any of these existing
>> datatypes are the ones you define.
>>
>> George.
>>
>>
>>
>>
>> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users
>> <users@lists.open-mpi.org> wrote:
>>
>>     Hi,
>>
>>     I'm trying to solve a memory leak since my new implementation of
>>     communications based on MPI_AllToAllW and MPI_type_Create_SubArray
>>     calls.  Arrays of SubArray types are created/destroyed at each
>>     time step and used for communications.
>>
>>     On my laptop the code runs fine (running for 15000 temporal
>>     itérations on 32 processes with oversubscription) but on our
>>     cluster memory used by the code increase until the OOMkiller stop
>>     the job. On the cluster we use IB QDR for communications.
>>
>>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>>     OpenMPI (3.1 or 4.0.5 tested), same sources of the fortran code on
>>     the laptop and on the cluster.
>>
>>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster do not
>>     show the problem (resident memory do not increase and we ran
>>     10 temporal iterations)
>>
>>     MPI_type_free manual says that it "/Marks the datatype object
>>     associated with datatype for deallocation/". But  how can I check
>>     that the deallocation is really done ?
>>
>>     Thanks for ant suggestions.
>>
>>     Patrick
>>



[OMPI users] MPI_type_free question

2020-12-03 Thread Patrick Bégou via users
Hi,

I'm trying to solve a memory leak that appeared with my new
implementation of communications based on MPI_Alltoallw and
MPI_Type_create_subarray calls. Arrays of subarray types are
created/destroyed at each time step and used for communications.
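
The structure in question looks roughly like the following (a simplified
sketch with made-up block sizes, not the actual application code): every
time step builds one send and one receive subarray type per peer, uses
them in MPI_Alltoallw, then frees them.

#include <mpi.h>
#include <stdlib.h>

/* One time step: every rank exchanges B x B blocks of a (P*B) x B local
 * array with every other rank.  The block offsets are encoded in
 * per-peer subarray datatypes, created and freed at each step.
 * P must be the size of 'comm'. */
static void exchange_step(int *local, int *recvb, int P, int B, MPI_Comm comm)
{
    int sizes[2]    = {P * B, B};
    int subsizes[2] = {B, B};

    int          *counts = malloc(P * sizeof(int));
    int          *displs = malloc(P * sizeof(int));
    MPI_Datatype *stypes = malloc(P * sizeof(MPI_Datatype));
    MPI_Datatype *rtypes = malloc(P * sizeof(MPI_Datatype));

    for (int p = 0; p < P; p++) {
        int starts[2] = {p * B, 0};    /* block p of the local array    */
        counts[p] = 1;
        displs[p] = 0;                 /* offsets live in the datatypes */
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_INT, &stypes[p]);
        MPI_Type_commit(&stypes[p]);
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_INT, &rtypes[p]);
        MPI_Type_commit(&rtypes[p]);
    }

    MPI_Alltoallw(local, counts, displs, stypes,
                  recvb, counts, displs, rtypes, comm);

    for (int p = 0; p < P; p++) {      /* freed every time step */
        MPI_Type_free(&stypes[p]);
        MPI_Type_free(&rtypes[p]);
    }
    free(counts); free(displs); free(stypes); free(rtypes);
}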

On my laptop the code runs fine (running for 15000 temporal iterations
on 32 processes with oversubscription), but on our cluster the memory
used by the code increases until the OOM killer stops the job. On the
cluster we use IB QDR for communications.

Same Gcc/Gfortran 7.3 (built from sources), same sources of OpenMPI (3.1
or 4.0.5 tested), same sources of the fortran code on the laptop and on
the cluster.

Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not show the
problem (resident memory does not increase and we ran 10 temporal
iterations).

The MPI_Type_free manual says that it "/Marks the datatype object
associated with datatype for deallocation/". But how can I check that the
deallocation is really done?

Thanks for any suggestions.

Patrick



Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Patrick Bégou via users
Thanks for all these suggestions. I'll try to create a small test
reproducing this behavior and try the different parameters.
I do not use MPI I/O directly but parallel HDF5, which relies on MPI I/O.
NFS is the easiest way to share storage between nodes on a small
cluster. BeeGFS or Lustre require a bigger (additional) infrastructure.
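
A minimal parallel HDF5 write test of the kind mentioned above (a sketch
only; the file name, dataset name and sizes are made up, and error
checking is omitted) could write one contiguous slab per rank through the
HDF5 MPI-IO driver, so it can be timed with and without ompio:

#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t n_local = 1024 * 1024;            /* elements per rank */
    hsize_t dims[1]  = { n_local * (hsize_t)nprocs };
    hsize_t count[1] = { n_local };
    hsize_t start[1] = { n_local * (hsize_t)rank };

    int *buf = malloc(n_local * sizeof(int));
    for (hsize_t i = 0; i < n_local; i++) buf[i] = rank;

    /* open the file collectively through the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("io_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* each rank writes its own contiguous slab, collectively */
    hid_t memspace = H5Screate_simple(1, count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    free(buf);
    MPI_Finalize();
    return 0;
}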

Patrick

Le 03/12/2020 à 15:38, Gabriel, Edgar via users a écrit :
> the reason for potential performance issues on NFS are very different from 
> Lustre. Basically, depending on your use-case and the NFS configuration, you 
> have to enforce different locking policy to ensure correct output files. The 
> default value for chosen for ompio is the most conservative setting, since 
> this was the only setting that we found that would result in a correct output 
> file for all of our tests.  You can change settings to see whether other 
> options would work for you.
>
> The parameter that you need to work with is fs_ufs_lock_algorithm. Setting it 
> to 1 will completely disable it (and most likely lead to the best 
> performance), setting it to 3 is a middle ground (lock specific ranges) and 
> similar to what ROMIO does. So e.g.
>
> mpiexec -n 16 --mca fs_ufs_lock_algorithm 1 ./mytests
>
> That being said, if you google NFS + MPI I/O, you will find a  ton of 
> document and reasons for potential problems, so using MPI I/O on top of NFS 
> (whether OMPIO or ROMIO) is always at your own risk.
> Thanks
>
> Edgar
>
> -Original Message-
> From: users  On Behalf Of Gilles 
> Gouaillardet via users
> Sent: Thursday, December 3, 2020 4:46 AM
> To: Open MPI Users 
> Cc: Gilles Gouaillardet 
> Subject: Re: [OMPI users] Parallel HDF5 low performance
>
> Patrick,
>
> glad to hear you will upgrade Open MPI thanks to this workaround!
>
> ompio has known performance issues on Lustre (this is why ROMIO is still the 
> default on this filesystem) but I do not remember such performance issues 
> have been reported on a NFS filesystem.
>
> Sharing a reproducer will be very much appreciated in order to improve ompio
>
> Cheers,
>
> Gilles
>
> On Thu, Dec 3, 2020 at 6:05 PM Patrick Bégou via users wrote:
>> Thanks Gilles,
>>
>> this is the solution.
>> I will set OMPI_MCA_io=^ompio automatically when loading the parallel
>> hdf5 module on the cluster.
>>
>> I was tracking this problem for several weeks but not looking in the 
>> right direction (testing NFS server I/O, network bandwidth.)
>>
>> I think we will now move definitively to modern OpenMPI implementations.
>>
>> Patrick
>>
>> Le 03/12/2020 à 09:06, Gilles Gouaillardet via users a écrit :
>>> Patrick,
>>>
>>>
>>> In recent Open MPI releases, the default component for MPI-IO is 
>>> ompio (and no more romio)
>>>
>>> unless the file is on a Lustre filesystem.
>>>
>>>
>>> You can force romio with
>>>
>>> mpirun --mca io ^ompio ...
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>> On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
>>>> Hi,
>>>>
>>>> I'm using an old (but required by the codes) version of hdf5 
>>>> (1.8.12) in parallel mode in 2 fortran applications. It relies on 
>>>> MPI/IO. The storage is NFS mounted on the nodes of a small cluster.
>>>>
>>>> With OpenMPI 1.7 it runs fine but using modern OpenMPI 3.1 or 4.0.5 
>>>> the I/Os are 10x to 100x slower. Are there fundamentals changes in 
>>>> MPI/IO for these new releases of OpenMPI and a solution to get back 
>>>> to the IO performances with this parallel HDF5 release ?
>>>>
>>>> Thanks for your advices
>>>>
>>>> Patrick
>>>>



Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Patrick Bégou via users
Thanks Gilles,

this is the solution.
I will set OMPI_MCA_io=^ompio automatically when loading the parallel
hdf5 module on the cluster.

I was tracking this problem for several weeks but not looking in the
right direction (testing NFS server I/O, network bandwidth.)

I think we will now move definitively to modern OpenMPI implementations.

Patrick

Le 03/12/2020 à 09:06, Gilles Gouaillardet via users a écrit :
> Patrick,
>
>
> In recent Open MPI releases, the default component for MPI-IO is ompio
> (and no more romio)
>
> unless the file is on a Lustre filesystem.
>
>
> You can force romio with
>
> mpirun --mca io ^ompio ...
>
>
> Cheers,
>
>
> Gilles
>
> On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
>> Hi,
>>
>> I'm using an old (but required by the codes) version of hdf5 (1.8.12) in
>> parallel mode in 2 fortran applications. It relies on MPI/IO. The
>> storage is NFS mounted on the nodes of a small cluster.
>>
>> With OpenMPI 1.7 it runs fine but using modern OpenMPI 3.1 or 4.0.5 the
>> I/Os are 10x to 100x slower. Are there fundamentals changes in MPI/IO
>> for these new releases of OpenMPI and a solution to get back to the IO
>> performances with this parallel HDF5 release ?
>>
>> Thanks for your advices
>>
>> Patrick
>>



[OMPI users] Parallel HDF5 low performance

2020-12-02 Thread Patrick Bégou via users
Hi,

I'm using an old (but required by the codes) version of HDF5 (1.8.12) in
parallel mode in 2 Fortran applications. It relies on MPI-IO. The
storage is NFS mounted on the nodes of a small cluster.

With OpenMPI 1.7 it runs fine, but with modern OpenMPI 3.1 or 4.0.5 the
I/Os are 10x to 100x slower. Are there fundamental changes in MPI-IO in
these new releases of OpenMPI, and is there a solution to get back the
I/O performance with this parallel HDF5 release?

Thanks for your advices

Patrick



Re: [OMPI users] mpirun only work for 1 processor

2020-06-04 Thread Patrick Bégou via users
Ha Chi,

first, running MPI applications as root is not a good idea. You should
create users without admin rights on your Rocks cluster for everything
that is not system management.

Let me know a little more about how you launch this:
1) Do you run "mpirun" from the Rocks frontend or from a node?
2) ssh works from the frontend to the nodes, but does it also work
BETWEEN 2 nodes?

Patrick

Le 04/06/2020 à 10:02, Hà Chi Nguyễn Nhật a écrit :
> Dear Patrick, 
> Thanks so much for your reply, 
> Yes, we use ssh to log on the node. From the frontend, we can ssh to
> the nodes without password. 
> the mpirun --version in all 3 nodes are identical, openmpi 2.1.1, and
> same place when testing with "whereis mpirun"
> So is there any problem with mpirun causing it to not launch to other
> nodes?
>
> Regards
> HaChi
>
> On Thu, 4 Jun 2020 at 14:35, Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
> Hi Ha Chi
>
> do you use a batch scheduler with Rocks Cluster or do you log on
> the node with ssh ?
> If ssh, can you check  that you can ssh from one node to the other
> without password ?
> Ping just says the network is alive, not that you can connect.
>
> Patrick
>
> Le 04/06/2020 à 09:06, Hà Chi Nguyễn Nhật via users a écrit :
>> Dear Open MPI users, 
>>
>> Please help me to find the solution for the problem using mpirun
>> with a ROCK cluster, 3 nodes. I use the command:
>> mpirun -np 12 --machinefile machinefile.txt --allow-run-as-root
>> ./wrf.exe
>> But mpirun was unable to access other nodes (as the below photo).
>> But actually I checked the connection of three nodes by command
>> "ping node's IP", they are well connected.
>> [screenshot: 2.png]
>> My machinefile.txt includes IP of three nodes (frontend and 2
>> connected nodes), like this:
>> 10.1.85.1 slots=4
>> 10.1.85.254 slots=4
>> 10.1.85.253 slots=4
>>
>> My cluster is built by a ROCK cluster, with 3 nodes, CPUS 8 per
>> each node.
>> *My question is: How can I connect 3 nodes to run together?*
>> *
>> *
>> Please advise
>> Thanks
>> Ha Chi
>>
>> -- 
>> *Ms. Nguyen Nhat Ha Chi*
>> PhD student
>> Environmental Engineering and Management 
>> Asian Institute of Technology (AIT)
>> Thailand
>
>
>
>
> -- 
> *Ms. Nguyen Nhat Ha Chi*
> PhD student
> Environmental Engineering and Management 
> Asian Institute of Technology (AIT)
> Thailand




Re: [OMPI users] mpirun only work for 1 processor

2020-06-04 Thread Patrick Bégou via users
Hi Ha Chi

do you use a batch scheduler with your Rocks cluster or do you log on to
the nodes with ssh?
If ssh, can you check that you can ssh from one node to the other
without a password?
Ping just says the network is alive, not that you can connect.

Patrick

Le 04/06/2020 à 09:06, Hà Chi Nguyễn Nhật via users a écrit :
> Dear Open MPI users, 
>
> Please help me to find the solution for the problem using mpirun with
> a ROCK cluster, 3 nodes. I use the command:
> mpirun -np 12 --machinefile machinefile.txt --allow-run-as-root ./wrf.exe
> But mpirun was unable to access other nodes (as the below photo). But
> actually I checked the connection of three nodes by command "ping
> node's IP", they are well connected.
> [screenshot: 2.png]
> My machinefile.txt includes IP of three nodes (frontend and 2
> connected nodes), like this:
> 10.1.85.1 slots=4
> 10.1.85.254 slots=4
> 10.1.85.253 slots=4
>
> My cluster is built by a ROCK cluster, with 3 nodes, CPUS 8 per each node.
> *My question is: How can I connect 3 nodes to run together?*
> *
> *
> Please advise
> Thanks
> Ha Chi
>
> -- 
> *Ms. Nguyen Nhat Ha Chi*
> PhD student
> Environmental Engineering and Management 
> Asian Institute of Technology (AIT)
> Thailand




[OMPI users] OpenMPI 4.0.3 without ucx

2020-05-10 Thread Patrick Bégou via users
Hi all,

I've built OpenMPI 4.0.3 with GCC 9.3.0, but on the server UCX was not
available when I set --with-ucx. I removed this option and it compiles
fine without UCX. However I see a strange behavior: when using mpirun
I must explicitly exclude ucx to avoid an error. In my module file I
have to set

setenv         OMPI_MCA_pml ^ucx
setenv         OMPI_MCA_btl_openib_allow_ib 1

Is this the normal behavior or a small bug when UCX is not used? I saw
a discussion about this but it was before the OpenMPI 4.0.2 release.

Patrick

From ompi_info:

Configure command line: '--with-lustre' '--with-slurm' '--with-hwloc'
'--enable-mpirun-prefix-by-default'
'--prefix=//GCC9.3/openmpi/4.0.3'   '--with-pmi'
'--enable-mpi1-compatibility'

MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)



Re: [OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-09 Thread Patrick Bégou via users
This hardware has been working for nearly 10 years with several
generations of nodes and OpenMPI without any problem. Today it is
possible to find refurbished parts at low prices on the web, which can
help in building small clusters. It is really more efficient than 10Gb
Ethernet for parallel codes due to the very low latency.
Now I'm moving to 100-200Gb/s InfiniBand architectures... for the next
10 years. ;-)

Patrick

Le 09/05/2020 à 16:09, Heinz, Michael William a écrit :
> That's it! I was trying to remember what the setting was but I haven't
> worked on those HCAs since around 2012, so it was faint.
>
> That said, I found the Intel TrueScale manual online
> at 
> https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/OFED_Host_Software_UserGuide_G91902_06.pdf
> <https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/OFED_Host_Software_UserGuide_G91902_06.pdf#page72>
>
> TS is the same hardware as the old QLogic QDR HCAs so the manual might
> be helpful to you in the future.
>
> Sent from my iPad
>
>> On May 9, 2020, at 9:52 AM, Patrick Bégou via users
>>  wrote:
>>
>> 
>> Le 08/05/2020 à 21:56, Prentice Bisbal via users a écrit :
>>>
>>> We often get the following errors when more than one job runs on the
>>> same compute node. We are using Slurm with OpenMPI. The IB cards are
>>> QLogic using PSM:
>>>
>>> 10698ipath_userinit: assign_context command failed: Network is down
>>> node01.10698can't open /dev/ipath, network down (err=26)
>>> node01.10703ipath_userinit: assign_context command failed: Network
>>> is down
>>> node01.10703can't open /dev/ipath, network down (err=26)
>>> node01.10701ipath_userinit: assign_context command failed: Network
>>> is down
>>> node01.10701can't open /dev/ipath, network down (err=26)
>>> node01.10700ipath_userinit: assign_context command failed: Network
>>> is down
>>> node01.10700can't open /dev/ipath, network down (err=26)
>>> node01.10697ipath_userinit: assign_context command failed: Network
>>> is down
>>> node01.10697can't open /dev/ipath, network down (err=26)
>>> --
>>> PSM was unable to open an endpoint. Please make sure that the
>>> network link is
>>> active on the node and the hardware is functioning.
>>>
>>> Error: Could not detect network connectivity
>>> --
>>>
>>> Any Ideas how to fix this?
>>>
>>> -- 
>>> Prentice 
>>
>>
>> Hi Prentice,
>>
>> This is not openMPI related but merely due to your hardware. I've not
>> many details but I think this occurs when several jobs share the same
>> node and you have a large number of cores on these nodes (> 14). If
>> this is the case:
>>
>> On Qlogic (I'm using such a hardware at this time) you have 16
>> channel for communication on each HBA and, if I remember what I had
>> read many years ago, 2 are dedicated to the system. When launching
>> MPI applications, each process of a job request for it's own
>> dedicated channel if available, else they share ALL the available
>> channels. So if a second job starts on the same node it do not
>> remains any available channel.
>>
>> To avoid this situation I force sharing the channels (my nodes have
>> 20 codes) by 2 MPI processes. You can set this with a simple
>> environment variable. On all my cluster nodes I create the file:
>>
>> */etc/profile.d/ibsetcontext.sh*
>>
>> And it contains:
>>
>> # allow 4 processes to share an hardware MPI context
>> # in infiniband with PSM
>> *export PSM_RANKS_PER_CONTEXT=2*
>>
>> Of course if some people manage to oversubscribe on the cores (more
>> than one process by core) it could rise again the problem but we do
>> not oversubscribe.
>>
>> Hope this can help you.
>>
>> Patrick
>>



Re: [OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-09 Thread Patrick Bégou via users
Le 08/05/2020 à 21:56, Prentice Bisbal via users a écrit :
>
> We often get the following errors when more than one job runs on the
> same compute node. We are using Slurm with OpenMPI. The IB cards are
> QLogic using PSM:
>
> 10698ipath_userinit: assign_context command failed: Network is down
> node01.10698can't open /dev/ipath, network down (err=26)
> node01.10703ipath_userinit: assign_context command failed: Network is down
> node01.10703can't open /dev/ipath, network down (err=26)
> node01.10701ipath_userinit: assign_context command failed: Network is down
> node01.10701can't open /dev/ipath, network down (err=26)
> node01.10700ipath_userinit: assign_context command failed: Network is down
> node01.10700can't open /dev/ipath, network down (err=26)
> node01.10697ipath_userinit: assign_context command failed: Network is down
> node01.10697can't open /dev/ipath, network down (err=26)
> --
> PSM was unable to open an endpoint. Please make sure that the network
> link is
> active on the node and the hardware is functioning.
>
> Error: Could not detect network connectivity
> --
>
> Any Ideas how to fix this?
>
> -- 
> Prentice 


Hi Prentice,

This is not OpenMPI related but merely due to your hardware. I don't have
many details, but I think this occurs when several jobs share the same
node and you have a large number of cores on these nodes (> 14). If this
is the case:

On QLogic (I'm using such hardware at this time) you have 16 channels
for communication on each HBA and, if I remember what I read many
years ago, 2 are dedicated to the system. When launching MPI
applications, each process of a job requests its own dedicated
channel if available; otherwise they share ALL the available channels. So
if a second job starts on the same node, there is no available channel
left.

To avoid this situation I force the channels to be shared (my nodes have
20 cores) by 2 MPI processes. You can set this with a simple environment
variable. On all my cluster nodes I create the file:

*/etc/profile.d/ibsetcontext.sh*

And it contains:

# allow 2 MPI processes to share a hardware MPI context
# in InfiniBand with PSM
*export PSM_RANKS_PER_CONTEXT=2*

Of course, if some people manage to oversubscribe the cores (more than
one process per core) it could raise the problem again, but we do not
oversubscribe.

Hope this can help you.

Patrick



Re: [OMPI users] Can't start jobs with srun.

2020-05-07 Thread Patrick Bégou via users
Le 07/05/2020 à 11:42, John Hearns via users a écrit :
> Patrick, I am sure that you have asked Dell for support on this issue?

No I didn't :-(. I was just accessing these servers for a short time to
run a benchmark and the workaround was enough. I'm not using Slurm but a
local scheduler (OAR), so the problem was not critical for my future work.


Patrick

>
> On Sun, 26 Apr 2020 at 18:09, Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
> I have also this problem on servers I'm benching at DELL's lab with
> OpenMPI-4.0.3. I've tried  a new build of OpenMPI with
> "--with-pmi2". No
> change.
> Finally my work around in the slurm script was to launch my code with
> mpirun. As mpirun was only finding one slot per nodes I have used
> "--oversubscribe --bind-to core" and checked that every process was
> binded on a separate core. It worked but do not ask me why :-)
>
> Patrick
>
> Le 24/04/2020 à 20:28, Riebs, Andy via users a écrit :
> > Prentice, have you tried something trivial, like "srun -N3
> hostname", to rule out non-OMPI problems?
> >
> > Andy
> >
> > -Original Message-
> > From: users [mailto:users-boun...@lists.open-mpi.org
> <mailto:users-boun...@lists.open-mpi.org>] On Behalf Of Prentice
> Bisbal via users
> > Sent: Friday, April 24, 2020 2:19 PM
> > To: Ralph Castain mailto:r...@open-mpi.org>>;
> Open MPI Users  <mailto:users@lists.open-mpi.org>>
> > Cc: Prentice Bisbal mailto:pbis...@pppl.gov>>
> > Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
> >
> > Okay. I've got Slurm built with pmix support:
> >
> > $ srun --mpi=list
> > srun: MPI types are...
> > srun: none
> > srun: pmix_v3
> > srun: pmi2
> > srun: openmpi
> > srun: pmix
> >
> > But now when I try to launch a job with srun, the job appears to be
> > running, but doesn't do anything - it just hangs in the running
> state
> > but doesn't do anything. Any ideas what could be wrong, or how
> to debug
> > this?
> >
> > I'm also asking around on the Slurm mailing list, too
> >
> > Prentice
> >
> > On 4/23/20 3:03 PM, Ralph Castain wrote:
> >> You can trust the --mpi=list. The problem is likely that OMPI
> wasn't configured --with-pmi2
> >>
> >>
> >>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users
> <users@lists.open-mpi.org> wrote:
> >>>
> >>> --mpi=list shows pmi2 and openmpi as valid values, but if I
> set --mpi= to either of them, my job still fails. Why is that? Can
> I not trust the output of --mpi=list?
> >>>
> >>> Prentice
> >>>
> >>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
> >>>> No, but you do have to explicitly build OMPI with non-PMIx
> support if that is what you are going to use. In this case, you
> need to configure OMPI --with-pmi2=
> >>>>
> >>>> You can leave off the path if Slurm (i.e., just
> "--with-pmi2") was installed in a standard location as we should
> find it there.
> >>>>
> >>>>
> >>>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users
> <users@lists.open-mpi.org> wrote:
> >>>>>
> >>>>> It looks like it was built with PMI2, but not PMIx:
> >>>>>
> >>>>> $ srun --mpi=list
> >>>>> srun: MPI types are...
> >>>>> srun: none
> >>>>> srun: pmi2
> >>>>> srun: openmpi
> >>>>>
> >>>>> I did launch the job with srun --mpi=pmi2 
> >>>>>
> >>>>> Does OpenMPI 4 need PMIx specifically?
> >>>>>
> >>>>>
> >>>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
> >>>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
> >>>>>>
> >>>>>>
> >>>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users
> <users@lists.open-mpi.org> wrote:
> >>>>>>>
> >>>>>>> I'm using OpenMPI

Re: [OMPI users] Can't start jobs with srun.

2020-04-26 Thread Patrick Bégou via users
I also have this problem on servers I'm benchmarking at Dell's lab with
OpenMPI 4.0.3. I tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally my workaround in the Slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

Le 24/04/2020 à 20:28, Riebs, Andy via users a écrit :
> Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
> out non-OMPI problems?
>
> Andy
>
> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
> Bisbal via users
> Sent: Friday, April 24, 2020 2:19 PM
> To: Ralph Castain ; Open MPI Users 
> 
> Cc: Prentice Bisbal 
> Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
>
> Okay. I've got Slurm built with pmix support:
>
> $ srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: pmix_v3
> srun: pmi2
> srun: openmpi
> srun: pmix
>
> But now when I try to launch a job with srun, the job appears to be 
> running, but doesn't do anything - it just hangs in the running state 
> but doesn't do anything. Any ideas what could be wrong, or how to debug 
> this?
>
> I'm also asking around on the Slurm mailing list, too
>
> Prentice
>
> On 4/23/20 3:03 PM, Ralph Castain wrote:
>> You can trust the --mpi=list. The problem is likely that OMPI wasn't 
>> configured --with-pmi2
>>
>>
>>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
>>>  wrote:
>>>
>>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
>>> either of them, my job still fails. Why is that? Can I not trust the output 
>>> of --mpi=list?
>>>
>>> Prentice
>>>
>>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
 No, but you do have to explicitly build OMPI with non-PMIx support if that 
 is what you are going to use. In this case, you need to configure OMPI 
 --with-pmi2=

 You can leave off the path if Slurm (i.e., just "--with-pmi2") was 
 installed in a standard location as we should find it there.


> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
>  wrote:
>
> It looks like it was built with PMI2, but not PMIx:
>
> $ srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: pmi2
> srun: openmpi
>
> I did launch the job with srun --mpi=pmi2 
>
> Does OpenMPI 4 need PMIx specifically?
>
>
> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>> Is Slurm built with PMIx support? Did you tell srun to use it?
>>
>>
>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>>  wrote:
>>>
>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software 
>>> with a very simple hello, world MPI program that I've used reliably for 
>>> years. When I submit the job through slurm and use srun to launch the 
>>> job, I get these errors:
>>>
>>> *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***and potentially your MPI job)
>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
>>> completed successfully, but am not able to aggregate error messages, 
>>> and not able to guarantee that all other processes were killed!
>>> *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***and potentially your MPI job)
>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
>>> completed successfully, but am not able to aggregate error messages, 
>>> and not able to guarantee that all other processes were killed!
>>>
>>> If I run the same job, but use mpiexec or mpirun instead of srun, the 
>>> jobs run just fine. I checked ompi_info to make sure OpenMPI was 
>>> compiled with Slurm support:
>>>
>>> $ ompi_info | grep slurm
>>>Configure command line: 
>>> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
>>> '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
>>> '--with-slurm' '--with-psm'
>>>   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component 
>>> v4.0.3)
>>>   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component 
>>> v4.0.3)
>>>   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component 
>>> v4.0.3)
>>>MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component 
>>> v4.0.3)
>>>
>>> Any ideas what could be wrong? Do you need any additional information?
>>>
>>> Prentice
>>>
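
(When debugging this kind of Slurm/Open MPI mismatch, a quick sanity check is
to compare what each side supports before launching; the sketch below assumes
a test binary called ./hello_mpi and a Slurm build that exposes pmix_v3, as
shown earlier in the thread.)

    # what the Slurm side offers
    srun --mpi=list

    # what the Open MPI side was built with
    ompi_info | grep -i -E 'pmix|slurm'

    # then launch with a flavor supported on both sides, e.g.
    srun --mpi=pmix_v3 -N 3 -n 3 ./hello_mpi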



Re: [OMPI users] opal_path_nfs freeze

2020-04-23 Thread Patrick Bégou via users
Hi Jeff

As we say in French, "dans le mille!" (bull's-eye) - you were right.
I'm not the admin of these servers, and an "mpirun not found" check seemed
sufficient in my mind. It wasn't.

As I had already deployed OpenMPI 4.0.2, I launched a new build after setting my
LD_LIBRARY_PATH to reach the OpenMPI 4.0.2 installed libs before all other
locations, and all tests were successful.

I think this should be handled in the test script, as we usually
run "make check" before "make install". Properly setting LD_LIBRARY_PATH
to reach the temporary directory where the libs are built first, before
launching the tests, would be enough to avoid this wrong behavior.
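
Something along these lines before running the tests would probably be enough
(a rough sketch; the module name and the build directory are just examples of
what may differ on another machine):

    # make sure no previously installed Open MPI can be found first
    module unload openmpi        # if environment modules are used (example name)
    export LD_LIBRARY_PATH=      # or strip the old Open MPI lib dirs from it
    which mpirun                 # should now find nothing

    cd openmpi-4.0.2/build       # example build directory
    make check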

I did not wait for an hour in front of my keyboard :-D; it was lunch time,
and I was thinking of some timeout problem, as NFS means... network!

Thanks a lot for providing the solution so quickly.

Patrick

Le 22/04/2020 à 20:17, Jeff Squyres (jsquyres) a écrit :
> The test should only take a few moments; no need to let it sit for a
> full hour.
>
> I have seen this kind of behavior before if you have an Open MPI
> installation in your PATH / LD_LIBRARY_PATH already, and then you
> invoke "make check".
>
> Because the libraries may have the same names and/or .so version numbers,
> there may be confusion in the test setup scripts about exactly which
> libraries to use (the installed versions or the ones you just built /
> are trying to test).
>
> This is a long way of saying: make sure that you have no other Open
> MPI installation findable in your PATH / LD_LIBRARY_PATH and then try
> running `make check` again.
>
>
>> On Apr 21, 2020, at 2:37 PM, Patrick Bégou via users
>> mailto:users@lists.open-mpi.org>> wrote:
>>
>> Hi OpenMPI maintainers,
>>
>>
>> I have temporary access to servers with AMD Epyc processors running
>> RHEL7.
>>
>> I'm trying to deploy OpenMPI with several setups, but each time "make
>> check" fails on *opal_path_nfs*. This test freezes forever, consuming
>> no CPU resources.
>>
>> After nearly one hour I killed the process.
>>
>> *_In test-suite.log I have:_*
>>
>> 
>>    Open MPI v3.1.x-201810100324-c8e9819: test/util/test-suite.log
>> 
>>
>> # TOTAL: 3
>> # PASS:  2
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  1
>> # XPASS: 0
>> # ERROR: 0
>>
>> .. contents:: :depth: 2
>>
>> FAIL: opal_path_nfs
>> ===
>>
>> FAIL opal_path_nfs (exit status: 137)
>>
>>
>> _*In opal_path_nfs.out I have a list of path:*_
>>
>> /proc proc
>> /sys sysfs
>> /dev devtmpfs
>> /run tmpfs
>> / xfs
>> /sys/kernel/security securityfs
>> /dev/shm tmpfs
>> /dev/pts devpts
>> /sys/fs/cgroup tmpfs
>> /sys/fs/cgroup/systemd cgroup
>> /sys/fs/pstore pstore
>> /sys/firmware/efi/efivars efivarfs
>> /sys/fs/cgroup/hugetlb cgroup
>> /sys/fs/cgroup/pids cgroup
>> /sys/fs/cgroup/net_cls,net_prio cgroup
>> /sys/fs/cgroup/devices cgroup
>> /sys/fs/cgroup/cpu,cpuacct cgroup
>> /sys/fs/cgroup/freezer cgroup
>> /sys/fs/cgroup/perf_event cgroup
>> /sys/fs/cgroup/cpuset cgroup
>> /sys/fs/cgroup/memory cgroup
>> /sys/fs/cgroup/blkio cgroup
>> /proc/sys/fs/binfmt_misc autofs
>> /sys/kernel/debug debugfs
>> /dev/hugepages hugetlbfs
>> /dev/mqueue mqueue
>> /sys/kernel/config configfs
>> /proc/sys/fs/binfmt_misc binfmt_misc
>> /boot/efi vfat
>> /local xfs
>> /var xfs
>> /tmp xfs
>> /var/lib/nfs/rpc_pipefs rpc_pipefs
>> /home nfs
>> /cm/shared nfs
>> /scratch nfs
>> /run/user/1013 tmpfs
>> /run/user/1010 tmpfs
>> /run/user/1046 tmpfs
>> /run/user/1015 tmpfs
>> /run/user/1121 tmpfs
>> /run/user/1113 tmpfs
>> /run/user/1126 tmpfs
>> /run/user/1002 tmpfs
>> /run/user/1130 tmpfs
>> /run/user/1004 tmpfs
>>
>> _*In opal_path_nfs.log:*_
>>
>> FAIL opal_path_nfs (exit status: 137)
>>
>>
>> The compiler is GCC 9.2.
>>
>> I've also tested openmpi-4.0.3 built with GCC 8.2. Same problem.
>>
>> Thanks for your help.
>>
>> Patrick
>>
>>
>
>
> -- 
> Jeff Squyres
> jsquy...@cisco.com <mailto:jsquy...@cisco.com>
>



[OMPI users] opal_path_nfs freeze

2020-04-21 Thread Patrick Bégou via users
Hi OpenMPI maintainers,


I have temporary access to servers with AMD Epyc processors running RHEL7.

I'm trying to deploy OpenMPI with several setups, but each time "make
check" fails on *opal_path_nfs*. This test freezes forever, consuming no
CPU resources.

After nearly one hour I killed the process.

*_In test-suite.log I have:_*


   Open MPI v3.1.x-201810100324-c8e9819: test/util/test-suite.log


# TOTAL: 3
# PASS:  2
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: opal_path_nfs
===

FAIL opal_path_nfs (exit status: 137)


_*In opal_path_nfs.out I have a list of path:*_

/proc proc
/sys sysfs
/dev devtmpfs
/run tmpfs
/ xfs
/sys/kernel/security securityfs
/dev/shm tmpfs
/dev/pts devpts
/sys/fs/cgroup tmpfs
/sys/fs/cgroup/systemd cgroup
/sys/fs/pstore pstore
/sys/firmware/efi/efivars efivarfs
/sys/fs/cgroup/hugetlb cgroup
/sys/fs/cgroup/pids cgroup
/sys/fs/cgroup/net_cls,net_prio cgroup
/sys/fs/cgroup/devices cgroup
/sys/fs/cgroup/cpu,cpuacct cgroup
/sys/fs/cgroup/freezer cgroup
/sys/fs/cgroup/perf_event cgroup
/sys/fs/cgroup/cpuset cgroup
/sys/fs/cgroup/memory cgroup
/sys/fs/cgroup/blkio cgroup
/proc/sys/fs/binfmt_misc autofs
/sys/kernel/debug debugfs
/dev/hugepages hugetlbfs
/dev/mqueue mqueue
/sys/kernel/config configfs
/proc/sys/fs/binfmt_misc binfmt_misc
/boot/efi vfat
/local xfs
/var xfs
/tmp xfs
/var/lib/nfs/rpc_pipefs rpc_pipefs
/home nfs
/cm/shared nfs
/scratch nfs
/run/user/1013 tmpfs
/run/user/1010 tmpfs
/run/user/1046 tmpfs
/run/user/1015 tmpfs
/run/user/1121 tmpfs
/run/user/1113 tmpfs
/run/user/1126 tmpfs
/run/user/1002 tmpfs
/run/user/1130 tmpfs
/run/user/1004 tmpfs

_*In opal_path_nfs.log:*_

FAIL opal_path_nfs (exit status: 137)


The compiler is GCC 9.2.

I've also tested openmpi-4.0.3 built with GCC 8.2. Same problem.

Thanks for your help.

Patrick




Re: [OMPI users] file/process write speed is not scalable

2020-04-14 Thread Patrick Bégou via users
Hi David,

could you specify which version of OpenMPI you are using?
I also have some parallel I/O trouble with one code but have not
investigated it yet.
Thanks

Patrick

Le 13/04/2020 à 17:11, Dong-In Kang via users a écrit :
>
>  Thank you for your suggestion.
> I am more concerned about the poor performance of the one MPI
> process per socket case.
> That model fits my real workload better.
> The performance that I see is a lot worse than what the underlying
> hardware can support.
> The best case (all MPI processes in a single socket) is pretty good,
> which is about 80+% of underlying hardware's speed.
> However, the one-MPI-process-per-socket model achieves only 30% of what I get with
> all MPI processes in a single socket.
> Both are doing the same thing - independent file write.
> I used all the OSTs available.
>
> As a reference point, I did the same test on ramdisk.
> In both cases, the performance scales very well, and their
> performances are close.
>
> There seems to be extra overhead when multiple sockets are used for
> independent file I/O with Lustre.
> I don't know what causes that overhead.
>
> Thanks,
> David
>
>
> On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users
> mailto:users@lists.open-mpi.org>> wrote:
>
> Note there could be some NUMA-IO effect, so I suggest you compare
> running every MPI task on socket 0 to running every MPI task on
> socket 1, and so on, and then compare that to running one MPI task per
> socket.
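
(As a rough illustration of those three runs, assuming 8 ranks and IOR as the
benchmark binary; the --cpu-set lists must be adapted to the machine's actual
core numbering, and the same IOR arguments should be appended to each run.)

    # all ranks packed on socket 0 (logical CPUs 0-7 in this example)
    mpirun -np 8 --cpu-set 0-7 --bind-to core ./ior
    # all ranks packed on socket 1 (logical CPUs 8-15 in this example)
    mpirun -np 8 --cpu-set 8-15 --bind-to core ./ior
    # one rank per socket
    mpirun -np 8 --map-by socket --bind-to socket ./ior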
>
> Also, what performance do you measure?
> - Is this something in line with the filesystem/network expectation?
> - Or is this much higher (and in this case, you are benchmarking
> the i/o cache)?
>
> FWIW, I usually write files whose cumulated size is four times the
> node memory to avoid local caching effect
> (if you have a lot of RAM, that might take a while ...)
>
> Keep in mind Lustre is also sensitive to the file layout.
> If you write one file per task, you likely want to use all the
> available OSTs, but no striping.
> If you want to write into a single file with 1MB blocks per MPI task,
> you likely want to stripe with 1MB blocks,
> and use the same number of OSTs as MPI tasks (so each MPI task ends
> up writing to its own OST).
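
(In lfs terms, that advice would look roughly like the sketch below; the
directory and file names are placeholders, and 16 stands in for the number of
MPI tasks.)

    # file per process: one stripe per file, files spread over the OSTs by default
    lfs setstripe -c 1 /lustre/scratch/ior_dir

    # single shared file written in 1 MB blocks: 1 MB stripes, one OST per task (16 tasks here)
    lfs setstripe -S 1m -c 16 /lustre/scratch/ior_shared_file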
>
> Cheers,
>
> Gilles
>
> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
> mailto:users@lists.open-mpi.org>> wrote:
> >
> > Hi,
> >
> > I'm running IOR benchmark on a big shared memory machine with
> Lustre file system.
> > I set up IOR to use an independent file/process so that the
> aggregated bandwidth is maximized.
> > I ran N MPI processes where N < # of cores in a socket.
> > When I put those N MPI processes on a single socket, its write
> performance is scalable.
> > However, when I put those N MPI processes on N sockets (so, 1
> MPI process/socket),
> > it performance does not scale, and stays the same for more than
> 4 MPI processes.
> > I expected it would be as scalable as the case of N processes on
> a single socket.
> > But, it is not.
> >
> > I think if each MPI process writes to an independent file,
> there should be no file locking among MPI processes. However, there
> seems to be some. Is there any way to avoid that locking or
> overhead? It may not be a file locking issue, but I don't know what is
> the exact reason for the poor performance.
> >
> > Any help will be appreciated.
> >
> > David
>
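
(For reference, a file-per-process IOR run of the kind described above might
look like the sketch below; the rank count, block and transfer sizes and the
Lustre path are placeholders, and the total data written should exceed the
node memory to avoid measuring the page cache, as Gilles suggests.)

    # -F: independent file per rank, -b: per-rank block size, -t: transfer size, -w: write phase
    mpirun -np 16 --map-by socket --bind-to socket \
        ./ior -a POSIX -F -w -b 64g -t 1m -o /lustre/scratch/ior_test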