Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Michael Di Domenico via users
For whatever it's worth, running the test program on my OPA cluster
seems to work.  Well, it keeps spitting out [INFO MEMORY] lines; not
sure if it's supposed to stop at some point.

I'm running RHEL 7, gcc 10.1, OpenMPI 4.0.5rc2, with-ofi, without-{psm,ucx,verbs}.
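
Spelled out, that's roughly this configure line (from memory; I don't recall
offhand whether 4.0.5 wants --with-ofi or --with-libfabric, ./configure --help
will say, and the prefix is yours to pick):

  ./configure --with-ofi --without-psm --without-ucx --without-verbs --prefix=<install-dir>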

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
 wrote:
>
> Hi Michael
>
> Indeed I'm a little bit lost with all these parameters in OpenMPI, mainly 
> because for years it has worked just fine out of the box in all my deployments 
> on various architectures, interconnects and Linux flavors. Some weeks ago I 
> deployed OpenMPI 4.0.5 on CentOS 8 with gcc 10, Slurm and UCX on an AMD Epyc2 
> cluster with ConnectX-6, and it just works fine. It is the first time I've 
> had such trouble deploying this library.
>
> If you have my mail posted on 25/01/2021 in this discussion at 18h54 (maybe 
> Paris TZ), there is a small test case attached that shows the problem. Did 
> you get it, or did the list strip the attachment? I can provide it again.
>
> Many thanks
>
> Patrick
>
> On 26/01/2021 at 19:25, Heinz, Michael William wrote:
>
> Patrick how are you using original PSM if you’re using Omni-Path hardware? 
> The original PSM was written for QLogic DDR and QDR Infiniband adapters.
>
> As far as needing openib - the issue is that the PSM2 MTL doesn’t support a 
> subset of MPI operations that we previously used the pt2pt BTL for. For 
> recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
>
> Is there any chance you can give us a sample MPI app that reproduces the 
> problem? I can’t think of another way I can give you more help without being 
> able to see what’s going on. It’s always possible there’s a bug in the PSM2 
> MTL but it would be surprising at this point.
>
> Sent from my iPad
>
> On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
>  wrote:
>
> 
> Hi all,
>
> I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged 
> with Nix was running using openib, so I added the --with-verbs option to set 
> up this module.
>
> What I can see now is that:
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca btl_openib_allow_ib true 
> 
>
> - the testcase test_layout_array is running without error
>
> - the bandwidth measured with osu_bw is half of what it should be:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   0.54
> 2   1.13
> 4   2.26
> 8   4.51
> 16  9.06
> 32 17.93
> 64 33.87
> 128    69.29
> 256   161.24
> 512   333.82
> 1024  682.66
> 2048 1188.63
> 4096 1760.14
> 8192 2166.08
> 16384    2036.95
> 32768    3466.63
> 65536    6296.73
> 131072   7509.43
> 262144   9104.78
> 524288   6908.55
> 1048576  5530.37
> 2097152  4489.16
> 4194304  3498.14
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca btl_openib_allow_ib true 
> ...
>
> - the testcase test_layout_array is not giving correct results
>
> - the bandwidth measured with osu_bw is the right one:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   3.73
> 2   7.96
> 4  15.82
> 8  31.22
> 16 51.52
> 32    107.61
> 64    196.51
> 128   438.66
> 256   817.70
> 512  1593.90
> 1024 2786.09
> 2048 4459.77
> 4096 6658.70
> 8192 8092.95
> 16384    8664.43
> 32768    8495.96
> 65536   11458.77
> 131072  12094.64
> 262144  11781.84
> 524288  12297.58
> 1048576 12346.92
> 2097152 12206.53
> 4194304 12167.00
>
> But yes, I know openib is deprecated too in 4.0.5.
>
> Patrick
>
>


Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-26 Thread Patrick Begou via users
Hi Michael

Indeed I'm a little bit lost with all these parameters in OpenMPI,
mainly because for years it has worked just fine out of the box in all my
deployments on various architectures, interconnects and Linux flavors.
Some weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc 10, Slurm and
UCX on an AMD Epyc2 cluster with ConnectX-6, and it just works fine.  It
is the first time I've had such trouble deploying this library.

If you have my mail posted on 25/01/2021 in this discussion at 18h54
(maybe Paris TZ), there is a small test case attached that shows the
problem. Did you get it, or did the list strip the attachment? I can
provide it again.

Many thanks

Patrick

On 26/01/2021 at 19:25, Heinz, Michael William wrote:
> Patrick how are you using original PSM if you’re using Omni-Path
> hardware? The original PSM was written for QLogic DDR and QDR
> Infiniband adapters.
>
> As far as needing openib - the issue is that the PSM2 MTL doesn’t
> support a subset of MPI operations that we previously used the pt2pt
> BTL for. For recent versions of OMPI, the preferred BTL to use with
> PSM2 is OFI. 
>
> Is there any chance you can give us a sample MPI app that reproduces
> the problem? I can’t think of another way I can give you more help
> without being able to see what’s going on. It’s always possible
> there’s a bug in the PSM2 MTL but it would be surprising at this point.
>
> Sent from my iPad
>
>> On Jan 26, 2021, at 1:13 PM, Patrick Begou via users
>>  wrote:
>>
>> 
>> Hi all,
>>
>> I ran many tests today. I saw that an older 4.0.2 version of OpenMPI
>> packaged with Nix was running using openib, so I added the --with-verbs
>> option to set up this module.
>>
>> What I can see now is that:
>>
>> mpirun -hostfile $OAR_NODEFILE *--mca mtl psm -mca
>> btl_openib_allow_ib true* 
>>
>> - the testcase test_layout_array is running without error
>>
>> - the bandwidth measured with osu_bw is half of what it should be:
>>
>> # OSU MPI Bandwidth Test v5.7
>> # Size  Bandwidth (MB/s)
>> 1   0.54
>> 2   1.13
>> 4   2.26
>> 8   4.51
>> 16  9.06
>> 32 17.93
>> 64 33.87
>> 128    69.29
>> 256   161.24
>> 512   333.82
>> 1024  682.66
>> 2048 1188.63
>> 4096 1760.14
>> 8192 2166.08
>> 16384    2036.95
>> 32768    3466.63
>> 65536    6296.73
>> 131072   7509.43
>> 262144   9104.78
>> 524288   6908.55
>> 1048576  5530.37
>> 2097152  4489.16
>> 4194304  3498.14
>>
>> mpirun -hostfile $OAR_NODEFILE *--mca mtl psm2 -mca
>> btl_openib_allow_ib true* ...
>>
>> - the testcase test_layout_array is not giving correct results
>>
>> - the bandwidth measured with osu_bw is the right one:
>>
>> # OSU MPI Bandwidth Test v5.7
>> # Size  Bandwidth (MB/s)
>> 1   3.73
>> 2   7.96
>> 4  15.82
>> 8  31.22
>> 16 51.52
>> 32    107.61
>> 64    196.51
>> 128   438.66
>> 256   817.70
>> 512  1593.90
>> 1024 2786.09
>> 2048 4459.77
>> 4096 6658.70
>> 8192 8092.95
>> 16384    8664.43
>> 32768    8495.96
>> 65536   11458.77
>> 131072  12094.64
>> 262144  11781.84
>> 524288  12297.58
>> 1048576 12346.92
>> 2097152 12206.53
>> 4194304 12167.00
>>
>> But yes, I know openib is deprecated too in 4.0.5.
>>
>> Patrick
>>



Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-26 Thread Heinz, Michael William via users
Patrick how are you using original PSM if you’re using Omni-Path hardware? The 
original PSM was written for QLogic DDR and QDR Infiniband adapters.

As far as needing openib - the issue is that the PSM2 MTL doesn’t support a 
subset of MPI operations that we previously used the pt2pt BTL for. For recent 
versions of OMPI, the preferred BTL to use with PSM2 is OFI.
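
For concreteness, that combination might be spelled something like this on the
mpirun command line (component names from memory, so treat it as a sketch and
confirm with ompi_info on your build; ./your_app is just a placeholder):

  mpirun --mca pml cm --mca mtl psm2 --mca btl ofi,self,vader ./your_app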

Is there any chance you can give us a sample MPI app that reproduces the 
problem? I can’t think of another way I can give you more help without being 
able to see what’s going on. It’s always possible there’s a bug in the PSM2 MTL 
but it would be surprising at this point.

Sent from my iPad

On Jan 26, 2021, at 1:13 PM, Patrick Begou via users  
wrote:


Hi all,

I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged 
with Nix was running using openib, so I added the --with-verbs option to set up 
this module.

What I can see now is that:

mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca btl_openib_allow_ib true 

- the testcase test_layout_array is running without error

- the bandwidth measured with osu_bw is half of what it should be:

# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   0.54
2   1.13
4   2.26
8   4.51
16  9.06
32 17.93
64 33.87
128    69.29
256   161.24
512   333.82
1024  682.66
2048 1188.63
4096 1760.14
8192 2166.08
16384    2036.95
32768    3466.63
65536    6296.73
131072   7509.43
262144   9104.78
524288   6908.55
1048576  5530.37
2097152  4489.16
4194304  3498.14

mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca btl_openib_allow_ib true ...

- the testcase test_layout_array is not giving correct results

- the bandwidth measured with osu_bw is the right one:

# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   3.73
2   7.96
4  15.82
8  31.22
16 51.52
32    107.61
64    196.51
128   438.66
256   817.70
512  1593.90
1024 2786.09
2048 4459.77
4096 6658.70
8192 8092.95
16384    8664.43
32768    8495.96
65536   11458.77
131072  12094.64
262144  11781.84
524288  12297.58
1048576 12346.92
2097152 12206.53
4194304 12167.00

But yes, I know openib is deprecated too in 4.0.5.

Patrick


Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-26 Thread Patrick Begou via users
Hi all,

I ran many tests today. I saw that an older 4.0.2 version of OpenMPI
packaged with Nix was running using openib, so I added the --with-verbs
option to set up this module.

What I can see now is that:

mpirun -hostfile $OAR_NODEFILE *--mca mtl psm -mca btl_openib_allow_ib
true* 

- the testcase test_layout_array is running without error

- the bandwidth measured with osu_bw is half of what it should be:

# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   0.54
2   1.13
4   2.26
8   4.51
16  9.06
32 17.93
64 33.87
128    69.29
256   161.24
512   333.82
1024  682.66
2048 1188.63
4096 1760.14
8192 2166.08
16384    2036.95
32768    3466.63
65536    6296.73
131072   7509.43
262144   9104.78
524288   6908.55
1048576  5530.37
2097152  4489.16
4194304  3498.14

mpirun -hostfile $OAR_NODEFILE *--mca mtl psm2 -mca btl_openib_allow_ib
true* ...

- the testcase test_layout_array is not giving correct results

- the bandwidth measured with osu_bw is the right one:

# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   3.73
2   7.96
4  15.82
8  31.22
16 51.52
32    107.61
64    196.51
128   438.66
256   817.70
512  1593.90
1024 2786.09
2048 4459.77
4096 6658.70
8192 8092.95
16384    8664.43
32768    8495.96
65536   11458.77
131072  12094.64
262144  11781.84
524288  12297.58
1048576 12346.92
2097152 12206.53
4194304 12167.00
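
To double-check which PML/MTL each run actually ends up selecting, I suppose
the same commands could be rerun with the framework verbosity raised,
something like:

  mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 -mca btl_openib_allow_ib true -mca pml_base_verbose 99 -mca mtl_base_verbose 99 osu_bw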

But yes, I know openib is deprecated too in 4.0.5.

Patrick



Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-26 Thread Patrick Begou via users
m2
[dahu34:44665]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:315:
mtl:ofi:provider_include = "(null)"
[dahu34:44665]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:318:
mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[dahu34:44665]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:347:
mtl:ofi:prov: psm2
[dahu34:44663]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:315:
mtl:ofi:provider_include = "(null)"
[dahu34:44663]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:318:
mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[dahu34:44663]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:347:
mtl:ofi:prov: psm2
[dahu34:44664]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:315:
mtl:ofi:provider_include = "(null)"
[dahu34:44664]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:318:
mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[dahu34:44664]
../../../../../ompi/mca/mtl/ofi/mtl_ofi_component.c:347:
mtl:ofi:prov: psm2
[dahu34:44665] select: init returned success
[dahu34:44662] select: init returned success
[dahu34:44662] select: component ofi selected
[dahu34:44665] select: component ofi selected
[dahu34:44663] select: init returned success
[dahu34:44663] select: component ofi selected
[dahu34:44664] select: init returned success
[dahu34:44664] select: component ofi selected
On 1 found 1007 but expect 3007
On 2 found 1007 but expect 4007


but it fails too.

Patrick

On 25/01/2021 at 19:34, Ralph Castain via users wrote:
> I think you mean add "--mca mtl ofi" to the mpirun cmd line
>
>
>> On Jan 25, 2021, at 10:18 AM, Heinz, Michael William via users 
>>  wrote:
>>
>> What happens if you specify -mtl ofi ?
>>
>> -Original Message-
>> From: users  On Behalf Of Patrick Begou 
>> via users
>> Sent: Monday, January 25, 2021 12:54 PM
>> To: users@lists.open-mpi.org
>> Cc: Patrick Begou 
>> Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
>>
>> Hi Howard and Michael,
>>
>> Thanks for your feedback. I did not want to write a too-long mail with 
>> non-pertinent information, so I just showed how the two different builds 
>> give different results. I'm using a small test case based on my large code, 
>> the same one used to show the memory leak with mpi_Alltoallv calls, but 
>> running just 2 iterations. It is a 2D case, and data storage is moved from 
>> distributions "along the X axis" to "along the Y axis" with mpi_Alltoallv 
>> and subarray types. Data initialization is based on the location in the 
>> array, to allow checking for correct exchanges.
>>
>> When the program runs correctly (on 4 processes in my test) it should only 
>> show the max RSS size of the processes. When it fails, it shows the invalid 
>> locations. I've drastically reduced the size of the problem, with nx=5 and ny=7.
>>
>> Launching the non-working setup with more detail shows:
>>
>> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array 
>> [dahu138:115761] mca: base: components_register: registering framework mtl 
>> components [dahu138:115763] mca: base: components_register: registering 
>> framework mtl components [dahu138:115763] mca: base: components_register: 
>> found loaded component psm2 [dahu138:115763] mca: base: components_register: 
>> component psm2 register function successful [dahu138:115763] mca: base: 
>> components_open: opening mtl components [dahu138:115763] mca: base: 
>> components_open: found loaded component psm2 [dahu138:115761] mca: base: 
>> components_register: found loaded component psm2 [dahu138:115763] mca: base: 
>> components_open: component psm2 open function successful [dahu138:115761] 
>> mca: base: components_register: component psm2 register function successful 
>> [dahu138:115761] mca: base: components_open: opening mtl components 
>> [dahu138:115761] mca: base: components_open: found loaded component psm2 
>> [dahu138:115761] mca: base: components_open: component psm2 open function 
>> successful [dahu138:115760] mca: base: components_register: registering 
>> framework mtl components [dahu138:115760] mca: base: components_register: 
>> found loaded component psm2 [dahu138:115760] mca: base: components_register: 
>> component psm2 register function successful [dahu138:115760] mca: base: 
>> components_open: opening mtl components [dahu138:115760] mca: base: 
>> components_open: found loaded component psm2 [dahu138:115762] mca: base: 
>> components_register: registering framework mtl components [dahu138:1

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Heinz, Michael William via users
Patrick, is your application multi-threaded? PSM2 was not originally designed 
for multiple threads per process.

I do know that the OSU alltoallV test does pass when I try it.

Sent from my iPad

> On Jan 25, 2021, at 12:57 PM, Patrick Begou via users 
>  wrote:
> 
> Hi Howard and Michael,
> 
> Thanks for your feedback. I did not want to write a too-long mail with
> non-pertinent information, so I just showed how the two different builds
> give different results. I'm using a small test case based on my large
> code, the same one used to show the memory leak with mpi_Alltoallv calls,
> but running just 2 iterations. It is a 2D case, and data storage is moved
> from distributions "along the X axis" to "along the Y axis" with mpi_Alltoallv
> and subarray types. Data initialization is based on the location in
> the array, to allow checking for correct exchanges.
> 
> When the program runs correctly (on 4 processes in my test) it should only
> show the max RSS size of the processes. When it fails, it shows the invalid
> locations. I've drastically reduced the size of the problem, with nx=5
> and ny=7.
> 
> Launching the non-working setup with more detail shows:
> 
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
> [dahu138:115761] mca: base: components_register: registering framework
> mtl components
> [dahu138:115763] mca: base: components_register: registering framework
> mtl components
> [dahu138:115763] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115763] mca: base: components_open: opening mtl components
> [dahu138:115763] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115761] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115761] mca: base: components_open: opening mtl components
> [dahu138:115761] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115760] mca: base: components_register: registering framework
> mtl components
> [dahu138:115760] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115760] mca: base: components_open: opening mtl components
> [dahu138:115760] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_register: registering framework
> mtl components
> [dahu138:115762] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115762] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115762] mca: base: components_open: opening mtl components
> [dahu138:115762] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115760] mca:base:select: Auto-selecting mtl components
> [dahu138:115760] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115761] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115762] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115762] select: initializing mtl component psm2
> [dahu138:115761] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115761] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115761] select: initializing mtl component psm2
> [dahu138:115760] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115760] select: initializing mtl component psm2
> [dahu138:115763] mca:base:select: Auto-selecting mtl components
> [dahu138:115763] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115763] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115763] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115763] select: initializing mtl component psm2
> [dahu138:115761] select: init returned success
> [dahu138:115761] select: component psm2 selected
> [dahu138:115762] select: init returned success
> [dahu138:115762] select: component psm2 selected
> [dahu138:115763] select: init returned success
> [dahu138:115763] select: component psm2 selected
> [dahu138:115760] select: init returned success
> 

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Ralph Castain via users
I think you mean add "--mca mtl ofi" to the mpirun cmd line
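
Something along these lines, reusing the invocation quoted below (untested on
my side):

  mpirun -np 4 --mca mtl ofi -mca mtl_base_verbose 99 ./test_layout_array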


> On Jan 25, 2021, at 10:18 AM, Heinz, Michael William via users 
>  wrote:
> 
> What happens if you specify -mtl ofi ?
> 
> -Original Message-
> From: users  On Behalf Of Patrick Begou via 
> users
> Sent: Monday, January 25, 2021 12:54 PM
> To: users@lists.open-mpi.org
> Cc: Patrick Begou 
> Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
> 
> Hi Howard and Michael,
> 
> Thanks for your feedback. I did not want to write a too-long mail with 
> non-pertinent information, so I just showed how the two different builds give 
> different results. I'm using a small test case based on my large code, the 
> same one used to show the memory leak with mpi_Alltoallv calls, but running 
> just 2 iterations. It is a 2D case, and data storage is moved from distributions 
> "along the X axis" to "along the Y axis" with mpi_Alltoallv and subarray types. 
> Data initialization is based on the location in the array, to allow checking 
> for correct exchanges.
> 
> When the program runs correctly (on 4 processes in my test) it should only show 
> the max RSS size of the processes. When it fails, it shows the invalid locations. 
> I've drastically reduced the size of the problem, with nx=5 and ny=7.
> 
> Launching the non-working setup with more detail shows:
> 
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array 
> [dahu138:115761] mca: base: components_register: registering framework mtl 
> components [dahu138:115763] mca: base: components_register: registering 
> framework mtl components [dahu138:115763] mca: base: components_register: 
> found loaded component psm2 [dahu138:115763] mca: base: components_register: 
> component psm2 register function successful [dahu138:115763] mca: base: 
> components_open: opening mtl components [dahu138:115763] mca: base: 
> components_open: found loaded component psm2 [dahu138:115761] mca: base: 
> components_register: found loaded component psm2 [dahu138:115763] mca: base: 
> components_open: component psm2 open function successful [dahu138:115761] 
> mca: base: components_register: component psm2 register function successful 
> [dahu138:115761] mca: base: components_open: opening mtl components 
> [dahu138:115761] mca: base: components_open: found loaded component psm2 
> [dahu138:115761] mca: base: components_open: component psm2 open function 
> successful [dahu138:115760] mca: base: components_register: registering 
> framework mtl components [dahu138:115760] mca: base: components_register: 
> found loaded component psm2 [dahu138:115760] mca: base: components_register: 
> component psm2 register function successful [dahu138:115760] mca: base: 
> components_open: opening mtl components [dahu138:115760] mca: base: 
> components_open: found loaded component psm2 [dahu138:115762] mca: base: 
> components_register: registering framework mtl components [dahu138:115762] 
> mca: base: components_register: found loaded component psm2 [dahu138:115760] 
> mca: base: components_open: component psm2 open function successful 
> [dahu138:115762] mca: base: components_register: component psm2 register 
> function successful [dahu138:115762] mca: base: components_open: opening mtl 
> components [dahu138:115762] mca: base: components_open: found loaded 
> component psm2 [dahu138:115762] mca: base: components_open: component psm2 
> open function successful [dahu138:115760] mca:base:select: Auto-selecting mtl 
> components [dahu138:115760] mca:base:select:(  mtl) Querying component [psm2] 
> [dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set 
> priority to 40 [dahu138:115761] mca:base:select: Auto-selecting mtl 
> components [dahu138:115762] mca:base:select: Auto-selecting mtl components 
> [dahu138:115762] mca:base:select:(  mtl) Querying component [psm2] 
> [dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set 
> priority to 40 [dahu138:115762] mca:base:select:(  mtl) Selected component 
> [psm2] [dahu138:115762] select: initializing mtl component psm2 
> [dahu138:115761] mca:base:select:(  mtl) Querying component [psm2] 
> [dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set 
> priority to 40 [dahu138:115761] mca:base:select:(  mtl) Selected component 
> [psm2] [dahu138:115761] select: initializing mtl component psm2 
> [dahu138:115760] mca:base:select:(  mtl) Selected component [psm2] 
> [dahu138:115760] select: initializing mtl component psm2 [dahu138:115763] 
> mca:base:select: Auto-selecting mtl components [dahu138:115763] 
> mca:base:select:(  mtl) Querying component [psm2] [dahu138:115763] 
> mca:base:select:(  mtl) Query of component [psm2] set priority to 40 
> [dahu138:115763] mca:base:select:(  mtl) Sel

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Heinz, Michael William via users
What happens if you specify -mtl ofi ?

-Original Message-
From: users  On Behalf Of Patrick Begou via 
users
Sent: Monday, January 25, 2021 12:54 PM
To: users@lists.open-mpi.org
Cc: Patrick Begou 
Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

Hi Howard and Michael,

Thanks for your feedback. I did not want to write a too-long mail with 
non-pertinent information, so I just showed how the two different builds give 
different results. I'm using a small test case based on my large code, the same 
one used to show the memory leak with mpi_Alltoallv calls, but running just 2 
iterations. It is a 2D case, and data storage is moved from distributions "along 
the X axis" to "along the Y axis" with mpi_Alltoallv and subarray types. Data 
initialization is based on the location in the array, to allow checking for 
correct exchanges.

When the program runs correctly (on 4 processes in my test) it should only show 
the max RSS size of the processes. When it fails, it shows the invalid locations. 
I've drastically reduced the size of the problem, with nx=5 and ny=7.

Launching the non-working setup with more detail shows:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array 
[dahu138:115761] mca: base: components_register: registering framework mtl 
components [dahu138:115763] mca: base: components_register: registering 
framework mtl components [dahu138:115763] mca: base: components_register: found 
loaded component psm2 [dahu138:115763] mca: base: components_register: 
component psm2 register function successful [dahu138:115763] mca: base: 
components_open: opening mtl components [dahu138:115763] mca: base: 
components_open: found loaded component psm2 [dahu138:115761] mca: base: 
components_register: found loaded component psm2 [dahu138:115763] mca: base: 
components_open: component psm2 open function successful [dahu138:115761] mca: 
base: components_register: component psm2 register function successful 
[dahu138:115761] mca: base: components_open: opening mtl components 
[dahu138:115761] mca: base: components_open: found loaded component psm2 
[dahu138:115761] mca: base: components_open: component psm2 open function 
successful [dahu138:115760] mca: base: components_register: registering 
framework mtl components [dahu138:115760] mca: base: components_register: found 
loaded component psm2 [dahu138:115760] mca: base: components_register: 
component psm2 register function successful [dahu138:115760] mca: base: 
components_open: opening mtl components [dahu138:115760] mca: base: 
components_open: found loaded component psm2 [dahu138:115762] mca: base: 
components_register: registering framework mtl components [dahu138:115762] mca: 
base: components_register: found loaded component psm2 [dahu138:115760] mca: 
base: components_open: component psm2 open function successful [dahu138:115762] 
mca: base: components_register: component psm2 register function successful 
[dahu138:115762] mca: base: components_open: opening mtl components 
[dahu138:115762] mca: base: components_open: found loaded component psm2 
[dahu138:115762] mca: base: components_open: component psm2 open function 
successful [dahu138:115760] mca:base:select: Auto-selecting mtl components 
[dahu138:115760] mca:base:select:(  mtl) Querying component [psm2] 
[dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set priority 
to 40 [dahu138:115761] mca:base:select: Auto-selecting mtl components 
[dahu138:115762] mca:base:select: Auto-selecting mtl components 
[dahu138:115762] mca:base:select:(  mtl) Querying component [psm2] 
[dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set priority 
to 40 [dahu138:115762] mca:base:select:(  mtl) Selected component [psm2] 
[dahu138:115762] select: initializing mtl component psm2 [dahu138:115761] 
mca:base:select:(  mtl) Querying component [psm2] [dahu138:115761] 
mca:base:select:(  mtl) Query of component [psm2] set priority to 40 
[dahu138:115761] mca:base:select:(  mtl) Selected component [psm2] 
[dahu138:115761] select: initializing mtl component psm2 [dahu138:115760] 
mca:base:select:(  mtl) Selected component [psm2] [dahu138:115760] select: 
initializing mtl component psm2 [dahu138:115763] mca:base:select: 
Auto-selecting mtl components [dahu138:115763] mca:base:select:(  mtl) Querying 
component [psm2] [dahu138:115763] mca:base:select:(  mtl) Query of component 
[psm2] set priority to 40 [dahu138:115763] mca:base:select:(  mtl) Selected 
component [psm2] [dahu138:115763] select: initializing mtl component psm2 
[dahu138:115761] select: init returned success [dahu138:115761] select: 
component psm2 selected [dahu138:115762] select: init returned success 
[dahu138:115762] select: component psm2 selected [dahu138:115763] select: init 
returned success [dahu138:115763] select: component psm2 selected 
[dahu138:115760] select: init returned success [dahu138:115760] select: 
component psm2 selected On 1 found 1007 but expect 3007 On 2 found 1

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Patrick Begou via users
Hi Howard and Michael,

Thanks for your feedback. I did not want to write a too-long mail with
non-pertinent information, so I just showed how the two different builds
give different results. I'm using a small test case based on my large
code, the same one used to show the memory leak with mpi_Alltoallv calls,
but running just 2 iterations. It is a 2D case, and data storage is moved
from distributions "along the X axis" to "along the Y axis" with mpi_Alltoallv
and subarray types. Data initialization is based on the location in
the array, to allow checking for correct exchanges.

When the program runs correctly (on 4 processes in my test) it should only
show the max RSS size of the processes. When it fails, it shows the invalid
locations. I've drastically reduced the size of the problem, with nx=5
and ny=7.
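
For what it's worth, here is a minimal sketch in C (not my actual test case,
which is larger and uses subarray datatypes) of the same idea of
location-based validation across an all-to-all exchange: each value encodes
which rank it came from and where it should land, so a wrong value immediately
identifies a bad exchange.

/* minimal location-based MPI_Alltoallv check (illustration only) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    int *counts  = malloc(size * sizeof(int));
    int *displs  = malloc(size * sizeof(int));

    for (int i = 0; i < size; i++) {
        sendbuf[i] = rank * 1000 + i;   /* encodes sender rank and destination */
        counts[i]  = 1;
        displs[i]  = i;
    }

    MPI_Alltoallv(sendbuf, counts, displs, MPI_INT,
                  recvbuf, counts, displs, MPI_INT, MPI_COMM_WORLD);

    /* element i received on this rank must have come from rank i */
    for (int i = 0; i < size; i++) {
        int expected = i * 1000 + rank;
        if (recvbuf[i] != expected)
            printf("On %d found %d but expect %d\n", rank, recvbuf[i], expected);
    }

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}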

Launching the non-working setup with more detail shows:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:115761] mca: base: components_register: registering framework
mtl components
[dahu138:115763] mca: base: components_register: registering framework
mtl components
[dahu138:115763] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_register: component psm2 register
function successful
[dahu138:115763] mca: base: components_open: opening mtl components
[dahu138:115763] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_open: component psm2 open
function successful
[dahu138:115761] mca: base: components_register: component psm2 register
function successful
[dahu138:115761] mca: base: components_open: opening mtl components
[dahu138:115761] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_open: component psm2 open
function successful
[dahu138:115760] mca: base: components_register: registering framework
mtl components
[dahu138:115760] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_register: component psm2 register
function successful
[dahu138:115760] mca: base: components_open: opening mtl components
[dahu138:115760] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_register: registering framework
mtl components
[dahu138:115762] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_open: component psm2 open
function successful
[dahu138:115762] mca: base: components_register: component psm2 register
function successful
[dahu138:115762] mca: base: components_open: opening mtl components
[dahu138:115762] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_open: component psm2 open
function successful
[dahu138:115760] mca:base:select: Auto-selecting mtl components
[dahu138:115760] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115761] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115762] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115762] select: initializing mtl component psm2
[dahu138:115761] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115761] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115761] select: initializing mtl component psm2
[dahu138:115760] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115760] select: initializing mtl component psm2
[dahu138:115763] mca:base:select: Auto-selecting mtl components
[dahu138:115763] mca:base:select:(  mtl) Querying component [psm2]
[dahu138:115763] mca:base:select:(  mtl) Query of component [psm2] set
priority to 40
[dahu138:115763] mca:base:select:(  mtl) Selected component [psm2]
[dahu138:115763] select: initializing mtl component psm2
[dahu138:115761] select: init returned success
[dahu138:115761] select: component psm2 selected
[dahu138:115762] select: init returned success
[dahu138:115762] select: component psm2 selected
[dahu138:115763] select: init returned success
[dahu138:115763] select: component psm2 selected
[dahu138:115760] select: init returned success
[dahu138:115760] select: component psm2 selected
On 1 found 1007 but expect 3007
On 2 found 1007 but expect 4007

and with this setup the code freezes at this problem size.


Below is the same code with my no-ib setup of openMPI on the same node:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:116723] mca: base: components_register: registering framework
mtl components
[dahu138:116723] mca: base: 

[OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Heinz, Michael William via users
Patrick,

You really have to provide us some detailed information if you want assistance. 
At a minimum we need to know if you're using the PSM2 MTL or the OFI MTL and 
what the actual error is.

Please provide the actual command line you are having problems with, along with 
any errors. In addition, I recommend adding the following to your command line:

-mca mtl_base_verbose 99

If you have a way to reproduce the problem quickly you might also want to add:

-x PSM2_TRACEMASK=11

But that will add very detailed debug output to your command and you haven't 
mentioned that PSM2 is failing, so it may not be useful.
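
Put together, a debug run might look something like this (./your_app is just a
placeholder for your binary):

  mpirun -np 4 -mca mtl_base_verbose 99 -x PSM2_TRACEMASK=11 ./your_app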


[OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Patrick Begou via users
Hi,

I'm trying to deploy OpenMPI 4.0.5 on the university's supercomputer:

  * Debian GNU/Linux 9 (stretch)
  * Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)

and for several days I have had a bug (wrong results using MPI_AllToAllW) on
this server when using Omni-Path.

Running 4 threads on a single node, using OpenMPI 4.0.5 built without
Omni-Path support, the code works:

CC=$(which gcc) CXX=$(which g++) FC=$(which gfortran) ../configure
--with-hwloc --enable-mpirun-prefix-by-default \
--prefix=/bettik/begou/OpenMPI405-noib --enable-mpi1-compatibility \
--enable-mpi-cxx --enable-cxx-exceptions --without-verbs --without-ofi
--without-psm --without-psm2 --without-openib \
--without-slurm

If I use Omni-Path, still with 4 threads on one node, the test case does
not work (incorrect results):

CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
-mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
CC=$(which gcc) CXX=$(which g++) FC=$(which gfortran) ../configure
--with-hwloc --enable-mpirun-prefix-by-default \
--prefix=/bettik/begou/OpenMPI405 --enable-mpi1-compatibility \
--enable-mpi-cxx --enable-cxx-exceptions --without-verbs
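
For reference, what each build actually contains can be listed with ompi_info
(the path follows the prefix above):

  /bettik/begou/OpenMPI405/bin/ompi_info | grep -E 'mtl|btl|pml'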

I do not understand what could be wrong, as the code runs on many
architectures with various interconnects and OpenMPI versions.

Thanks for your suggestions.

Patrick