Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Heinz, Michael William via users
Patrick,

Do you have any PSM2_* or HFI_* environment variables defined in your run time 
environment that could be affecting things?
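For example, something like the following quick check (a generic command, not one
taken from this thread) run on the compute nodes would show whether any such
variables are set:

    env | grep -E '^(PSM2|HFI)_'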


-Original Message-
From: users  On Behalf Of Heinz, Michael 
William via users
Sent: Wednesday, January 27, 2021 3:37 PM
To: Open MPI Users 
Cc: Heinz, Michael William 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or by 
Cornelis Networks - but I should point out you can download the latest official 
source for PSM2 and the drivers from Github.

-Original Message-
From: users  On Behalf Of Michael Di Domenico 
via users
Sent: Wednesday, January 27, 2021 3:32 PM
To: Open MPI Users 
Cc: Michael Di Domenico 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

if you have OPA cards, for openmpi you only need --with-ofi, you don't need 
psm/psm2/verbs/ucx.  but this assumes you're running a rhel based distro and 
have installed the OPA fabric suite of software from Intel/CornelisNetworks, 
which is what i have.  perhaps there's something really odd in debian, or 
there's an incompatibility with the older ofed drivers possibly included with 
debian.  unfortunately i don't have access to a debian system, so i can't be 
much more help

if i had to guess, totally pulling junk from the air, there's probably an 
incompatibility between PSM and OPA when running specifically on debian (likely 
due to library versioning).  i don't know how common that setup is, so it's not 
clear how fleshed out and tested it is




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
 wrote:
>
> Hi Howard and Michael
>
> First, many thanks for testing with my short application. Yes, when the 
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it prints a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various clusters (in the DELL 
> data center at Austin) when I was testing some architectures a few months 
> ago, and I got no problem on either AMD/Mellanox_IB or Intel/Omni-path. 
> The goal was to run my tests with the same software stacks and to be sure 
> I could deploy my software stack on the selected solution.
> But like your clusters (and my small local clusters) they were all 
> running RedHat (or similar Linux flavors) and a modern GNU compiler (9 or 10).
> The university cluster I have access to is running Debian stretch and 
> provides GCC 6 as the default compiler.
>
> I cannot ask for a different OS, but I can deploy a local GCC 10 and 
> build OpenMPI again.  UCX is not available on this cluster; should I 
> deploy a local UCX too?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l |grep psm
> ii  libfabric-psm  1.10.0-2-1ifs+deb9    amd64 Dynamic PSM
> provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2 1.10.0-2-1ifs+deb9    amd64 Dynamic PSM2
> provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
> library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development 
> files for libpsm-infinipath1
> ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
> Libraries
> ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
> library for Intel PSM2
> ii  libpsm2-dev11.2.185-1-1ifs+deb9  amd64 Development
> files for Intel PSM2
> ii  psmisc 22.21-2.1+b2  amd64 utilities
> that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:   hfi1_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 1.27.0
> >   node_guid:  0011:7501:0179:e2d7
> >   sys_image_guid: 0011:7501:0179:e2d7
> >   vendor_id:  0x1175
> >   vendor_part_id: 9456
> >   hw_ver: 0x11
> >   board_id:   Intel Omni-Path Host Fabric Interface 
> > Adapter 100 Series
> >   phys_port_cnt:  1
> >   port:   1
> >   state:  PORT_ACTIVE (4)
> >   max_mtu:4096 (5)
> >   active_mtu: 4096 (5)
> >   sm_lid: 1
> >   port_lid:   99
> >   port_lmc:   0x00
> >   link_layer: InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Heinz, Michael William via users
Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or by 
Cornelis Networks - but I should point out you can download the latest official 
source for PSM2 and the drivers from Github.

-Original Message-
From: users  On Behalf Of Michael Di Domenico 
via users
Sent: Wednesday, January 27, 2021 3:32 PM
To: Open MPI Users 
Cc: Michael Di Domenico 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

if you have OPA cards, for openmpi you only need --with-ofi, you don't need 
psm/psm2/verbs/ucx.  but this assumes you're running a rhel based distro and 
have installed the OPA fabric suite of software from Intel/CornelisNetworks, 
which is what i have.  perhaps there's something really odd in debian, or 
there's an incompatibility with the older ofed drivers possibly included with 
debian.  unfortunately i don't have access to a debian system, so i can't be 
much more help

if i had to guess, totally pulling junk from the air, there's probably an 
incompatibility between PSM and OPA when running specifically on debian (likely 
due to library versioning).  i don't know how common that setup is, so it's not 
clear how fleshed out and tested it is




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
 wrote:
>
> Hi Howard and Michael
>
> First, many thanks for testing with my short application. Yes, when the 
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it prints a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various clusters (in the DELL 
> data center at Austin) when I was testing some architectures a few months 
> ago, and I got no problem on either AMD/Mellanox_IB or Intel/Omni-path. 
> The goal was to run my tests with the same software stacks and to be sure 
> I could deploy my software stack on the selected solution.
> But like your clusters (and my small local clusters) they were all 
> running RedHat (or similar Linux flavors) and a modern GNU compiler (9 or 10).
> The university cluster I have access to is running Debian stretch and 
> provides GCC 6 as the default compiler.
>
> I cannot ask for a different OS, but I can deploy a local GCC 10 and 
> build OpenMPI again.  UCX is not available on this cluster; should I 
> deploy a local UCX too?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l |grep psm
> ii  libfabric-psm  1.10.0-2-1ifs+deb9    amd64 Dynamic PSM
> provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2 1.10.0-2-1ifs+deb9    amd64 Dynamic PSM2
> provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
> library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development 
> files for libpsm-infinipath1
> ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
> Libraries
> ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
> library for Intel PSM2
> ii  libpsm2-dev11.2.185-1-1ifs+deb9  amd64 Development
> files for Intel PSM2
> ii  psmisc 22.21-2.1+b2  amd64 utilities
> that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:   hfi1_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 1.27.0
> >   node_guid:  0011:7501:0179:e2d7
> >   sys_image_guid: 0011:7501:0179:e2d7
> >   vendor_id:  0x1175
> >   vendor_part_id: 9456
> >   hw_ver: 0x11
> >   board_id:   Intel Omni-Path Host Fabric Interface 
> > Adapter 100 Series
> >   phys_port_cnt:  1
> >   port:   1
> >   state:  PORT_ACTIVE (4)
> >   max_mtu:4096 (5)
> >   active_mtu: 4096 (5)
> >   sm_lid: 1
> >   port_lid:   99
> >   port_lmc:   0x00
> >   link_layer: InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.
> >
> > Howard
> >
> > On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
> >  
> > wrote:
> >
> > for whatever it's worth running the test program on my OPA cluster
> > seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
> > sure if it's supposed to stop at some point
> >
> > i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, without-{psm,ucx,verbs}

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Michael Di Domenico via users
if you have OPA cards, for openmpi you only need --with-ofi, you don't
need psm/psm2/verbs/ucx.  but this assumes you're running a rhel based
distro and have installed the OPA fabric suite of software from
Intel/CornelisNetworks, which is what i have.  perhaps there's
something really odd in debian, or there's an incompatibility with the
older ofed drivers possibly included with debian.  unfortunately i
don't have access to a debian system, so i can't be much more help

if i had to guess, totally pulling junk from the air, there's probably
an incompatibility between PSM and OPA when running specifically on
debian (likely due to library versioning).  i don't know how common
that setup is, so it's not clear how fleshed out and tested it is
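As a rough illustration of that recipe (a sketch only; the exact configure line
is an assumption, not something quoted in this thread), an OFI-only Open MPI
4.0.5 build on an OPA node could look like:

    ./configure --prefix=$HOME/openmpi-4.0.5 --with-ofi \
        --without-ucx --without-verbs --without-psm --without-psm2
    make -j 8 && make install

Afterwards, running ompi_info | grep -E 'mtl|btl' is a quick way to confirm
which transport components were actually built.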




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users
 wrote:
>
> Hi Howard and Michael
>
> First, many thanks for testing with my short application. Yes, when the
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it prints a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various clusters (in the DELL data
> center at Austin) when I was testing some architectures a few months ago,
> and I got no problem on either AMD/Mellanox_IB or Intel/Omni-path. The
> goal was to run my tests with the same software stacks and to be sure I
> could deploy my software stack on the selected solution.
> But like your clusters (and my small local clusters) they were all running
> RedHat (or similar Linux flavors) and a modern GNU compiler (9 or 10).
> The university cluster I have access to is running Debian stretch and
> provides GCC 6 as the default compiler.
>
> I cannot ask for a different OS, but I can deploy a local GCC 10 and
> build OpenMPI again.  UCX is not available on this cluster; should I
> deploy a local UCX too?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l |grep psm
> ii  libfabric-psm  1.10.0-2-1ifs+deb9    amd64 Dynamic PSM
> provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2 1.10.0-2-1ifs+deb9    amd64 Dynamic PSM2
> provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
> library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development
> files for libpsm-infinipath1
> ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
> Libraries
> ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
> library for Intel PSM2
> ii  libpsm2-dev11.2.185-1-1ifs+deb9  amd64 Development
> files for Intel PSM2
> ii  psmisc 22.21-2.1+b2  amd64 utilities
> that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:   hfi1_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 1.27.0
> >   node_guid:  0011:7501:0179:e2d7
> >   sys_image_guid: 0011:7501:0179:e2d7
> >   vendor_id:  0x1175
> >   vendor_part_id: 9456
> >   hw_ver: 0x11
> >   board_id:   Intel Omni-Path Host Fabric Interface 
> > Adapter 100 Series
> >   phys_port_cnt:  1
> >   port:   1
> >   state:  PORT_ACTIVE (4)
> >   max_mtu:4096 (5)
> >   active_mtu: 4096 (5)
> >   sm_lid: 1
> >   port_lid:   99
> >   port_lmc:   0x00
> >   link_layer: InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.
> >
> > Howard
> >
> > On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
> >  
> > wrote:
> >
> > for whatever it's worth running the test program on my OPA cluster
> > seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
> > sure if it's supposed to stop at some point
> >
> > i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, 
> > without-{psm,ucx,verbs}
> >
> > On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
> >  wrote:
> > >
> > > Hi Michael
> > >
> > > indeed I'm a little bit lost with all these parameters in OpenMPI, 
> > mainly because for years it has worked just fine out of the box in all my 
> > deployments on various architectures, interconnects and Linux flavors. Some 
> > weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, slurm and UCX on 
> > an AMD epyc2 cluster with ConnectX-6, and it just works fine.

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Patrick Begou via users
Hi Howard and Michael

First, many thanks for testing with my short application. Yes, when the
test code runs fine it just shows the max RSS size of the rank 0 process.
When it runs wrong it prints a message about each invalid value found.

As I said, I have also deployed OpenMPI on various clusters (in the DELL data
center at Austin) when I was testing some architectures a few months ago,
and I got no problem on either AMD/Mellanox_IB or Intel/Omni-path. The
goal was to run my tests with the same software stacks and to be sure I
could deploy my software stack on the selected solution.
But like your clusters (and my small local clusters) they were all running
RedHat (or similar Linux flavors) and a modern GNU compiler (9 or 10).
The university cluster I have access to is running Debian stretch and
provides GCC 6 as the default compiler.

I cannot ask for a different OS, but I can deploy a local GCC 10 and
build OpenMPI again.  UCX is not available on this cluster; should I
deploy a local UCX too?
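A minimal sketch of what that local rebuild could look like, assuming GCC 10 is
installed under a hypothetical $HOME/gcc-10 prefix (paths and versions here are
illustrative, not taken from this thread):

    export PATH=$HOME/gcc-10/bin:$PATH
    ./configure CC=gcc CXX=g++ FC=gfortran \
        --prefix=$HOME/openmpi-4.0.5 --with-ofi
    make -j 8 && make install

At run time the matching GCC 10 support libraries (libstdc++, libgfortran) would
also need to be on LD_LIBRARY_PATH.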

Libpsm2 seems good:
dahu103 : dpkg -l |grep psm
ii  libfabric-psm  1.10.0-2-1ifs+deb9    amd64 Dynamic PSM
provider for user-space Open Fabric Interfaces
ii  libfabric-psm2 1.10.0-2-1ifs+deb9    amd64 Dynamic PSM2
provider for user-space Open Fabric Interfaces
ii  libpsm-infinipath1 3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging
library for Intel Truescale adapters
ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development
files for libpsm-infinipath1
ii  libpsm2-2  11.2.185-1-1ifs+deb9  amd64 Intel PSM2
Libraries
ii  libpsm2-2-compat   11.2.185-1-1ifs+deb9  amd64 Compat
library for Intel PSM2
ii  libpsm2-dev    11.2.185-1-1ifs+deb9  amd64 Development
files for Intel PSM2
ii  psmisc 22.21-2.1+b2  amd64 utilities
that use the proc file system

This will be my next try to install OpenMPI on this cluster.

Patrick


On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> Hi Folks,
>
> I'm also having problems reproducing this on one of our OPA clusters:
>
> libpsm2-11.2.78-1.el7.x86_64
> libpsm2-devel-11.2.78-1.el7.x86_64
>
> cluster runs RHEL 7.8
>
> hca_id:   hfi1_0
>   transport:  InfiniBand (0)
>   fw_ver: 1.27.0
>   node_guid:  0011:7501:0179:e2d7
>   sys_image_guid: 0011:7501:0179:e2d7
>   vendor_id:  0x1175
>   vendor_part_id: 9456
>   hw_ver: 0x11
>   board_id:   Intel Omni-Path Host Fabric Interface 
> Adapter 100 Series
>   phys_port_cnt:  1
>   port:   1
>   state:  PORT_ACTIVE (4)
>   max_mtu:4096 (5)
>   active_mtu: 4096 (5)
>   sm_lid: 1
>   port_lid:   99
>   port_lmc:   0x00
>   link_layer: InfiniBand
>
> using gcc/gfortran 9.3.0
>
> Built Open MPI 4.0.5 without any special configure options.
>
> Howard
>
> On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
>  
> wrote:
>
> for whatever it's worth running the test program on my OPA cluster
> seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
> sure if it's supposed to stop at some point
> 
> i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, 
> without-{psm,ucx,verbs}
> 
> On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
>  wrote:
> >
> > Hi Michael
> >
> > indeed I'm a little bit lost with all these parameters in OpenMPI, 
> mainly because for years it has worked just fine out of the box in all my 
> deployments on various architectures, interconnects and Linux flavors. Some 
> weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, slurm and UCX on an 
> AMD epyc2 cluster with ConnectX-6, and it just works fine.  It is the first 
> time I've had such trouble deploying this library.
> >
> > If you have my mail posted on 25/01/2021 in this discussion at 18h54 
> (maybe Paris TZ), there is a small test case attached that shows the problem. 
> Did you get it, or did the list strip the attachments? I can provide it 
> again.
> >
> > Many thanks
> >
> > Patrick
> >
> > On 26/01/2021 at 19:25, Heinz, Michael William wrote:
> >
> > Patrick how are you using original PSM if you’re using Omni-Path 
> hardware? The original PSM was written for QLogic DDR and QDR Infiniband 
> adapters.
> >
> > As far as needing openib - the issue is that the PSM2 MTL doesn’t 
> support a subset of MPI operations that we previously used the pt2pt BTL for. 
> For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
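As a rough illustration of that suggestion (the component names below are common
ones for Open MPI 4.0.x but are an assumption to verify with ompi_info, not a
command quoted in this thread), steering a run toward the PSM2 MTL with the OFI
BTL instead of openib might look like:

    mpirun -hostfile $OAR_NODEFILE --mca pml cm --mca mtl psm2 \
        --mca btl ofi,self,vader ./test_layout_array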
> >
> > Is there any chance you can give us a sample MPI app that reproduces the problem?

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Pritchard Jr., Howard via users
Hi Folks,

I'm also having problems reproducing this on one of our OPA clusters:

libpsm2-11.2.78-1.el7.x86_64
libpsm2-devel-11.2.78-1.el7.x86_64

cluster runs RHEL 7.8

hca_id: hfi1_0
transport:  InfiniBand (0)
fw_ver: 1.27.0
node_guid:  0011:7501:0179:e2d7
sys_image_guid: 0011:7501:0179:e2d7
vendor_id:  0x1175
vendor_part_id: 9456
hw_ver: 0x11
board_id:   Intel Omni-Path Host Fabric Interface 
Adapter 100 Series
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   99
port_lmc:   0x00
link_layer: InfiniBand

using gcc/gfortran 9.3.0

Built Open MPI 4.0.5 without any special configure options.

Howard

On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
 wrote:

for whatever it's worth running the test program on my OPA cluster
seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
sure if it's supposed to stop at some point

i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, 
without-{psm,ucx,verbs}

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
 wrote:
>
> Hi Michael
>
> indeed I'm a little bit lost with all these parameters in OpenMPI, mainly 
because for years it has worked just fine out of the box in all my deployments on 
various architectures, interconnects and Linux flavors. Some weeks ago I deployed 
OpenMPI 4.0.5 on CentOS 8 with gcc10, slurm and UCX on an AMD epyc2 cluster with 
ConnectX-6, and it just works fine.  It is the first time I've had such trouble 
deploying this library.
>
> If you have my mail posted on 25/01/2021 in this discussion at 18h54 
(maybe Paris TZ), there is a small test case attached that shows the problem. 
Did you get it, or did the list strip the attachments? I can provide it again.
>
> Many thanks
>
> Patrick
>
> On 26/01/2021 at 19:25, Heinz, Michael William wrote:
>
> Patrick how are you using original PSM if you’re using Omni-Path 
hardware? The original PSM was written for QLogic DDR and QDR Infiniband 
adapters.
>
> As far as needing openib - the issue is that the PSM2 MTL doesn’t support 
a subset of MPI operations that we previously used the pt2pt BTL for. For 
recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
>
> Is there any chance you can give us a sample MPI app that reproduces the 
problem? I can’t think of another way I can give you more help without being 
able to see what’s going on. It’s always possible there’s a bug in the PSM2 MTL 
but it would be surprising at this point.
>
> Sent from my iPad
>
> On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
 wrote:
>
> 
> Hi all,
>
> I ran many tests today. I saw that an older 4.0.2 version of OpenMPI 
packaged with Nix was running using openib. So I added the --with-verbs option to 
set up this module.
>
> What I can see now is that:
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca btl_openib_allow_ib 
true 
>
> - the testcase test_layout_array is running without error
>
> - the bandwidth measured with osu_bw is half of what it should be:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   0.54
> 2   1.13
> 4   2.26
> 8   4.51
> 16  9.06
> 32 17.93
> 64 33.87
> 128    69.29
> 256   161.24
> 512   333.82
> 1024  682.66
> 2048 1188.63
> 4096 1760.14
> 8192 2166.08
> 16384   2036.95
> 32768   3466.63
> 65536   6296.73
> 131072   7509.43
> 262144   9104.78
> 524288   6908.55
> 1048576  5530.37
> 2097152  4489.16
> 4194304  3498.14
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca btl_openib_allow_ib 
true ...
>
> - the testcase test_layout_array is not giving correct results
>
> - the bandwidth measured with osu_bw is the right one:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   3.73
> 2   7.96
> 4 

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Michael Di Domenico via users
for whatever it's worth running the test program on my OPA cluster
seems to work.  well it keeps spitting out [INFO MEMORY] lines, not
sure if it's supposed to stop at some point

i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, without-{psm,ucx,verbs}

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
 wrote:
>
> Hi Michael
>
> indeed I'm a little bit lost with all these parameters in OpenMPI, mainly 
> because for years it has worked just fine out of the box in all my deployments on 
> various architectures, interconnects and Linux flavors. Some weeks ago I 
> deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, slurm and UCX on an AMD epyc2 
> cluster with ConnectX-6, and it just works fine.  It is the first time I've 
> had such trouble deploying this library.
>
> If you have my mail posted on 25/01/2021 in this discussion at 18h54 (maybe 
> Paris TZ), there is a small test case attached that shows the problem. Did 
> you get it, or did the list strip the attachments? I can provide it again.
>
> Many thanks
>
> Patrick
>
> On 26/01/2021 at 19:25, Heinz, Michael William wrote:
>
> Patrick how are you using original PSM if you’re using Omni-Path hardware? 
> The original PSM was written for QLogic DDR and QDR Infiniband adapters.
>
> As far as needing openib - the issue is that the PSM2 MTL doesn’t support a 
> subset of MPI operations that we previously used the pt2pt BTL for. For 
> recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
>
> Is there any chance you can give us a sample MPI app that reproduces the 
> problem? I can’t think of another way I can give you more help without being 
> able to see what’s going on. It’s always possible there’s a bug in the PSM2 
> MTL but it would be surprising at this point.
>
> Sent from my iPad
>
> On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
>  wrote:
>
> 
> Hi all,
>
> I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged 
> with Nix was running using openib. So I added the --with-verbs option to set up 
> this module.
>
> What I can see now is that:
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca btl_openib_allow_ib true 
> 
>
> - the testcase test_layout_array is running without error
>
> - the bandwidth measured with osu_bw is half of what it should be:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   0.54
> 2   1.13
> 4   2.26
> 8   4.51
> 16  9.06
> 32 17.93
> 64 33.87
> 128    69.29
> 256   161.24
> 512   333.82
> 1024  682.66
> 2048 1188.63
> 4096 1760.14
> 8192 2166.08
> 16384   2036.95
> 32768   3466.63
> 65536   6296.73
> 131072   7509.43
> 262144   9104.78
> 524288   6908.55
> 1048576  5530.37
> 2097152  4489.16
> 4194304  3498.14
>
> mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca btl_openib_allow_ib true 
> ...
>
> - the testcase test_layout_array is not giving correct results
>
> - the bandwidth measured with osu_bw is the right one:
>
> # OSU MPI Bandwidth Test v5.7
> # Size  Bandwidth (MB/s)
> 1   3.73
> 2   7.96
> 4  15.82
> 8  31.22
> 16 51.52
> 32107.61
> 64196.51
> 128   438.66
> 256   817.70
> 512  1593.90
> 1024 2786.09
> 2048 4459.77
> 4096 6658.70
> 8192 8092.95
> 16384   8664.43
> 32768   8495.96
> 65536   11458.77
> 131072  12094.64
> 262144  11781.84
> 524288  12297.58
> 1048576 12346.92
> 2097152 12206.53
> 4194304 12167.00
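For reference, osu_bw runs with exactly two ranks, so a typical two-node
invocation (illustrative only; the exact command used here is not shown in the
thread) would be something like:

    mpirun -np 2 -hostfile $OAR_NODEFILE --map-by node ./osu_bw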
>
> But yes, I know openib is deprecated too in 4.0.5.
>
> Patrick
>
>