If you have OPA cards, then for Open MPI you only need --with-ofi; you don't
need psm/psm2/verbs/ucx.  But this assumes you're running a RHEL-based
distro and have installed the OPA fabric suite of software from
Intel/CornelisNetworks, which is what I have.  Perhaps there's
something really odd in Debian, or an incompatibility with the older
OFED drivers that may ship with Debian.  Unfortunately I don't have
access to a Debian system, so I can't be much more help.
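
For what it's worth, the kind of build I mean looks roughly like this (just a
sketch; the install prefix is a placeholder, and the --without flags only make
the choice explicit):

  # build against libfabric (OFI) only, skipping the other fabric stacks
  ./configure --prefix=$HOME/openmpi-4.0.5 --with-ofi \
      --without-psm --without-psm2 --without-ucx --without-verbs
  make -j && make install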

If I had to guess (totally pulling junk from the air), there's probably
something incompatible between PSM and OPA when running specifically on
Debian (likely due to library versioning).  I don't know how common that
setup is, so it's not clear how fleshed out and tested it is.
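
If it is a versioning issue, a purely illustrative sanity check would be to
confirm that libfabric's psm2 provider shows up at all and to see which
libpsm2 it resolves to (the library path below is a guess for a Debian
multiarch layout):

  # does libfabric report a usable psm2 provider?
  fi_info -p psm2
  # which libpsm2 does the libfabric shared object pull in?
  ldd /usr/lib/x86_64-linux-gnu/libfabric.so | grep -i psm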




On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users
<users@lists.open-mpi.org> wrote:
>
> Hi Howard and Michael
>
> first many thanks for testing with my short application. Yes, when the
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong it prints a message about each invalid value found.
>
> As I said, I have also deployed OpenMPI on various clusters (in the DELL
> data center at Austin) when I was testing some architectures a few months
> ago, and neither on AMD/Mellanox_IB nor on Intel/Omni-Path did I get any
> problem. The goal was to run my tests with the same software stack and to
> be sure I could deploy my software stack on the selected solution.
> But like your clusters (and my small local clusters) they were all running
> RedHat (or a similar Linux flavor) and a modern GNU compiler (9 or 10).
> The university's cluster I have access to is running Debian stretch and
> provides GCC 6 as the default compiler.
>
> I cannot ask for a different OS, but I can deploy a local gcc10 and
> build OpenMPI again.  UCX is not available on this cluster; should I
> deploy a local UCX too?
>
> Libpsm2 seems good:
> dahu103 : dpkg -l | grep psm
> ii  libfabric-psm          1.10.0-2-1ifs+deb9        amd64 Dynamic PSM provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2         1.10.0-2-1ifs+deb9        amd64 Dynamic PSM2 provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1     3.3-19-g67c0807-2ifs+deb9 amd64 PSM Messaging library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev 3.3-19-g67c0807-2ifs+deb9 amd64 Development files for libpsm-infinipath1
> ii  libpsm2-2              11.2.185-1-1ifs+deb9      amd64 Intel PSM2 Libraries
> ii  libpsm2-2-compat       11.2.185-1-1ifs+deb9      amd64 Compat library for Intel PSM2
> ii  libpsm2-dev            11.2.185-1-1ifs+deb9      amd64 Development files for Intel PSM2
> ii  psmisc                 22.21-2.1+b2              amd64 utilities that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
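>
> For reference, the local build I have in mind looks roughly like this (only a
> sketch; the gcc10 install path and the prefix are placeholders):
>
>   # use the locally deployed gcc10 instead of the default GCC 6
>   export PATH=$HOME/apps/gcc-10/bin:$PATH
>   export LD_LIBRARY_PATH=$HOME/apps/gcc-10/lib64:$LD_LIBRARY_PATH
>   # build Open MPI 4.0.5 against the system libfabric/PSM2 stack
>   ./configure CC=gcc CXX=g++ FC=gfortran \
>       --prefix=$HOME/apps/openmpi-4.0.5 --with-ofi
>   make -j && make install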
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > cluster runs RHEL 7.8
> >
> > hca_id:       hfi1_0
> >       transport:                      InfiniBand (0)
> >       fw_ver:                         1.27.0
> >       node_guid:                      0011:7501:0179:e2d7
> >       sys_image_guid:                 0011:7501:0179:e2d7
> >       vendor_id:                      0x1175
> >       vendor_part_id:                 9456
> >       hw_ver:                         0x11
> >       board_id:                       Intel Omni-Path Host Fabric Interface Adapter 100 Series
> >       phys_port_cnt:                  1
> >               port:   1
> >                       state:                  PORT_ACTIVE (4)
> >                       max_mtu:                4096 (5)
> >                       active_mtu:             4096 (5)
> >                       sm_lid:                 1
> >                       port_lid:               99
> >                       port_lmc:               0x00
> >                       link_layer:             InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.
> >
> > Howard
> >
> > On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" 
> > <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> 
> > wrote:
> >
> >     for whatever it's worth, running the test program on my OPA cluster
> >     seems to work.  Well, it keeps spitting out [INFO MEMORY] lines; not
> >     sure if it's supposed to stop at some point.
> >
> >     I'm running RHEL 7, gcc 10.1, Open MPI 4.0.5rc2, --with-ofi,
> > --without-{psm,ucx,verbs}
> >
> >     On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
> >     <users@lists.open-mpi.org> wrote:
> >     >
> >     > Hi Michael
> >     >
> >     > indeed I'm a little bit lost with all these parameters in OpenMPI,
> > mainly because for years it has worked just fine out of the box in all my
> > deployments on various architectures, interconnects and Linux flavors. A few
> > weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, Slurm and UCX on
> > an AMD Epyc2 cluster with ConnectX-6, and it just works fine.  It is the
> > first time I've had such trouble deploying this library.
> >     >
> >     > If you got my mail posted on 25/01/2021 in this discussion at 18h54
> > (maybe Paris TZ), there is a small test case attached that shows the
> > problem. Did you get it, or did the list strip the attachment? I can
> > provide it again.
> >     >
> >     > Many thanks
> >     >
> >     > Patrick
> >     >
> >     > On 26/01/2021 at 19:25, Heinz, Michael William wrote:
> >     >
> >     > Patrick, how are you using the original PSM if you're using Omni-Path
> > hardware? The original PSM was written for QLogic DDR and QDR InfiniBand
> > adapters.
> >     >
> >     > As far as needing openib - the issue is that the PSM2 MTL doesn't
> > support a subset of MPI operations that we previously used the pt2pt BTL
> > for. For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
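> >     >
> >     > For illustration only (not a definitive recipe), explicitly selecting
> > the PSM2 MTL together with the OFI BTL would look something like:
> >     >
> >     > # sketch only; ./your_app stands in for the real binary
> >     > mpirun --mca pml cm --mca mtl psm2 --mca btl ofi,self ./your_app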
> >     >
> >     > Is there any chance you can give us a sample MPI app that reproduces 
> > the problem? I can’t think of another way I can give you more help without 
> > being able to see what’s going on. It’s always possible there’s a bug in 
> > the PSM2 MTL but it would be surprising at this point.
> >     >
> >     > Sent from my iPad
> >     >
> >     > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users 
> > <users@lists.open-mpi.org> wrote:
> >     >
> >     >
> >     > Hi all,
> >     >
> >     > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI
> > packaged with Nix was running using openib, so I added the --with-verbs
> > option to set up this module.
> >     >
> >     > What I can see now is that:
> >     >
> >     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm -mca 
> > btl_openib_allow_ib true ....
> >     >
> >     > - the testcase test_layout_array is running without error
> >     >
> >     > - the bandwidth measured with osu_bw is half of what it should be:
> >     >
> >     > # OSU MPI Bandwidth Test v5.7
> >     > # Size      Bandwidth (MB/s)
> >     > 1                       0.54
> >     > 2                       1.13
> >     > 4                       2.26
> >     > 8                       4.51
> >     > 16                      9.06
> >     > 32                     17.93
> >     > 64                     33.87
> >     > 128                    69.29
> >     > 256                   161.24
> >     > 512                   333.82
> >     > 1024                  682.66
> >     > 2048                 1188.63
> >     > 4096                 1760.14
> >     > 8192                 2166.08
> >     > 16384                2036.95
> >     > 32768                3466.63
> >     > 65536                6296.73
> >     > 131072               7509.43
> >     > 262144               9104.78
> >     > 524288               6908.55
> >     > 1048576              5530.37
> >     > 2097152              4489.16
> >     > 4194304              3498.14
> >     >
> >     > mpirun -hostfile $OAR_NODEFILE  --mca mtl psm2 -mca 
> > btl_openib_allow_ib true ...
> >     >
> >     > - the testcase test_layout_array is not giving correct results
> >     >
> >     > - the bandwidth measured with osu_bw is the right one:
> >     >
> >     > # OSU MPI Bandwidth Test v5.7
> >     > # Size      Bandwidth (MB/s)
> >     > 1                       3.73
> >     > 2                       7.96
> >     > 4                      15.82
> >     > 8                      31.22
> >     > 16                     51.52
> >     > 32                    107.61
> >     > 64                    196.51
> >     > 128                   438.66
> >     > 256                   817.70
> >     > 512                  1593.90
> >     > 1024                 2786.09
> >     > 2048                 4459.77
> >     > 4096                 6658.70
> >     > 8192                 8092.95
> >     > 16384                8664.43
> >     > 32768                8495.96
> >     > 65536               11458.77
> >     > 131072              12094.64
> >     > 262144              11781.84
> >     > 524288              12297.58
> >     > 1048576             12346.92
> >     > 2097152             12206.53
> >     > 4194304             12167.00
> >     >
> >     > But yes, I know openib is deprecated too in 4.0.5.
> >     >
> >     > Patrick
> >     >
> >     >
> >
> >
>
