Dear Gilles,

                     Reposting your comments along with my replies to the 
mailing list for everybody to view and react to.



I am seeing a significant performance degradation between Open MPI 3.1.1 and the top of the v3.1.x branch when running on a large number of cores.

The performance is the same between 4.1.0 and the top of the v3.1.x branch.

I am now running git bisect to find out when this started happening.

I am finally relieved and happy that you could reproduce and acknowledge this 
regression!

Do I need to file any bug officially anywhere?



IIRC, I noted an xpmem error in your logs (which means xpmem is not being used).

The root cause could be that the xpmem kernel module is not loaded, or that the 
permissions on the device are incorrect. As Nathan pointed out, xpmem is likely 
to give the best performance, so while I am running git bisect, I invite you 
to fix your xpmem issue and see how this impacts performance.

Sure, I will try to fix the xpmem error and check its impact on performance.
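
As a first step, I plan to sanity-check the xpmem setup with something along 
these lines (a rough sketch, assuming the standard /dev/xpmem device path and 
that the build was configured with --with-xpmem):

$ lsmod | grep xpmem                # is the xpmem kernel module loaded?
$ ls -l /dev/xpmem                  # does the device exist, and are its permissions usable by my user?
$ ompi_info --all | grep -i xpmem   # was this Open MPI build actually configured with xpmem?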



With Regards,

S. Biplab Raut



-----Original Message-----
From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Sent: Sunday, March 14, 2021 8:45 AM
To: Raut, S Biplab <biplab.r...@amd.com>
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 
?






I am seeing a significant performance degradation between Open MPI 3.1.1 and the top of the v3.1.x branch when running on a large number of cores.

The performance is the same between 4.1.0 and the top of the v3.1.x branch.

I am now running git bisect to find out when this started happening.



IIRC, I noted an xpmem error in your logs (which means xpmem is not being used).

The root cause could be that the xpmem kernel module is not loaded, or that the 
permissions on the device are incorrect. As Nathan pointed out, xpmem is likely 
to give the best performance, so while I am running git bisect, I invite you 
to fix your xpmem issue and see how this impacts performance.



Cheers,



Gilles



On Sat, Mar 13, 2021 at 12:08 AM Raut, S Biplab <biplab.r...@amd.com> wrote:

>

> Dear Gilles,

>

>                     Please check my replies inline.

>

>

>

> >>> Can you please post the output of

>

> >>> ompi_info --param btl vader --level 3

>

> >>> with both Open MPI 3.1 and 4.1?

>

> openMPI3.1.1

>

> ------------------

>

> $ ompi_info --param btl vader --level 3

>

>                  MCA btl: vader (MCA v2.1.0, API v3.0.0, Component

> v3.1.1)

>

>            MCA btl vader:

> ---------------------------------------------------

>

>            MCA btl vader: parameter "btl_vader_single_copy_mechanism"

>

>                           (current value: "cma", data source: default, level:

>

>                           3 user/all, type: int)

>

>                           Single copy mechanism to use (defaults to

> best

>

>                           available)

>

>                           Valid values: 1:"cma", 3:"none"

>

> openMPI4.1.0

>

> ------------------

>

> $ ompi_info --param btl vader --level 3

>

>                  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component

> v4.1.0)

>

>            MCA btl vader:

> ---------------------------------------------------

>

>            MCA btl vader: parameter "btl_vader_single_copy_mechanism"

>

>                           (current value: "cma", data source: default, level:

>

>                           3 user/all, type: int)

>

>                           Single copy mechanism to use (defaults to

> best

>

>                           available)

>

>                           Valid values: 1:"cma", 4:"emulated", 3:"none"

>

>            MCA btl vader: parameter "btl_vader_backing_directory"

> (current

>

>                           value: "/dev/shm", data source: default,

> level: 3

>

>                           user/all, type: string)

>

>                           Directory to place backing files for shared

> memory

>

>                           communication. This directory should be on a

> local

>

>                           filesystem such as /tmp or /dev/shm (default:

>

>                           (linux) /dev/shm, (others) session

> directory)
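
> (Side note: the single-copy mechanism reported above can also be pinned explicitly at run time - a rough sketch, reusing the parameter shown in this output, which may help rule out a change in the default selection between the two versions:

> $ mpirun --map-by core --rank-by core --bind-to core -np 128 --mca pml ob1 --mca btl vader,self --mca btl_vader_single_copy_mechanism cma ./mpi-bench ic1000000

> )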

>

>

>

> >>> What if you run with only 2 MPI ranks?

>

> >>> do you observe similar performance differences between Open MPI 3.1 and 
> >>> 4.1?

>

> When I run only 2 MPI ranks, the performance regression is not significant.

>

> openMPI3.1.1 gives MFLOPS: 11122

>

> openMPI4.1.0 gives MFLOPS: 11041
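
> (To narrow down the rank count at which the gap appears, I could sweep the number of ranks with the same problem size - a simple sketch:

> $ for np in 2 4 8 16 32 64 128; do mpirun --map-by core --bind-to core -np $np ./mpi-bench ic1000000; done

> )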

>

>

>

> With Regards,

>

> S. Biplab Raut

>

>

>

> From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>

> Sent: Friday, March 12, 2021 7:07 PM

> To: Raut, S Biplab <biplab.r...@amd.com>

> Subject: Re: [OMPI users] Stable and performant openMPI version for 
> Ubuntu20.04 ?

>

>

>


>

> Can you please post the output of

>

> ompi_info --param btl vader --level 3

>

> with both Open MPI 3.1 and 4.1?

>

>

>

> What if you run with only 2 MPI ranks?

>

> do you observe similar performance differences between Open MPI 3.1 and 4.1?

>

>

>

> Cheers,

>

> Gilles

>

>

>

> On Fri, Mar 12, 2021 at 6:31 PM Raut, S Biplab <biplab.r...@amd.com> wrote:

>

> Dear Gilles,

>

>                     Thank you for the reply.

>

>

>

> >>> when running

>

> >>> mpirun --map-by core -rank-by core --bind-to core --mca pml ob1

> >>> --mca btl vader,self ./mpi-bench ic1000000

>

> >>> I got similar flops with Open MPI 3.1.1, 3.1.6, 4.1.0 and 4.1.1rc1

> >>> on my system

>

> >>> If you are using a different command line, please let me know and

> >>> I will give it a try

>

> Although the command line that I normally use is different, I also ran with the above 
> command line as used by you.

>

> I still find that openMPI4.1.0 performs poorly compared to openMPI3.1.1. Please 
> check the details below. I have also provided my system details in case they matter.

>

> openMPI3.1.1

>

> -------------------

>

> $ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1

> --mca btl vader,self ./mpi-bench ic1000000

>

> Problem: ic1000000, setup: 552.20 ms, time: 1.33 ms, ``mflops'': 75143

>

> $ ompi_info --all|grep 'command line'

>

>   Configure command line: '--prefix=/home/server/ompi3/gcc' 
> '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' 
> '--enable-static=yes' '--enable-mpi1-compatibility'

>

>                           User-specified command line parameters

> passed to ROMIO's configure script

>

>                           Complete set of command line parameters

> passed to ROMIO's configure script

>

>

>

> openMPI4.1.0

>

> -------------------

>

> $ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1

> --mca btl vader,self ./mpi-bench ic1000000

>

> Problem: ic1000000, setup: 557.12 ms, time: 1.75 ms, ``mflops'': 57029

>

> $ ompi_info --all|grep 'command line'

>

>   Configure command line: '--prefix=/home/server/ompi4_plain' 
> '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' 
> '--enable-static=yes' '--enable-mpi1-compatibility'

>

>                           User-specified command line parameters

> passed to ROMIO's configure script

>

>                           Complete set of command line parameters

> passed to ROMIO's configure script

>

>

>

> openMPI4.1.0 + xpmem

>

> --------------------------------

>

> $ mpirun --map-by core -rank-by core --bind-to core --mca pml ob1

> --mca btl vader,self ./mpi-bench ic1000000

>

> ----------------------------------------------------------------------

> ----

>

> WARNING: Could not generate an xpmem segment id for this process'

>

> address space.

>

> The vader shared memory BTL will fall back on another single-copy

>

> mechanism if one is available. This may result in lower performance.

>

>   Local host: lib-daytonax-03

>

>   Error code: 2 (No such file or directory)

>

> ----------------------------------------------------------------------

> ----

>

> Problem: ic1000000, setup: 559.55 ms, time: 1.77 ms, ``mflops'': 56280

>

> $ ompi_info --all|grep 'command line'

>

>   Configure command line: '--prefix=/home/server/ompi4_xmem' 
> '--with-xpmem=/opt/xpmm' '--enable-mpi-fortran' '--enable-mpi-cxx' 
> '--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'

>

>                           User-specified command line parameters

> passed to ROMIO's configure script

>

>                           Complete set of command line parameters

> passed to ROMIO's configure script

>

>

>

> Other System Config

>

> ----------------------------

> $ cat /etc/os-release

>

> NAME="Ubuntu"

>

> VERSION="20.04 LTS (Focal Fossa)"

>

> $ gcc -v

>

> gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

>

> DRAM:- 1TB DDR4-3200 MT/s RDIMM memory

>

>

>

> The recommended command line to run would be as below:-

>

> mpirun --map-by core --rank-by core --bind-to core --mca pml ob1 --mca

> btl vader,self ./mpi-bench -owisdom -opatient -r1000 -s icf1000000

>

> (Here, -opatient allows the use of the best kernel/algorithm plan,
>
>             -r1000 runs the test for 1000 iterations to avoid run-to-run variation,
>
>             -owisdom removes the first-time setup overhead/time when the "mpirun command line" is executed the next time.)

>

>

>

> Please let me know if you need any other details to analyze this performance 
> regression.

>

>

>

> With Regards,

>

> S. Biplab Raut

>

>

>

> From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>

> Sent: Friday, March 12, 2021 12:46 PM

> To: Raut, S Biplab <biplab.r...@amd.com>

> Subject: Re: [OMPI users] Stable and performant openMPI version for 
> Ubuntu20.04 ?

>

>

>


>

> when running

>

>

>

> mpirun --map-by core -rank-by core --bind-to core --mca pml ob1 --mca

> btl vader,self ./mpi-bench ic1000000

>

>

>

> I got similar flops with Open MPI 3.1.1, 3.1.6, 4.1.0 and 4.1.1rc1 on

> my system

>

>

>

> If you are using a different command line, please let me know and I

> will give it a try

>

>

>

> Cheers,

>

>

>

> Gilles

>

>

>

>

>

> On Fri, Mar 12, 2021 at 3:20 PM Raut, S Biplab <biplab.r...@amd.com> wrote:

>

> Reposting here without the logs - it seems there is a message size limit of 
> 150KB, so I could not attach the logs.

>

> (I request the moderator to approve the original mail, which has the compressed 
> logs attached.)

>

>

>

> My main concern in moving from ompi3.1.1 to ompi4.1.0: why does ompi4.1.0 
> perform poorly compared to ompi3.1.1 for some test sizes?

>

>

>

> I ran "FFTW MPI bench binary" in verbose mode "10" (as suggested by Gilles) 
> for below three cases and confirmed that btl/vader is used by default.

>

> The FFTW MPI test for a 1D problem size (1000000) is run on a single node as below:-

>

> mpirun --map-by core --rank-by core --bind-to core -np 128

> <fftw/mpi/bench program binary> <program binary options for problem

> size 1000000 >

>

>

>

> The three test cases are described below:- the test run with openMPI3.1.1 
> performs the best.

>

> Test run on Ubuntu20.04 and stock openMPI3.1.1: gives mflops: 76978
>
> Test run on Ubuntu20.04 and stock openMPI4.1.1: gives mflops: 56205
>
> Test run on Ubuntu20.04 and openMPI4.1.1 configured with xpmem: gives mflops: 56411

>

>

>

> Please check more details in the below mail chain.

>

>

>

> P.S:

>

> The FFTW MPI bench test binary can be compiled from the sources at 
> https://github.com/amd/amd-fftw or https://github.com/FFTW/fftw3.

>

>

>

> With Regards,

>

> S. Biplab Raut

>

>

>

> From: Raut, S Biplab

> Sent: Thursday, March 11, 2021 5:45 PM

> To: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>

> Subject: RE: [OMPI users] Stable and performant openMPI version for 
> Ubuntu20.04 ?

>

>

>

> Oh okay, got you. Please check the details below.

>

>

>

> $ ompi_info --all|grep 'command line'

>

>   Configure command line: '--prefix=/home/amd/ompi4_plain' 
> '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' 
> '--enable-static=yes' '--enable-mpi1-compatibility'

>

>                           User-specified command line parameters

> passed to ROMIO's configure script

>

>                           Complete set of command line parameters

> passed to ROMIO's configure script

>

>

>

>

>

> For your other questions, please check my reply inline.

>

>

>

> >>> did you  have any chance to profile the benchmark to understand where the 
> >>> extra time is spent?

>

> >>> (point to point? collective? communicator creation? other?)

>

> The application binary uses point-to-point communication - isend and irecv 
> with wait.

>

> Please check the below "perf report" hotspots:-

>

> Overhead  Command          Shared Object           Symbol

>

>   58.54%  mpi-bench        libopen-pal.so.40.30.0  [.] 
> mca_btl_vader_component_progress

>

>    4.59%  mpi-bench        libopen-pal.so.40.30.0  [.] mca_btl_vader_send

>

>    4.43%  mpi-bench        libopen-pal.so.40.30.0  [.] 
> mca_btl_vader_poll_handle_frag

>

>    1.50%  mpi-bench        libmpi.so.40.30.0       [.] mca_pml_ob1_irecv

>

>    1.33%  mpi-bench        libmpi.so.40.30.0       [.] mca_pml_ob1_isend
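
> (For reference, a per-rank profile like the one above can be collected along these lines - a rough sketch, assuming Linux perf is available and relying on the OMPI_COMM_WORLD_RANK environment variable that mpirun sets for each rank:

> $ mpirun --map-by core --rank-by core --bind-to core -np 128 sh -c 'exec perf record -o perf.rank${OMPI_COMM_WORLD_RANK}.data ./mpi-bench -owisdom -opatient -r1000 -s icf1000000'

> $ perf report -i perf.rank0.data

> )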

>

>

>

> >>> have you tried running well known benchmarks such as IMB or OSU?

>

> >>> it would be interesting to understand where are the significant

> >>> differences and the minimum number of MPI tasks

>

> >>> required to exhibit them.

>

> I have not run them. I wonder whether application developers like me really 
> have to explore these benchmarks to design MPI-based applications.
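
> (If it turns out to be necessary, my understanding is that a minimal point-to-point comparison with the OSU micro-benchmarks would look roughly like the following, assuming osu_latency and osu_bw are built against each Open MPI install:

> $ mpirun -np 2 --bind-to core ./osu_latency

> $ mpirun -np 2 --bind-to core ./osu_bw

> )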

>

>

>

> With Regards,

>

> S. Biplab Raut

>

>

>

> From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>

> Sent: Thursday, March 11, 2021 5:16 PM

> To: Raut, S Biplab <biplab.r...@amd.com>

> Subject: Re: [OMPI users] Stable and performant openMPI version for 
> Ubuntu20.04 ?

>

>

>


>

> I am not aware of such performance issue.

>

>

>

> can you post the output of

>

> ompi_info --all | grep 'command line'

>

>

>

>

>

> did you  have any chance to profile the benchmark to understand where the 
> extra time is spent?

>

> (point to point? collective? communicator creation? other?)

>

>

>

> have you tried running well known benchmarks such as IMB or OSU?

>

> it would be interesting to understand where are the significant

> differences and the minimum number of MPI tasks

>

> required to exhibit them.

>

>

>

> Cheers,

>

>

>

> Gilles

>

> On Thu, Mar 11, 2021 at 8:33 PM Raut, S Biplab <biplab.r...@amd.com> wrote:

>

> Dear Gilles,

>

>                    Running with "mpirun --mca coll ^han" does not change the 
> performance much.

>

>

>

> mpirun --map-by core --rank-by core --bind-to core -np 128

> .libs/mpi-bench -owisdom -opatient -r1000 -s icf1000000

>

> Problem: icf1000000, setup: 2.12 ms, time: 1.75 ms, ``mflops'': 56838

>

> mpirun --mca coll ^han  --map-by core --rank-by core --bind-to core

> -np 128 .libs/mpi-bench -owisdom -opatient -r1000 -s icf1000000

>

> Problem: icf1000000, setup: 2.22 ms, time: 1.75 ms, ``mflops'': 57021

>

>

>

> By the way, reiterating my original question, is there any known performance 
> issue with openMPI4.x compared to openMPI3.1.1?

>

> P.S: Let me repost these numbers on the forum for others to comment.

>

>

>

> With Regards,

>

> S. Biplab Raut

>

>

>


>

> >

>

> > Dear Experts,

>

> >                         Until recently, I was using openMPI3.1.1 to run a 
> > single-node, 128-rank MPI application on Ubuntu18.04 and Ubuntu19.04.

>

> > But now the OS on these machines has been upgraded to Ubuntu20.04, and I have 
> > been observing program hangs with the openMPI3.1.1 version.

>

> > So, I tried the openMPI4.0.5 version - the program ran properly without 
> > any issues, but there is a performance regression in my application.

>

> >

>

> > Can you let me know the stable openMPI version recommended for Ubuntu20.04 that 
> > has no known regression compared to v3.1.1?

>

> >

>

> > With Regards,

>

> > S. Biplab Raut

>

> >

>

> >
