Hi

Thanks a lot for all the info!

Very interesting, thanks!

We basically use AMD EPYC processors:


>>

vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7551 32-Core Processor
stepping : 2
microcode : 0x8001250
cpu MHz : 2000.000
cache size : 512 KB
physical id : 1
siblings : 64
core id : 31
cpu cores : 32
apicid : 127
initial apicid : 127
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
>>
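
In case the full cache hierarchy matters: the "cache size" field in
/proc/cpuinfo above only reports a single level (512 KB per core here),
so to list the L1/L2/L3 sizes on a node one could run, for example:

lscpu | grep -i cache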

The number of cores per node can vary, though (32 or 64).

So according to your calculation, a message of 16384 bytes should fit in the cache?

By the way, is the size 16384 bytes, or 16384 double-precision values, i.e. 16384*8 bytes?
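
Just to lay out the two readings explicitly (assuming the OSU "# Size"
column is in bytes, and taking the 96K L1 figure from the mail below;
the variable names are only illustrative):

# Back-of-the-envelope check of both readings of "16384".
size_as_bytes=16384                  # 16 KiB, i.e. 2048 doubles
size_as_doubles=$((16384 * 8))       # 131072 bytes = 128 KiB
l1_per_core=$((96 * 1024))           # 98304 bytes
echo "$size_as_bytes $size_as_doubles $l1_per_core"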

Best

Denis


________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Benson Muite via 
users <users@lists.open-mpi.org>
Sent: Tuesday, February 8, 2022 11:47:18 AM
To: users@lists.open-mpi.org
Cc: Benson Muite
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

On 2/8/22 11:06 AM, Bertini, Denis Dr. via users wrote:
> Hi
>
> Thanks for all this information!
>
>
> But I have to confess that in this multi-parameter tuning space
>
> I got somewhat lost.
>
> Furthermore, it sometimes mixes user space and kernel space.
>
> I only have the possibility to act on the user space.
Ok. If you are doing the test to check your system, you should tune
for typical applications rather than for one function call with a
specific message size. Since you can change the OpenMPI default
settings for the allreduce algorithm, you may wish to run tests and
choose a setting that works well for most of your users. You may also
wish to make OpenMPI 4.1 the default, so perhaps run the tests on that
version as well.
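
A possible way to compare the algorithms is sketched below; the hostfile
and the benchmark path are placeholders, and as far as I know
coll_tuned_use_dynamic_rules must be set for the forced algorithm to
take effect:

# Sweep the tuned allreduce algorithms with the OSU benchmark,
# one process per node (hostfile and binary path are placeholders).
for alg in 0 1 2 3 4 5 6; do
    echo "=== coll_tuned_allreduce_algorithm=$alg ==="
    mpirun --hostfile ./hosts --map-by ppr:1:node \
           --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_allreduce_algorithm $alg \
           ./osu_allreduce -x 200 -i 1000
done
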
>
>
> 1) So on the system the max locked memory is:
>
>                          - ulimit -l unlimited (default)
>
>    and I do not see any warnings/errors related to that when launching MPI.
>
>
> 2) I tried different algorithms for the MPI_Allreduce op., all showing a drop in
>
> BW for size=16384.
The drops are of different magnitude depending on the algorithm used:
the default gives a worst-case latency of 54970.96 us and round robin
gives a worst-case latency of 4992.04 us for a size of 16384. It may be
helpful to indicate what hardware you are using, both the chip (cache
sizes will be important) and the interconnect. Perhaps also try the test
on 2 or 4 nodes.
>
>
> 4) I disabled openib (no RDMA) and used only TCP, and I noticed
>
> the same behaviour.
This suggests it is the chip, rather than the interconnect.
>
>
> 3) I realized that increasing the so-called warm-up parameter in the
>
> OSU benchmark (argument -x 200 as default) reduces the discrepancy.
>
> On the contrary, setting a lower value (-x 10) can increase this BW
>
> discrepancy by up to a factor of 300 at message size 16384 compared to
>
> message size 8192, for example.
>
> So does this mean that there are some caching effects
>
> in the inter-node communication?
>
>
Probably. If you are using AMD 7551P nodes, these have 96K of L1 cache
per core. A message of 16384 double-precision values uses 128K (131072
bytes), so it will not fit in L1 cache, while a message of 8192 uses 64K
and will fit in L1 cache. Perhaps try the same test on Intel Xeon
E5-2680 or 6248R nodes.

Some relevant studies are:
Zhong, Cao, Bosilca and Dongarra, "Using long vector extensions for MPI
reductions", https://doi.org/10.1016/j.parco.2021.102871

Hashmi, Chakraborty, Bayatpour, Subramoni and Panda "Designing Shared
Address Space MPI libraries in the Many-core Era",
https://jahanzeb-hashmi.github.io/files/talks/ipdps18.pdf

Saini, Mehrotra, Taylor, Shende and Biswas, "Performance Analysis of
Scientific and Engineering Applications Using MPInside and TAU",
https://ntrs.nasa.gov/api/citations/20100038444/downloads/20100038444.pdf

The second study, by Hashmi et al., focuses on intra-node communication,
but has a nice performance model that demonstrates understanding of the
communication pattern. For typical use of MPI on a particular cluster,
such a detailed understanding is likely not necessary. These studies
also collect hardware performance information.

> From my experience, tuning parameters is a time-consuming and cumbersome
> task.
>
> Could it also be that the problem is not really in the openMPI
> implementation but in the system?
The default OpenMPI parameters may need to be adjusted for a good user
experience on your system, but demanding users will probably do this for
their specific applications. By changing the allreduce algorithm, you got
a factor of 10 improvement in the benchmark for a size of 16384. Perhaps
determine which MPI calls are used most often on your cluster, and
provide a guide on how OpenMPI can be tuned for these. Alternatively, if
you have a set of heavily used applications, profile them to determine
the most-used MPI calls and then set defaults that would improve
application performance.
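
Once a good setting is found, one way to make it the cluster-wide
default is Open MPI's MCA parameter file (a sketch; the install prefix
and the algorithm number, 4 = ring in the listing further down, are just
placeholders):

# Append cluster-wide defaults to the MCA parameter file.
cat >> /opt/openmpi/etc/openmpi-mca-params.conf <<'EOF'
coll_tuned_use_dynamic_rules = 1
coll_tuned_allreduce_algorithm = 4
EOF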

Also check whether your InfiniBand switch provider offers any
performance measurement tools that allow you to verify correct
functionality at the single-switch level.
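
Independently of the vendor tools, the standard InfiniBand diagnostics
(if installed, and run with sufficient privileges) give a first check of
links and error counters, for example:

# Fabric-level sanity checks.
iblinkinfo       # link state and speed of every port in the fabric
perfquery        # traffic and error counters of the local port
ibdiagnet        # overall fabric diagnostic report
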
>
>
> Best
>
> Denis
>
> ------------------------------------------------------------------------
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Gus Correa
> via users <users@lists.open-mpi.org>
> *Sent:* Monday, February 7, 2022 9:14:19 PM
> *To:* Open MPI Users
> *Cc:* Gus Correa
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
> This may have changed since, but these used to be relevant points.
> Overall, the Open MPI FAQ has lots of good suggestions:
> https://www.open-mpi.org/faq/
> some specifically for performance tuning:
> https://www.open-mpi.org/faq/?category=tuning
> https://www.open-mpi.org/faq/?category=openfabrics
>
> 1) Make sure you are not using Ethernet TCP/IP, which is widely
> available on compute nodes:
>
> mpirun  --mca btl self,sm,openib  ...
>
> https://www.open-mpi.org/faq/?category=tuning#selecting-components
>
> However, this may have changed lately:
> https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
>
> 2) Maximum locked memory used by IB and its system limit. Start here:
> https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
>
> 3) The eager vs. rendezvous message size threshold.
> I wonder if it may sit right where you see the latency spike.
> https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
>
> 4) Processor and memory locality/affinity and binding (please check the
> current options and syntax)
> https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>
>
> On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
> <users@lists.open-mpi.org> wrote:
>
>     Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
>     mpirun --verbose --display-map
>
>     Have you tried newer OpenMPI versions?
>
>     Do you get similar behavior for the osu_reduce and osu_gather
>     benchmarks?
>
>     Typically internal buffer sizes as well as your hardware will affect
>     performance. Can you give specifications similar to what is
>     available at:
>     http://mvapich.cse.ohio-state.edu/performance/collectives/
>     where the operating system, switch, node type and memory are indicated.
>
>     If you need good performance, you may also want to specify the algorithm
>     used. You can find some of the parameters you can tune using:
>
>     ompi_info --all
>
>     A particularly helpful parameter is:
>
>     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm"
>       (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>       Which allreduce algorithm is used. Can be locked down to any of:
>       0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast),
>       3 recursive doubling, 4 ring, 5 segmented ring
>       Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping",
>       3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
>
>     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize"
>       (current value: "0", data source: default, level: 5 tuner/detail, type: int)
>
>     For OpenMPI 4.0, there is a tuning program [2] that might also be
>     helpful.
>
>     [1]
>     https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
>     [2] https://github.com/open-mpi/ompi-collectives-tuning
>
>     On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
>      > Hi
>      >
>      > When I repeat it, I always get the huge discrepancy at the
>      >
>      > message size of 16384.
>      >
>      > Maybe there is a way to run MPI in verbose mode in order
>      >
>      > to further investigate this behaviour?
>      >
>      > Best
>      >
>      > Denis
>      >
>      >
>     ------------------------------------------------------------------------
>      > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Benson
>      > Muite via users <users@lists.open-mpi.org>
>      > *Sent:* Monday, February 7, 2022 2:27:34 PM
>      > *To:* users@lists.open-mpi.org
>      > *Cc:* Benson Muite
>      > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
>      > Infiniband network
>      > Hi,
>      > Do you get similar results when you repeat the test? Another job
>      > could have interfered with your run.
>      > Benson
>      > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>      >> Hi
>      >>
>      >> I am using the OSU microbenchmarks compiled with openMPI 3.1.6 in
>      >> order to check/benchmark the InfiniBand network for our cluster.
>      >>
>      >> For that I use the collective allreduce benchmark and run over 200
>      >> nodes, using 1 process per node.
>      >>
>      >> And these are the results I obtained:
>      >>
>      >>
>      >>
>      >> ################################################################
>      >>
>      >> # OSU MPI Allreduce Latency Test v5.7.1
>      >> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
>      >> 4                     114.65             83.22            147.98        1000
>      >> 8                     133.85            106.47            164.93        1000
>      >> 16                    116.41             87.57            150.58        1000
>      >> 32                    112.17             93.25            130.23        1000
>      >> 64                    106.85             81.93            134.74        1000
>      >> 128                   117.53             87.50            152.27        1000
>      >> 256                   143.08            115.63            173.97        1000
>      >> 512                   130.34            100.20            167.56        1000
>      >> 1024                  155.67            111.29            188.20        1000
>      >> 2048                  151.82            116.03            198.19        1000
>      >> 4096                  159.11            122.09            199.24        1000
>      >> 8192                  176.74            143.54            221.98        1000
>      >> 16384               48862.85          39270.21          54970.96        1000
>      >> 32768                2737.37           2614.60           2802.68        1000
>      >> 65536                2723.15           2585.62           2813.65        1000
>      >>
>      >> ####################################################################
>      >>
>      >> Could someone explain to me what is happening for message size = 16384?
>      >> One can notice a huge latency (~300 times larger) compared to message
>      >> size = 8192.
>      >> I do not really understand what could create such an increase in the
>      >> latency.
>      >> The reason I use the OSU microbenchmarks is that we sporadically
>      >> experience a drop in the bandwidth for typical collective operations
>      >> such as MPI_Reduce on our cluster, which is difficult to understand.
>      >> I would be grateful if somebody could share their expertise on such a
>      >> problem with me.
>      >>
>      >> Best,
>      >> Denis
>      >>
>      >>
>      >>
>      >> ---------
>      >> Denis Bertini
>      >> Department: CIT
>      >> Location: SB3 2.265a
>      >>
>      >> Tel: +49 6159 71 2240
>      >> Fax: +49 6159 71 2986
>      >> E-Mail: d.bert...@gsi.de
>      >>
>      >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>      >> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
>      >>
>      >> Commercial Register / Handelsregister: Amtsgericht Darmstadt,
>     HRB 1528
>      >> Managing Directors / Geschäftsführung:
>      >> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>      >> Chairman of the GSI Supervisory Board / Vorsitzender des
>     GSI-Aufsichtsrats:
>      >> Ministerialdirigent Dr. Volkmar Dietz
>      >>
>      >
>
