I am not sure I understand the comment about MPI_T. Each network card has internal counters that can be gathered by any process on the node. Similarly, some information is available from the switches, but I always assumed that information is aggregated across all ongoing jobs. Still, by merging the switch-level information with the MPI-level data, the relevant trend can be highlighted.
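If the goal is just to see what a given installation can report without modifying anything, the MPI_T tools interface lets any MPI program enumerate the performance variables the library exposes (and, if I remember correctly, Open MPI's SPC counters show up in that list once they are enabled). A minimal sketch, using only standard MPI_T calls; the buffer sizes and the rank-0 printing are arbitrary choices for illustration:

#include <stdio.h>
#include <mpi.h>

/* List every MPI_T performance variable the MPI library exposes. */
int main(int argc, char **argv)
{
    int provided, rank, num, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_T_pvar_get_num(&num);
    if (0 == rank) {
        for (i = 0; i < num; i++) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, var_class, bind, readonly, continuous, atomic;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            /* Query the name and description of performance variable i. */
            MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                                &dtype, &enumtype, desc, &desc_len,
                                &bind, &readonly, &continuous, &atomic);
            printf("pvar %3d: %s -- %s\n", i, name, desc);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}

Compiling this with mpicc and running a single rank is enough to see which counters your build actually provides; anything network-related there is whatever the transport components choose to export, not the switch-level view INAM has.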
George.

On Fri, Feb 11, 2022 at 12:43 PM Bertini, Denis Dr. <d.bert...@gsi.de> wrote:

> Maybe I am wrong, but MPI_T seems to aim at internal openMPI parameters, right?
>
> So with which kind of magic can a tool like OSU INAM get info from the network fabric and even the switches related to a particular MPI job ...
>
> There should be more info gathered in the background ....
>
> ------------------------------
> *From:* George Bosilca <bosi...@icl.utk.edu>
> *Sent:* Friday, February 11, 2022 4:25:42 PM
> *To:* Open MPI Users
> *Cc:* Joseph Schuchart; Bertini, Denis Dr.
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>
> Collecting data during execution is possible in OMPI either with an external tool, such as mpiP, or the internal infrastructure, SPC. Take a look at ./examples/spc_example.c or ./test/spc/spc_test.c to see how to use this.
>
> George.
>
> On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users <users@lists.open-mpi.org> wrote:
>
>> I have seen in the OSU INAM paper:
>>
>> "While we chose MVAPICH2 for implementing our designs, any MPI runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection and transmission."
>>
>> But I do not know what is meant by a "modified" openMPI?
>>
>> Cheers,
>> Denis
>>
>> ------------------------------
>> *From:* Joseph Schuchart <schuch...@icl.utk.edu>
>> *Sent:* Friday, February 11, 2022 3:02:36 PM
>> *To:* Bertini, Denis Dr.; Open MPI Users
>> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>>
>> I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work with other MPI implementations? Would be worth investigating...
>>
>> Joseph
>>
>> On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>> >
>> > Hi Joseph
>> >
>> > Looking at MVAPICH I noticed that this MPI implementation provides an InfiniBand Network Analysis and Profiling Tool:
>> >
>> > OSU-INAM
>> >
>> > Is there something equivalent using openMPI?
>> >
>> > Best
>> > Denis
>> >
>> > ------------------------------------------------------------------------
>> > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Joseph Schuchart via users <users@lists.open-mpi.org>
>> > *Sent:* Tuesday, February 8, 2022 4:02:53 PM
>> > *To:* users@lists.open-mpi.org
>> > *Cc:* Joseph Schuchart
>> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>> >
>> > Hi Denis,
>> >
>> > Sorry if I missed it in your previous messages, but could you also try running a different MPI implementation (MVAPICH) to see whether Open MPI is at fault or the system is somehow to blame for it?
>> >
>> > Thanks
>> > Joseph
>> >
>> > On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>> > >
>> > > Hi
>> > >
>> > > Thanks for all this information!
>> > >
>> > > But I have to confess that in this multi-tuning-parameter space I got somehow lost. Furthermore it is sometimes mixing user space and kernel space, and I only have the possibility to act on the user space.
>> > >
>> > > 1) So I have on the system max locked memory:
>> > >    - ulimit -l unlimited (default)
>> > > and I do not see any warnings/errors related to that when launching MPI.
>> > >
>> > > 2) I tried different algorithms for the MPI_Allreduce op., all showing a drop in bw for size=16384.
>> > >
>> > > 4) I disabled openib (no RDMA) and used only TCP, and I noticed the same behaviour.
>> > >
>> > > 3) I realized that increasing the so-called warm-up parameter in the OSU benchmark (argument -x, 200 as default) reduces the discrepancy. On the contrary, setting a lower value (-x 10) can increase this BW discrepancy up to a factor of 300 at message size 16384 compared to message size 8192, for example. So does it mean that there are some caching effects in the internode communication?
>> > >
>> > > From my experience, tuning parameters is a time-consuming and cumbersome task.
>> > >
>> > > Could it also be that the problem is not really in the openMPI implementation but in the system?
>> > >
>> > > Best
>> > > Denis
>> > >
>> > > ------------------------------------------------------------------------
>> > > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Gus Correa via users <users@lists.open-mpi.org>
>> > > *Sent:* Monday, February 7, 2022 9:14:19 PM
>> > > *To:* Open MPI Users
>> > > *Cc:* Gus Correa
>> > > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>> > >
>> > > This may have changed since, but these used to be relevant points. Overall, the Open MPI FAQ has lots of good suggestions:
>> > > https://www.open-mpi.org/faq/
>> > > some specific to performance tuning:
>> > > https://www.open-mpi.org/faq/?category=tuning
>> > > https://www.open-mpi.org/faq/?category=openfabrics
>> > >
>> > > 1) Make sure you are not using the Ethernet TCP/IP, which is widely available in compute nodes:
>> > > mpirun --mca btl self,sm,openib ...
>> > > https://www.open-mpi.org/faq/?category=tuning#selecting-components
>> > > However, this may have changed lately:
>> > > https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
>> > >
>> > > 2) Maximum locked memory used by IB and its system limit. Start here:
>> > > https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
>> > >
>> > > 3) The eager vs. rendezvous message size threshold. I wonder if it may sit right where you see the latency spike.
>> > > https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
>> > >
>> > > 4) Processor and memory locality/affinity and binding (please check the current options and syntax):
>> > > https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>> > >
>> > > On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users <users@lists.open-mpi.org> wrote:
>> > >
>> > >     Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>> > >
>> > >     mpirun --verbose --display-map
>> > >
>> > >     Have you tried newer OpenMPI versions?
>> > >
>> > >     Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
>> > >
>> > >     Typically internal buffer sizes as well as your hardware will affect performance. Can you give specifications similar to what is available at:
>> > >     http://mvapich.cse.ohio-state.edu/performance/collectives/
>> > >     where the operating system, switch, node type and memory are indicated.
>> > >
>> > >     If you need good performance, you may want to also specify the algorithm used. You can find some of the parameters you can tune using:
>> > >
>> > >     ompi_info --all
>> > >
>> > >     A particularly helpful parameter is:
>> > >
>> > >     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>> > >                     Which allreduce algorithm is used. Can be locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
>> > >                     Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping", 3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
>> > >     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data source: default, level: 5 tuner/detail, type: int)
>> > >
>> > >     For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.
>> > >
>> > >     [1] https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
>> > >     [2] https://github.com/open-mpi/ompi-collectives-tuning
>> > >
>> > >     On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
>> > > > Hi
>> > > >
>> > > > When I repeat, I always get the huge discrepancy at the message size of 16384.
>> > > >
>> > > > Maybe there is a way to run MPI in verbose mode in order to further investigate this behaviour?
>> > > >
>> > > > Best
>> > > > Denis
>> > > >
>> > > > ------------------------------------------------------------------------
>> > > > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Benson Muite via users <users@lists.open-mpi.org>
>> > > > *Sent:* Monday, February 7, 2022 2:27:34 PM
>> > > > *To:* users@lists.open-mpi.org
>> > > > *Cc:* Benson Muite
>> > > > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>> > > >
>> > > > Hi,
>> > > > Do you get similar results when you repeat the test? Another job could have interfered with your run.
>> > > > Benson
>> > > >
>> > > > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>> > > >> Hi
>> > > >>
>> > > >> I am using the OSU microbenchmarks compiled with openMPI 3.1.6 in order to check/benchmark the InfiniBand network for our cluster.
>> > > >>
>> > > >> For that I use the collective all_reduce benchmark and run over 200 nodes, using 1 process per node.
>> > > >>
>> > > >> And these are the results I obtained 😎
>> > > >>
>> > > >> ################################################################
>> > > >> # OSU MPI Allreduce Latency Test v5.7.1
>> > > >> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
>> > > >> 4                     114.65             83.22            147.98         1000
>> > > >> 8                     133.85            106.47            164.93         1000
>> > > >> 16                    116.41             87.57            150.58         1000
>> > > >> 32                    112.17             93.25            130.23         1000
>> > > >> 64                    106.85             81.93            134.74         1000
>> > > >> 128                   117.53             87.50            152.27         1000
>> > > >> 256                   143.08            115.63            173.97         1000
>> > > >> 512                   130.34            100.20            167.56         1000
>> > > >> 1024                  155.67            111.29            188.20         1000
>> > > >> 2048                  151.82            116.03            198.19         1000
>> > > >> 4096                  159.11            122.09            199.24         1000
>> > > >> 8192                  176.74            143.54            221.98         1000
>> > > >> 16384               48862.85          39270.21          54970.96         1000
>> > > >> 32768                2737.37           2614.60           2802.68         1000
>> > > >> 65536                2723.15           2585.62           2813.65         1000
>> > > >> ####################################################################
>> > > >>
>> > > >> Could someone explain to me what is happening for message size = 16384?
>> > > >> One can notice a huge latency (~300 times larger) compared to message size = 8192.
>> > > >> I do not really understand what could create such an increase in the latency.
>> > > >> The reason I use the OSU microbenchmarks is that we sporadically experience a drop in the bandwidth for typical collective operations such as MPI_Reduce in our cluster, which is difficult to understand.
>> > > >> I would be grateful if somebody could share their expertise on such a problem with me.
>> > > >>
>> > > >> Best,
>> > > >> Denis
>> > > >>
>> > > >> ---------
>> > > >> Denis Bertini
>> > > >> Abteilung: CIT
>> > > >> Ort: SB3 2.265a
>> > > >>
>> > > >> Tel: +49 6159 71 2240
>> > > >> Fax: +49 6159 71 2986
>> > > >> E-Mail: d.bert...@gsi.de
>> > > >>
>> > > >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>> > > >> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de <http://www.gsi.de>
>> > > >>
>> > > >> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
>> > > >> Managing Directors / Geschäftsführung:
>> > > >> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>> > > >> Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
>> > > >> Ministerialdirigent Dr. Volkmar Dietz
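P.S. Since the coll_tuned parameters came up in the quoted messages: if you want to pin the allreduce algorithm for a run instead of letting the tuned component decide, the command line would look roughly like the following (a sketch, not verified on your system; algorithm 4 = ring is only an example value, and the 200 ranks mapped one per node match the setup described above):

mpirun -np 200 --map-by node \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    ./osu_allreduce -x 200 -i 1000

It would also be worth comparing the eager/rendezvous switch point of the BTL you actually use (ompi_info --all | grep eager_limit) against the 16384-byte size where the latency explodes, as Gus already suggested.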