I am not sure I understand the comment about MPI_T. Each network card has internal counters that can be gathered by any process on the node. Similarly, some information is available from the switches, but I always assumed that information is aggregated across all ongoing jobs. Still, by merging the switch-level information with the MPI-level data, the relevant trend can be highlighted.
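If the goal is just to see what a given installation can report without modifying anything, the MPI_T tools interface lets any MPI program enumerate the performance variables the library exposes (and, if I remember correctly, Open MPI's SPC counters show up in that list once they are enabled). A minimal sketch, using only standard MPI_T calls; the buffer sizes and the rank-0 printing are arbitrary choices for illustration:

#include <stdio.h>
#include <mpi.h>

/* List every MPI_T performance variable the MPI library exposes. */
int main(int argc, char **argv)
{
    int provided, rank, num, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_T_pvar_get_num(&num);
    if (0 == rank) {
        for (i = 0; i < num; i++) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, var_class, bind, readonly, continuous, atomic;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            /* Query the name and description of performance variable i. */
            MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                                &dtype, &enumtype, desc, &desc_len,
                                &bind, &readonly, &continuous, &atomic);
            printf("pvar %3d: %s -- %s\n", i, name, desc);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}

Compiling this with mpicc and running a single rank is enough to see which counters your build actually provides; anything network-related there is whatever the transport components choose to export, not the switch-level view INAM has.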
George.

On Fri, Feb 11, 2022 at 12:43 PM Bertini, Denis Dr. <d.bert...@gsi.de> wrote:

> Maybe I am wrong, but MPI_T seems to aim at internal openMPI parameters, right?
>
> So with which kind of magic can a tool like OSU INAM get info from the network fabric and even the switches related to a particular MPI job ...
>
> There should be more info gathered in the background ....
>
> ------------------------------
> *From:* George Bosilca <bosi...@icl.utk.edu>
> *Sent:* Friday, February 11, 2022 4:25:42 PM
> *To:* Open MPI Users
> *Cc:* Joseph Schuchart; Bertini, Denis Dr.
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>
> Collecting data during execution is possible in OMPI either with an external tool, such as mpiP, or the internal infrastructure, SPC. Take a look at ./examples/spc_example.c or ./test/spc/spc_test.c to see how to use this.
>
> George.
>
> On Fri, Feb 11, 2022 at 9:43 AM Bertini, Denis Dr. via users <users@lists.open-mpi.org> wrote:
>
>> I have seen in the OSU INAM paper:
>>
>> "While we chose MVAPICH2 for implementing our designs, any MPI runtime (e.g.: OpenMPI [12]) can be modified to perform similar data collection and transmission."
>>
>> But I do not know what is meant by a "modified" openMPI?
>>
>> Cheers,
>> Denis
>>
>> ------------------------------
>> *From:* Joseph Schuchart <schuch...@icl.utk.edu>
>> *Sent:* Friday, February 11, 2022 3:02:36 PM
>> *To:* Bertini, Denis Dr.; Open MPI Users
>> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>>
>> I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work with other MPI implementations? Would be worth investigating...
>>
>> Joseph
>>
>> On 2/11/22 06:54, Bertini, Denis Dr. wrote:
>> >
>> > Hi Joseph
>> >
>> > Looking at MVAPICH I noticed that this MPI implementation provides an InfiniBand Network Analysis and Profiling Tool:
>> >
>> > OSU-INAM
>> >
>> > Is there something equivalent using openMPI?
>> >
>> > Best
>> > Denis
>> >
>> > ------------------------------------------------------------------------
>> > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Joseph Schuchart via users <users@lists.open-mpi.org>
>> > *Sent:* Tuesday, February 8, 2022 4:02:53 PM
>> > *To:* users@lists.open-mpi.org
>> > *Cc:* Joseph Schuchart
>> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>> >
>> > Hi Denis,
>> >
>> > Sorry if I missed it in your previous messages, but could you also try running a different MPI implementation (MVAPICH) to see whether Open MPI is at fault or the system is somehow to blame for it?
>> >
>> > Thanks
>> > Joseph
>> >
>> > On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>> > >
>> > > Hi
>> > >
>> > > Thanks for all this information!
>> > >
>> > > But I have to confess that in this multi-tuning-parameter space I got somehow lost. Furthermore it is sometimes mixing user space and kernel space, and I only have the possibility to act on the user space.
>> > >
>> > > 1) So I have on the system max locked memory:
>> > >    - ulimit -l unlimited (default)
>> > > and I do not see any warnings/errors related to that when launching MPI.
>> > >
>> > > 2) I tried different algorithms for the MPI_Allreduce op., all showing a drop in bw for size=16384.
>> > >
>> > > 4) I disabled openib (no RDMA) and used only TCP, and I noticed the same behaviour.
>> > >
>> > > 3) I realized that increasing the so-called warm-up parameter in the OSU benchmark (argument -x, 200 as default) reduces the discrepancy. On the contrary, setting a lower value (-x 10) can increase this BW discrepancy up to a factor of 300 at message size 16384 compared to message size 8192, for example. So does it mean that there are some caching effects in the internode communication?
>> > >
>> > > From my experience, tuning parameters is a time-consuming and cumbersome task.
>> > >
>> > > Could it also be that the problem is not really in the openMPI implementation but in the system?
>> > >
>> > > Best
>> > > Denis
>> > >
>> > > ------------------------------------------------------------------------
>> > > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Gus Correa via users <users@lists.open-mpi.org>
>> > > *Sent:* Monday, February 7, 2022 9:14:19 PM
>> > > *To:* Open MPI Users
>> > > *Cc:* Gus Correa
>> > > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>> > >
>> > > This may have changed since, but these used to be relevant points. Overall, the Open MPI FAQ has lots of good suggestions:
>> > > https://www.open-mpi.org/faq/
>> > > some specific to performance tuning:
>> > > https://www.open-mpi.org/faq/?category=tuning
>> > > https://www.open-mpi.org/faq/?category=openfabrics
>> > >
>> > > 1) Make sure you are not using the Ethernet TCP/IP, which is widely available in compute nodes:
>> > > mpirun --mca btl self,sm,openib ...
>> > > https://www.open-mpi.org/faq/?category=tuning#selecting-components
>> > > However, this may have changed lately:
>> > > https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
>> > >
>> > > 2) Maximum locked memory used by IB and its system limit. Start here:
>> > > https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
>> > >
>> > > 3) The eager vs. rendezvous message size threshold. I wonder if it may sit right where you see the latency spike.
>> > > https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
>> > >
>> > > 4) Processor and memory locality/affinity and binding (please check the current options and syntax):
>> > > https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>> > >
>> > > On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users <users@lists.open-mpi.org> wrote:
>> > >
>> > >     Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>> > >
>> > >     mpirun --verbose --display-map
>> > >
>> > >     Have you tried newer OpenMPI versions?
>> > >
>> > >     Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
>> > >
>> > >     Typically internal buffer sizes as well as your hardware will affect performance. Can you give specifications similar to what is available at:
>> > >     http://mvapich.cse.ohio-state.edu/performance/collectives/
>> > >     where the operating system, switch, node type and memory are indicated.
>> > >
>> > >     If you need good performance, you may want to also specify the algorithm used. You can find some of the parameters you can tune using:
>> > >
>> > >     ompi_info --all
>> > >
>> > >     A particularly helpful parameter is:
>> > >
>> > >     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>> > >                     Which allreduce algorithm is used. Can be locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
>> > >                     Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping", 3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
>> > >     MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data source: default, level: 5 tuner/detail, type: int)
>> > >
>> > >     For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.
>> > >
>> > >     [1] https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
>> > >     [2] https://github.com/open-mpi/ompi-collectives-tuning
>> > >
>> > >     On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
>> > > > Hi
>> > > >
>> > > > When I repeat, I always get the huge discrepancy at the message size of 16384.
>> > > >
>> > > > Maybe there is a way to run MPI in verbose mode in order to further investigate this behaviour?
>> > > >
>> > > > Best
>> > > > Denis
>> > > >
>> > > > ------------------------------------------------------------------------
>> > > > *From:* users <users-boun...@lists.open-mpi.org> on behalf of Benson Muite via users <users@lists.open-mpi.org>
>> > > > *Sent:* Monday, February 7, 2022 2:27:34 PM
>> > > > *To:* users@lists.open-mpi.org
>> > > > *Cc:* Benson Muite
>> > > > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
>> > > >
>> > > > Hi,
>> > > > Do you get similar results when you repeat the test? Another job could have interfered with your run.
>> > > > Benson
>> > > >
>> > > > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>> > > >> Hi
>> > > >>
>> > > >> I am using the OSU microbenchmarks compiled with openMPI 3.1.6 in order to check/benchmark the InfiniBand network for our cluster.
>> > > >>
>> > > >> For that I use the collective all_reduce benchmark and run over 200 nodes, using 1 process per node.
>> > > >>
>> > > >> And these are the results I obtained 😎
>> > > >>
>> > > >> ################################################################
>> > > >> # OSU MPI Allreduce Latency Test v5.7.1
>> > > >> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
>> > > >> 4                     114.65             83.22            147.98         1000
>> > > >> 8                     133.85            106.47            164.93         1000
>> > > >> 16                    116.41             87.57            150.58         1000
>> > > >> 32                    112.17             93.25            130.23         1000
>> > > >> 64                    106.85             81.93            134.74         1000
>> > > >> 128                   117.53             87.50            152.27         1000
>> > > >> 256                   143.08            115.63            173.97         1000
>> > > >> 512                   130.34            100.20            167.56         1000
>> > > >> 1024                  155.67            111.29            188.20         1000
>> > > >> 2048                  151.82            116.03            198.19         1000
>> > > >> 4096                  159.11            122.09            199.24         1000
>> > > >> 8192                  176.74            143.54            221.98         1000
>> > > >> 16384               48862.85          39270.21          54970.96         1000
>> > > >> 32768                2737.37           2614.60           2802.68         1000
>> > > >> 65536                2723.15           2585.62           2813.65         1000
>> > > >> ####################################################################
>> > > >>
>> > > >> Could someone explain to me what is happening for message size = 16384?
>> > > >> One can notice a huge latency (~300 times larger) compared to message size = 8192.
>> > > >> I do not really understand what could create such an increase in the latency.
>> > > >> The reason I use the OSU microbenchmarks is that we sporadically experience a drop in the bandwidth for typical collective operations such as MPI_Reduce in our cluster, which is difficult to understand.
>> > > >> I would be grateful if somebody could share their expertise on such a problem with me.
>> > > >>
>> > > >> Best,
>> > > >> Denis
>> > > >>
>> > > >> ---------
>> > > >> Denis Bertini
>> > > >> Abteilung: CIT
>> > > >> Ort: SB3 2.265a
>> > > >>
>> > > >> Tel: +49 6159 71 2240
>> > > >> Fax: +49 6159 71 2986
>> > > >> E-Mail: d.bert...@gsi.de
>> > > >>
>> > > >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>> > > >> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de <http://www.gsi.de>
>> > > >>
>> > > >> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
>> > > >> Managing Directors / Geschäftsführung:
>> > > >> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>> > > >> Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
>> > > >> Ministerialdirigent Dr. Volkmar Dietz
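P.S. Since the coll_tuned parameters came up in the quoted messages: if you want to pin the allreduce algorithm for a run instead of letting the tuned component decide, the command line would look roughly like the following (a sketch, not verified on your system; algorithm 4 = ring is only an example value, and the 200 ranks mapped one per node match the setup described above):

mpirun -np 200 --map-by node \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    ./osu_allreduce -x 200 -i 1000

It would also be worth comparing the eager/rendezvous switch point of the BTL you actually use (ompi_info --all | grep eager_limit) against the 16384-byte size where the latency explodes, as Gus already suggested.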