Hi
Thanks for all this information!
But I have to confess that I got somewhat lost in this multi-parameter tuning space.
Furthermore, the advice sometimes mixes user space and kernel space, and I only
have the possibility to act on the user space.
1) The max locked memory on the system is already unlimited
(ulimit -l unlimited is the default), and I do not see any warnings/errors
related to that when launching MPI.
2) I tried different algorithms for the MPI_Allreduce operation; all of them
show the drop in bandwidth at size = 16384.
3) I disabled openib (no RDMA) and used only TCP, and I noticed the same
behaviour.
4) I realized that increasing the warm-up parameter of the OSU benchmark
(the -x argument, 200 by default) reduces the discrepancy. On the contrary,
a lower value (-x 10) can increase the bandwidth discrepancy at message size
16384 by up to a factor of 300 compared to, for example, message size 8192.
(A sketch of the commands used is given below.)
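For reference, the checks above were done roughly along these lines (only a
sketch; the process count, mapping and benchmark path are placeholders for
our setup):

  # locked-memory limit as seen by the launched processes
  mpirun --map-by node -np 2 bash -c 'ulimit -l'

  # TCP only, openib (RDMA) disabled
  mpirun --mca btl self,tcp --map-by node -np 200 ./osu_allreduce

  # vary the OSU warm-up iterations
  mpirun --map-by node -np 200 ./osu_allreduce -x 10
  mpirun --map-by node -np 200 ./osu_allreduce -x 200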
So does this mean that there are some caching effects
in the inter-node communication?
From my experience, tuning these parameters is a time-consuming and cumbersome
task.
Could it also be that the problem is not really in the Open MPI implementation
but in the system?
Best
Denis
------------------------------------------------------------------------
*From:* users <users-boun...@lists.open-mpi.org> on behalf of Gus
Correa via users <users@lists.open-mpi.org>
*Sent:* Monday, February 7, 2022 9:14:19 PM
*To:* Open MPI Users
*Cc:* Gus Correa
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking
Infiniband network
This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ has lots of good suggestions:
https://www.open-mpi.org/faq/
some specific for performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics
1) Make sure you are not using Ethernet TCP/IP, which is widely
available on compute nodes:
mpirun --mca btl self,sm,openib ...
https://www.open-mpi.org/faq/?category=tuning#selecting-components
However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
2) Maximum locked memory used by IB and its system limit. Start
here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
3) The eager vs. rendezvous message size threshold. I wonder if it may
sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
4) Processor and memory locality/affinity and binding (please check
the current options and syntax)
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
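For points 3) and 4), a possible way to check and experiment from the command
line (only a sketch; the exact parameter names, defaults and option syntax
depend on the Open MPI version and the BTL in use):

  # show the eager/rendezvous threshold of the openib BTL
  ompi_info --param btl openib --level 9 | grep eager_limit

  # try a larger eager limit together with explicit binding
  mpirun --mca btl_openib_eager_limit 65536 --bind-to core --map-by node ...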
On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
<users@lists.open-mpi.org> wrote:
Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php, you could try:
mpirun --verbose --display-map
Have you tried newer OpenMPI versions?
Do you get similar behavior for the osu_reduce and osu_gather
benchmarks?
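For example (a sketch; process count, mapping and benchmark paths are
placeholders):

  mpirun --map-by node -np 200 ./osu_reduce
  mpirun --map-by node -np 200 ./osu_gather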
Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is
available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are
indicated.
If you need good performance, you may also want to specify the algorithm
used. You can find some of the parameters you can tune using:
ompi_info --all
A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm"
    (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
    Which allreduce algorithm is used. Can be locked down to any of:
    0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast),
    3 recursive doubling, 4 ring, 5 segmented ring
    Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping",
    3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize"
    (current value: "0", data source: default, level: 5 tuner/detail, type: int)
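As a hedged example, forcing one of these algorithms from the command line
could look like this (the dynamic-rules flag is needed for the forced value
to take effect; the algorithm number follows the list above, and the
benchmark path is a placeholder):

  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allreduce_algorithm 4 \
         --map-by node -np 200 ./osu_allreduce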
For OpenMPI 4.0, there is a tuning program [2] that might also be
helpful.
[1]
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning
On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When I repeat, I always get the huge discrepancy at message size 16384.
>
> Maybe there is a way to run MPI in verbose mode in order to further
> investigate this behaviour?
>
> Best
>
> Denis
>
>
> ------------------------------------------------------------------------
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Benson
> Muite via users <users@lists.open-mpi.org>
> *Sent:* Monday, February 7, 2022 2:27:34 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Benson Muite
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
> Hi,
> Do you get similar results when you repeat the test? Another job could
> have interfered with your run.
> Benson
> On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>> Hi
>>
>> I am using the OSU microbenchmarks compiled with Open MPI 3.1.6 in order
>> to check/benchmark the InfiniBand network for our cluster.
>>
>> For that I use the collective all_reduce benchmark and run over 200
>> nodes, using 1 process per node.
>>
>> And these are the results I obtained:
>>
>>
>>
>> ################################################################
>>
>> # OSU MPI Allreduce Latency Test v5.7.1
>> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
>> 4                     114.65             83.22            147.98         1000
>> 8                     133.85            106.47            164.93         1000
>> 16                    116.41             87.57            150.58         1000
>> 32                    112.17             93.25            130.23         1000
>> 64                    106.85             81.93            134.74         1000
>> 128                   117.53             87.50            152.27         1000
>> 256                   143.08            115.63            173.97         1000
>> 512                   130.34            100.20            167.56         1000
>> 1024                  155.67            111.29            188.20         1000
>> 2048                  151.82            116.03            198.19         1000
>> 4096                  159.11            122.09            199.24         1000
>> 8192                  176.74            143.54            221.98         1000
>> 16384               48862.85          39270.21          54970.96         1000
>> 32768                2737.37           2614.60           2802.68         1000
>> 65536                2723.15           2585.62           2813.65         1000
>>
>> ####################################################################
>>
>> Could someone explain to me what is happening at message size = 16384?
>> One can notice a huge latency (~300 times larger) compared to message
>> size = 8192. I do not really understand what could create such an
>> increase in the latency.
>> The reason I use the OSU microbenchmarks is that we sporadically
>> experience a drop in the bandwidth for typical collective operations
>> such as MPI_Reduce on our cluster, which is difficult to understand.
>> I would be grateful if somebody could share their expertise on such a
>> problem with me.
>>
>> Best,
>> Denis
>>
>>
>>
>> ---------
>> Denis Bertini
>> Abteilung: CIT
>> Ort: SB3 2.265a
>>
>> Tel: +49 6159 71 2240
>> Fax: +49 6159 71 2986
>> E-Mail: d.bert...@gsi.de
>>
>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
>>
>> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
>> Managing Directors / Geschäftsführung:
>> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>> Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
>> Ministerialdirigent Dr. Volkmar Dietz
>>
>