Re: [OMPI users] MPI_Bcast performance doesn't improve after enabling tree implementation

2017-10-17 Thread Gilles Gouaillardet
If you use the rsh tree spawn mechanism, then yes, every node must be able 
to SSH passwordless to every other node.

This is only used to spawn one orted daemon per node.
When the number of nodes is large, a tree spawn is faster and avoids 
having all the SSH connections issued and maintained from the node running mpirun.

Once the orted daemons have been spawned and wired up, MPI connections are 
established directly and do not involve SSH.
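
(As a rough illustration, assuming a binary spawn tree: with 128 nodes, a 
flat rsh launch means the node running mpirun opens and maintains 127 SSH 
connections, whereas the tree spawn keeps every daemon to at most two 
outgoing connections and reaches all nodes in ceil(log2(128)) = 7 hops.)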


basic_linear is the algorithm you are looking for.
Your best bet is to have a look at the source code in 
ompi/mca/coll/base/coll_base_bcast.c from Open MPI 2.0.0.
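
For illustration only, here is a minimal sketch of what a "basic linear" 
broadcast boils down to (the general idea, not the actual code in 
coll_base_bcast.c; linear_bcast is a hypothetical helper): the root simply 
posts one send per receiver, so the cost grows linearly with the 
communicator size.

#include <mpi.h>
#include <cstdio>

/* Sketch of a linear broadcast: the root sends the buffer to every
 * other rank, one after the other. */
static void linear_bcast(void *buf, int count, MPI_Datatype dtype,
                         int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (int peer = 0; peer < size; ++peer)
            if (peer != root)
                MPI_Send(buf, count, dtype, peer, 0, comm);
    } else {
        MPI_Recv(buf, count, dtype, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int value = 0, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 42;              /* root fills the buffer */
    linear_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    std::printf("rank %d got %d\n", rank, value);
    MPI_Finalize();
    return 0;
}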


Cheers,

Gilles

On 10/18/2017 5:23 AM, Konstantinos Konstantinidis wrote:

Thanks for clarifying that Gilles.

Now I have seen that omitting "-mca plm_rsh_no_tree_spawn 1" requires 
establishing passwordless SSH among the machines, but this is not 
required for setting "--mca coll_tuned_bcast_algo". Is this correct, or 
am I missing something?


Also, among all possible broadcast options (0:"ignore", 
1:"basic_linear", 2:"chain", 3:"pipeline", 4:"split_binary_tree", 
5:"binary_tree", 6:"binomial"), is there any option that behaves like an 
individual MPI_Send to each receiver, or do they all involve some 
parallel transmissions? Where can I find a more detailed description 
of these broadcast implementations?
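
(For contrast with the basic_linear answer above, here is a minimal sketch 
of a binomial-tree broadcast, the general technique behind the tree-based 
options rather than Open MPI's actual code; binomial_bcast is a 
hypothetical helper that can replace linear_bcast in the small test 
program sketched earlier. In every round, each rank that already holds the 
data forwards it to one more rank, so all k ranks are covered in about 
ceil(log2(k)) rounds, and many of the sends happen in parallel.)

#include <mpi.h>

/* Sketch of a binomial-tree broadcast.  vrank rotates the ranks so the
 * root behaves as rank 0. */
static void binomial_bcast(void *buf, int count, MPI_Datatype dtype,
                           int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;
    int mask = 1;
    while (mask < size) {                 /* receive once from the parent */
        if (vrank & mask) {
            int src = (rank - mask + size) % size;
            MPI_Recv(buf, count, dtype, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    mask >>= 1;
    while (mask > 0) {                    /* then forward to the children */
        if (vrank + mask < size) {
            int dst = (rank + mask) % size;
            MPI_Send(buf, count, dtype, dst, 0, comm);
        }
        mask >>= 1;
    }
}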


Out of curiosity, when is "-mca plm_rsh_no_tree_spawn 1" actually needed? 
I have a little MPI experience, but I don't understand the need for a 
special tree-based algorithm just to start running the MPI program on 
the machines.


Regards,
Kostas

On Tue, Oct 17, 2017 at 1:57 AM, Gilles Gouaillardet wrote:


Konstantinos,


I am afraid there is some confusion here.


The plm_rsh_no_tree_spawn option is only used at startup time (e.g. when
remote-launching one orted daemon per node, except on the node running
mpirun).

It has zero impact on the performance of MPI communications
such as MPI_Bcast().


The coll/tuned module selects the broadcast algorithm based on
communicator and message sizes.
You can manually force that with

mpirun --mca coll_tuned_use_dynamic_rules true --mca
coll_tuned_bcast_algo <algo> ./my_test

where <algo> is the algorithm number as described by ompi_info --all

    MCA coll tuned: parameter "coll_tuned_bcast_algorithm" (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
                    Which bcast algorithm is used. Can be locked down to choice of: 0 ignore, 1 basic linear, 2 chain, 3: pipeline, 4: split binary tree, 5: binary tree, 6: binomial tree.
                    Valid values: 0:"ignore", 1:"basic_linear", 2:"chain", 3:"pipeline", 4:"split_binary_tree", 5:"binary_tree", 6:"binomial"

For some specific communicator and message sizes, you might
experience better performance.
You also have the option to write your own rules (e.g. which algorithm
should be used based on communicator and message sizes) if you are
not happy with the default ones; that is done with the
coll_tuned_dynamic_rules_filename MCA option.

Note that coll/tuned does not take the topology (e.g. inter- vs intra-node
communications) into consideration when choosing the algorithm.


Cheers,

Gilles


On 10/17/2017 3:30 PM, Konstantinos Konstantinidis wrote:

I have implemented some algorithms in C++ whose performance is greatly
affected by the data shuffling time among nodes, which is done with
some broadcast calls. Up to now, I have been testing them by
running something like

mpirun -mca btl ^openib -mca plm_rsh_no_tree_spawn 1 ./my_test

which I think makes MPI_Bcast work serially. Now I want to
improve the communication time, so I have configured the
appropriate SSH access from every node to every other node and
enabled the binary tree implementation of Open MPI
collective calls by running

mpirun -mca btl ^openib ./my_test

My problem is that throughout various experiments with files
of different sizes, I realized that there is no improvement in
terms of transmission time even though theoretically I would
expect a gain of approximately (log(k))/(k-1) where k is the
size of the group that the communication takes place within.
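
(As a worked example of that estimate: for a group of k = 8 nodes, a
serial/linear broadcast needs k - 1 = 7 sequential transmissions from
the root, while a tree-based broadcast finishes in about log2(8) = 3
steps, so one would hope for roughly 3/7, i.e. about 43% of the
original broadcast time.)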

I compile the code with

mpic++ my_test.cc -o my_test

and all of the experiments are done on Amazon EC2 r3.large or
m3.large machines. I have also set different rate limits to avoid
bursty behavior of Amazon EC2's transmission rate. The Open MPI
installation I am using is described in the txt file I have attached,
which was generated by running ompi_info.

What can be wrong here?
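
For reference, a minimal sketch of the kind of timing loop such
experiments rely on (the buffer size and repetition count below are
arbitrary assumptions chosen purely for illustration):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;              /* 1 Mi ints, ~4 MB per bcast */
    const int reps  = 50;
    std::vector<int> buf(count, rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all ranks together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i)
        MPI_Bcast(buf.data(), count, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);            /* wait for the slowest rank */
    double t1 = MPI_Wtime();

    if (rank == 0)
        std::printf("average MPI_Bcast time: %g s\n", (t1 - t0) / reps);

    MPI_Finalize();
    return 0;
}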

