If you use the rsh tree spawn mechanism, then yes, any node must be able to SSH to any other node without a password.
This is only used to spawn one orted daemon per node.
When the number of nodes is large, a tree spawn is faster and avoids having all the SSH connections issued and maintained from the node running mpirun.

After the orted daemons have been spawned and wired up, MPI connections are established directly and do not involve SSH.
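
For reference, the two launch modes with the options from your tests look like this (only the spawn mechanism differs, the MPI job itself is the same):

# tree spawn disabled: the node running mpirun opens an SSH connection to every other node itself
mpirun -mca btl ^openib -mca plm_rsh_no_tree_spawn 1 ./my_test

# tree spawn (default): already-spawned orted daemons SSH to further nodes,
# which is why any node may need passwordless SSH to any other node
mpirun -mca btl ^openib ./my_test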

basic_linear is the algorithm you are looking for.
Your best bet is to have a look at the source code in ompi/mca/coll/base/coll_base_bcast.c from Open MPI 2.0.0.
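
In case it helps, basic_linear boils down to something like the following simplified sketch (this is only an illustration of the pattern, not the actual Open MPI code; the real implementation lives in coll_base_bcast.c):

/* Simplified sketch of a basic linear broadcast: the root posts one
 * MPI_Send per peer, every other rank posts a single MPI_Recv. */
#include <mpi.h>

static int linear_bcast(void *buf, int count, MPI_Datatype dtype,
                        int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != root) {
        /* non-root ranks simply wait for the data from the root */
        return MPI_Recv(buf, count, dtype, root, 0, comm, MPI_STATUS_IGNORE);
    }
    /* the root sends the whole buffer to every other rank, one after the other */
    for (int peer = 0; peer < size; peer++) {
        if (peer == root) {
            continue;
        }
        int rc = MPI_Send(buf, count, dtype, peer, 0, comm);
        if (MPI_SUCCESS != rc) {
            return rc;
        }
    }
    return MPI_SUCCESS;
}

All the other choices (chain, pipeline, split binary tree, binary tree, binomial) have intermediate ranks forward the data, so their transmissions can overlap.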

Cheers,

Gilles

On 10/18/2017 5:23 AM, Konstantinos Konstantinidis wrote:
Thanks for clarifying that Gilles.

Now I have seen that omitting "-mca plm_rsh_no_tree_spawn 1" requires establishing passwordless SSH among the machines, but this is not required for setting "--mca coll_tuned_bcast_algorithm". Is this correct, or am I missing something?

Also, among all possible broadcast options (0:"ignore", 1:"basic_linear", 2:"chain", 3:"pipeline", 4:"split_binary_tree", 5:"binary_tree", 6:"binomial"), is there any option that behaves like an individual MPI_Send to each receiver, or do they all involve some parallel transmissions? Where can I find a more detailed description of these broadcast implementations?

Out of curiosity, when is "-mca plm_rsh_no_tree_spawn 1" needed? I have a little MPI experience, but I don't understand the need for a special tree-based algorithm just to start running the MPI program on the machines.

Regards,
Kostas

On Tue, Oct 17, 2017 at 1:57 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Konstantinos,


    I am afraid there is some confusion here.


    The plm_rsh_no_tree_spawn option is only used at startup time (i.e. when
    remote-launching one orted daemon per node, except on the node running
    mpirun).

    It has zero impact on the performance of MPI communications
    such as MPI_Bcast().


    The coll/tuned module selects the broadcast algorithm based on
    communicator and message sizes.
    You can manually force it with

    mpirun --mca coll_tuned_use_dynamic_rules true --mca coll_tuned_bcast_algorithm <algo> ./my_test

    where <algo> is the algorithm number, as described by ompi_info --all:

             MCA coll tuned: parameter "coll_tuned_bcast_algorithm"
    (current value: "ignore", data source: default, level: 5
    tuner/detail, type: int)
                              Which bcast algorithm is used. Can be
    locked down to choice of: 0 ignore, 1 basic linear, 2 chain, 3:
    pipeline, 4: split binary tree, 5: binary tree, 6: binomial tree.
                              Valid values: 0:"ignore",
    1:"basic_linear", 2:"chain", 3:"pipeline", 4:"split_binary_tree",
    5:"binary_tree", 6:"binomial"
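
    For example, forcing the binomial tree algorithm (6) on your test program
    would look like

    mpirun --mca coll_tuned_use_dynamic_rules true --mca coll_tuned_bcast_algorithm 6 ./my_test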

    For some specific communicator and message sizes, you might see
    better performance.
    You also have the option to write your own rules (i.e. which algorithm
    should be used for which communicator and message sizes) if you are
    not happy with the default ones
    (that would be done via the coll_tuned_dynamic_rules_filename MCA option).

    Note that coll/tuned does not take the topology (e.g. inter- vs. intra-node
    communications) into consideration when choosing the algorithm.


    Cheers,

    Gilles


    On 10/17/2017 3:30 PM, Konstantinos Konstantinidis wrote:

        I have implemented some algorithms in C++ which are greatly
        affected by shuffling time among nodes which is done by some
        broadcast calls. Up to now, I have been testing them by
        running something like

        mpirun -mca btl ^openib -mca plm_rsh_no_tree_spawn 1 ./my_test

        which I think makes MPI_Bcast work serially. Now, I want to
        improve the communication time, so I have configured the
        appropriate SSH access from every node to every other node and
        I have enabled the binary tree implementation of Open MPI
        collective calls by running

        mpirun -mca btl ^openib ./my_test

        My problem is that throughout various experiments with files
        of different sizes, I realized that there is no improvement in
        transmission time, even though in theory I would expect it to
        drop to roughly (log(k))/(k-1) of the linear case, where k is
        the size of the group within which the communication takes place.
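
        (For example, with a group of k = 8 ranks that ratio would be
        log2(8)/(8-1) = 3/7, i.e. roughly a 2.3x speedup over the basic
        linear broadcast.)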

        I compile the code with

        mpic++ my_test.cc -o my_test

        and all of the experiments are done on Amazon EC2 r3.large or
        m3.large machines. I have also set different rate limits to
        avoid bursty behavior of Amazon EC2's transmission rate. The
        Open MPI installation I am using is described in the attached
        txt file (the output of ompi_info).

        What can be wrong here?

