Re: [OMPI users] Parallel MPI broadcasts (parameterized)

George Bosilca Mon, 06 Nov 2017 07:04:13 -0800

On Sun, Nov 5, 2017 at 10:23 PM, Konstantinos Konstantinidis <
kostas1...@gmail.com> wrote:


> Hi George,
>
> First, let me note that the cost of q^(k-1)]*(q-1) communicators was fine
> for the values of parameters q,k I am working with. Also, the whole point
> of speeding up the shuffling phase is trying to reduce this number even
> more (compared to already known implementations) which is a major concern
> of my project. But thanks for pointing that out. Btw, do you know what is
> the maximum such number in MPI?
>

Last time I run into such troubles these limits were: 2k for MVAPICH, 16k
for MPICH and 2^30-1 for OMPI (all positive signed 23 bits integers). It
might have changed meanwhile.


> Now to the main part of the question, let me clarify that I have 1 process
> per machine. I don't know if this is important here but my way of thinking
> is that we have a big text file and each process will have to work on some
> chunks of it (like chapters of a book). But each process resides in an
> machine with some RAM which is able to handle a specific amount of work so
> if you generate many processes per machine you must have fewer book
> chapters per process than before. Thus, I wanted to avoid thinking in the
> process-level rather than machine-level with the RAM limitations.
>
> Now to the actual shuffling, here is what I am currently doing (Option 1):
>
> Let's denote the data that slave s has to send to the slaves in group G as
> D(s,G).
>
> *for each slave s in 1,2,...,K{*
>
> *    for each group G that s participates into{*
>
> *        if (my rank is s){*
> *            MPI_Bcast(send data D(s,G))*
> *        }else if(my rank is in group G)*
> *            MPI_Bcast(get data D(s,G))*
> *        }else{*
> *           Do nothing*
> *        }*
>
> *    }*
>
>     *MPI::COMM_WORLD.Barrier();*
>
> *}*
>
> What I suggested before to speedup things (Option 2) is:
>
> *for each set {G(1),G(2),...,G(q-1)} of q-1 disjoint groups{ *
>
> *    for each slave s in G(1)*
> *        if (my rank is s){*
> *            MPI_Bcast(send data D(s,G(1)))*
> *        }else if(**my rank is in** group G(1))*
> *            MPI_Bcast(get data D(s,G(1)))*
> *        }else{*
> *           Do nothing*
> *        }*
> *    }*
>
> *    for each slave s in G(2)*
> *        if (my rank is s){*
> *            MPI_Bcast(send data D(s,G(2)))*
> *        }else if(**my rank is in** G(2))*
> *            MPI_Bcast(get data D(s,G(2)))*
> *        }else{*
> *           Do nothing*
> *        }*
> *    }*
>
> *    ...*
>
>     *for each slave s in G(q-1)*
> *        if (my rank is s){*
> *            MPI_Bcast(send data D(s,G(q-1)))*
> *        }else if(**my rank is in** G(q-1))*
> *            MPI_Bcast(get data D(s,G(q-1)))*
> *        }else{*
> *           Do nothing*
> *        }*
> *    }*
>
> *    MPI::COMM_WORLD.Barrier();*
>
> *}*
>
> My hope was that I could implement Option 2 (in some way without copying
> and pasting the same code q-1 times every time I change q) and that this
> could bring a speedup of q-1 compared to Option 1 by having these groups
> communicate in parallel. Right, now I am trying to find a way to identify
> these sets of groups based on my implementation, which involves some
> abstract algebra but for now let's assume that I can find them in an
> efficient manner.
>
> Let me emphasize that each broadcast sends different actual data. There
> are no two broadcasts that send the same D(s,G).
>
> Finally, let's go to MPI_Allgather(): I am really confused since I have
> never used this call but I have this image in my mind:
>
>
>
If every member of a group does a bcast to all other members of the same
group, then this operation is better realized by an allgather. The picture
you attached clearly expose the data movement pattern where each color box
gets distributed to all members of the same communicator. You could also
see this operation as a loop of bcast where the iterator goes over all
members of the communicator and use it as a root.


> 
> I am not sure what you meant but now I am thinking of this (let commG be
> the intra-communicator of group G):
>
> *for each possible group G{*
>
>     *if (my rank is in G){*
> *        commG.MPI_AllGather(**send data D(rank,G)**)*
> *    }**else{*
> *        Do nothing*
> *    }*
>
>     *MPI::COMM_WORLD.Barrier();*
>
> *}*
>

This is indeed what I was thinking about, with the condition that you make
sure the list of communicators in G is ordered in the same way on all
processes.

That being said, this communication pattern 1) generated a large barrier in
your code; 2) as all processes will potentially be involved in many
collective communications you will be hammering the network in a
significant way (so you will have to take into account the network
congestion); and 3) all processes need to have all memory for receive
allocated for the buffers. Thus, even be implementing a nice communication
scheme you might encounter some performance issues.

Another way to do this is to instead of conglomerating all communications
in a single temporal location you spread them out across time by imposing
your own communication logic. This basically translate a set of blocking
collective (bcast is a perfect target) into a pipelined mix. Instead of
describing such a scheme here I suggest you read the algorithmic
description of the SUMMA and/or PUMMA distributed matrix multiplication.

  George.


I am not sure whether this makes sense since I am confused about the
> correspodence of the data transmitted with Allgather() compared to the
> notation D(s,G) I am currently using.
>
> Thanks.
>
>
> On Tue, Oct 31, 2017 at 11:11 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> It really depends what are you trying to achieve. If the question is
>> rhetorical: "can I write a code that does in parallel broadcasts on
>> independent groups of processes ?" then the answer is yes, this is
>> certainly possible. If however you add a hint of practicality in your
>> question "can I write an efficient parallel broadcast between independent
>> groups of processes?" then I'm afraid the answer will be a negative one.
>>
>> Let's not look at how you can write the multiple bcast code as the answer
>> in the stackoverflow is correct, but instead look at what resources these
>> collective operations are using. In general you can assume that nodes are
>> connected by a network, able to move data at a rate B in both directions
>> (full duplex). Assuming the implementation of the bcast algorithm is not
>> entirely moronic, the bcast can saturate the network with a single process
>> per node. Now, if you have multiple processes per node (P) then either you
>> schedule them sequentially (so that each one has the full bandwidth B) or
>> you let them progress in parallel in which case each participating process
>> can claim a lower bandwidth B/P (as it is shared between all processes on
>> the nore).
>>
>> So even if you are able to expose enough parallelism, physical resources
>> will impose the real hard limit.
>>
>> That being said I have the impression you are trying to implement an
>> MPI_Allgather(v) using a series of MPI_Bcast. Is that true ?
>>
>>   George.
>>
>> PS: Few other constraints: the cost of creating the q^(k-1)]*(q-1)
>> communicator might be prohibitive; the MPI library might support a limited
>> number of communicators.
>>
>>
>> On Tue, Oct 31, 2017 at 11:42 PM, Konstantinos Konstantinidis <
>> kostas1...@gmail.com> wrote:
>>
>>> Assume that we have K=q*k nodes (slaves) where q,k are positive integers
>>> >= 2.
>>>
>>> Based on the scheme that I am currently using I create [q^(k-1)]*(q-1)
>>> groups (along with their communicators). Each group consists of k nodes and
>>> within each group exactly k broadcasts take place (each node broadcasts
>>> something to the rest of them). So in total [q^(k-1)]*(q-1)*k MPI
>>> broadcasts take place. Let me skip the details of the above scheme.
>>>
>>> Now theoretically I figured out that there are q-1 groups that can
>>> communicate in parallel at the same time i.e. groups that have no common
>>> nodes and I would like to utilize that to speedup the shuffling. I have
>>> seen here https://stackoverflow.com/questions/11372012/mpi-severa
>>> l-broadcast-at-the-same-time that this is possible in MPI.
>>>
>>> In my case it's more complicated since q,k are parameters of the problem
>>> and change between different experiments. If I get the idea about the 2nd
>>> method that is proposed there and assume that we have only 3 groups within
>>> which some communication takes places one can simply do:
>>>
>>> *if my rank belongs to group 1{*
>>> *    comm1.Bcast(..., ..., ..., rootId);*
>>> *}else if my rank belongs to group 2{*
>>> *    comm2.Bcast(..., ..., ..., rootId);*
>>> *}else if my rank belongs to group3{*
>>> *    comm3.Bcast(..., ..., ..., rootId);*
>>> *} *
>>>
>>> where comm1, comm2, comm3 are the corresponding sub-communicators that
>>> contain only the members of each group.
>>>
>>> But how can I generalize the above idea to arbitrary number of groups or
>>> perhaps do something else?
>>>
>>> The code is in C++ and the MPI installed is described in the attached
>>> file.
>>>
>>> Regards,
>>> Kostas
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>>
>>
>>
>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Parallel MPI broadcasts (parameterized)

Reply via email to