Konstantinos,

I previously suggested you use the profiling interface (aka PMPI) specified in the MPI standard.

An example is available at http://mpi-forum.org/docs/mpi-3.1/mpi31-report/node363.htm#Node363

The advantage is that you only need to rewrite MPI_Bcast() once, instead of adding some code around each MPI_Bcast() invocation.
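
A minimal sketch of what such a wrapper could look like (untested; the accumulator names are just illustrative):

    #include <mpi.h>

    static double sync_seconds  = 0.0;  /* time spent waiting in the barrier  */
    static double bcast_seconds = 0.0;  /* time spent in the broadcast itself */

    /* intercepts every MPI_Bcast() issued by the application */
    extern "C" int MPI_Bcast(void *buf, int count, MPI_Datatype type,
                             int root, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        PMPI_Barrier(comm);              /* wait until all ranks have arrived */
        double t1 = MPI_Wtime();
        int rc = PMPI_Bcast(buf, count, type, root, comm);
        double t2 = MPI_Wtime();

        sync_seconds  += t1 - t0;        /* synchronization time              */
        bcast_seconds += t2 - t1;        /* actual broadcast time             */
        return rc;
    }

You then only have to print (or reduce) the two counters at the end of the run, without touching the rest of your code.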


Your option will also work, and FWIW, I suggest you use the standard MPI_Wtime() instead of clock().
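
For example, something along these lines (just a sketch; time_bcast() is a made-up helper to show the pattern, adapt it to your loop):

    #include <mpi.h>

    /* times one broadcast of `len` unsigned chars on `comm`;
       accumulates the barrier wait into *sync_time and returns
       the time of the broadcast itself, in seconds */
    double time_bcast(MPI_Comm comm, unsigned char *buf, int len, int root,
                      double *sync_time)
    {
        double t0 = MPI_Wtime();
        MPI_Barrier(comm);               /* synchronization phase             */
        double t1 = MPI_Wtime();
        MPI_Bcast(buf, len, MPI_UNSIGNED_CHAR, root, comm);
        double t2 = MPI_Wtime();

        *sync_time += t1 - t0;
        return t2 - t1;                  /* data transmission time            */
    }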


Strictly speaking, you should time PMPI_Barrier(), take the maximum across all ranks, and then sum these maxima in order to get the total time spent in synchronization.
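
If you go the wrapper route, one way to do that (again just a sketch; timed_barrier() and total_sync_max are names I made up) is to reduce each barrier time to its maximum and accumulate it:

    #include <mpi.h>

    static double total_sync_max = 0.0;  /* sum over calls of the slowest rank's wait */

    static void timed_barrier(MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        PMPI_Barrier(comm);
        double my_wait = MPI_Wtime() - t0;

        double max_wait = 0.0;
        /* every rank learns the slowest rank's wait for this call */
        PMPI_Allreduce(&my_wait, &max_wait, 1, MPI_DOUBLE, MPI_MAX, comm);
        total_sync_max += max_wait;
    }

Keep in mind the extra allreduce perturbs the run a bit; if that matters, you can instead accumulate per rank and reduce only once at the end, which is cheaper but only approximates the sum of per-call maxima.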


Cheers,


Gilles


On 10/24/2017 4:27 AM, Konstantinos Konstantinidis wrote:
I do not completely understand whether that involves changing some MPI code. I have no prior experience with that.

But if I get the idea, something like this could potentially work (assume that comm is the communicator of the group that communicates at each iteration):

clock_t total_time = clock();
clock_t sync_time = 0;

for each transmission {
    sync_time = sync_time - clock();
    comm.Barrier();
    sync_time = sync_time + clock();

    comm.Bcast(...);
}

total_time = clock() - total_time;

// Total time
double t_time = double(total_time) / CLOCKS_PER_SEC;

// Synchronization time
double s_time = double(sync_time) / CLOCKS_PER_SEC;

// Actual data transmission time
double d_time = t_time - s_time;


I know that I have added a useless barrier call, but do you think that this can work the way I think it will and at least give some idea of the synchronization time?

Barrett, I am also working on switching to m4.large instances and will check if this helps.

Regards,
Kostas



On Mon, Oct 23, 2017 at 10:20 AM, Barrett, Brian <bbarr...@amazon.com> wrote:

    Gilles suggested your best next course of action; time the
    MPI_Bcast and MPI_Barrier calls and see if there’s a non-linear
    scaling effect as you increase group size.

    You mention that you’re using m3.large instances; while this isn’t
    the list for in-depth discussion about EC2 instances (the AWS
    Forums are better for that), I’ll note that unless you’re tied to
    m3 for organizational or reserved instance reasons, you’ll
    probably be happier on another instance type.  m3 was one of the
    last instance families released which does not support Enhanced
    Networking.  There’s significantly more jitter and latency in the
    m3 network stack compared to platforms which support Enhanced
    Networking (including the m4 platform).  If networking costs are
    causing your scaling problems, the first step will be migrating
    instance types.

    Brian

    > On Oct 23, 2017, at 4:19 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
    >
    > Konstantinos,
    >
    > A simple way is to rewrite MPI_Bcast() and insert a timer and
    > PMPI_Barrier() before invoking the real PMPI_Bcast().
    > Time spent in PMPI_Barrier() can be seen as time NOT spent on actual
    > data transmission, and since all tasks are synchronized upon exit,
    > time spent in PMPI_Bcast() can be seen as time spent on actual data
    > transmission. This is not perfect, but it is a pretty good approximation.
    > You can add extra timers so you end up with an idea of how much time
    > is spent in PMPI_Barrier() vs PMPI_Bcast().
    >
    > Cheers,
    >
    > Gilles
    >
    > On Mon, Oct 23, 2017 at 4:16 PM, Konstantinos Konstantinidis
    > <kostas1...@gmail.com> wrote:
    >> In any case, do you think that the time NOT spent on actual data
    >> transmission can impact the total time of the broadcast, especially
    >> when there are so many groups that communicate (please refer to the
    >> numbers I gave before if you want to get an idea)?
    >>
    >> Also, is there any way to quantify this impact, i.e. to measure the
    >> time not spent on actual data transmissions?
    >>
    >> Kostas
    >>
    >> On Fri, Oct 20, 2017 at 10:32 PM, Jeff Hammond <jeff.scie...@gmail.com> wrote:
    >>>
    >>> Broadcast is collective but not necessarily synchronous in the sense
    >>> you imply. If you broadcast a message size under the eager limit, the
    >>> root may return before any non-root processes enter the function. Data
    >>> transfer may happen prior to processes entering the function. Only
    >>> rendezvous forces synchronization between any two processes, but there
    >>> may still be asynchrony between different levels of the broadcast tree.
    >>>
    >>> Jeff
    >>>
    >>> On Fri, Oct 20, 2017 at 3:27 PM Konstantinos Konstantinidis
    >>> <kostas1...@gmail.com> wrote:
    >>>>
    >>>> Hi,
    >>>>
    >>>> I am running some tests on Amazon EC2 and they require a lot of
    >>>> communication among m3.large instances.
    >>>>
    >>>> I would like to give you an idea of what kind of communication
    >>>> takes place. There are 40 m3.large instances. Now, 28672 groups of
    >>>> 5 instances are formed in a specific manner (let's skip the details).
    >>>> Within each group, each instance broadcasts some unsigned char data
    >>>> to the other 4 instances in the group. So within each group, exactly
    >>>> 5 broadcasts take place.
    >>>>
    >>>> The problem is that if I increase the size of the group from 5 to 10,
    >>>> there is a significant delay in terms of transmission rate while,
    >>>> based on some theoretical results, this is not reasonable.
    >>>>
    >>>> I want to check whether one of the reasons this is happening is the
    >>>> time needed for the instances to synchronize when they call
    >>>> MPI_Bcast(), since it's a collective function. As far as I know, all
    >>>> of the machines in the broadcast need to call it and then synchronize
    >>>> until the actual data transfer starts. Is there any way to measure
    >>>> this synchronization time?
    >>>>
    >>>> The code is in C++ and the MPI installed is described in the
    >>>> attached file.
    >>>
    >>> --
    >>> Jeff Hammond
    >>> jeff.scie...@gmail.com
    >>> http://jeffhammond.github.io/
    >>
    >>
    >>





_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
