I do not completely understand whether that involves changing the MPI library code itself; I have no prior experience with that.
But if I get the idea correctly, something like this could potentially work (assume that comm is the communicator of the group that communicates at each iteration):

    clock_t total_time = clock();
    clock_t sync_time  = 0;

    for (int i = 0; i < num_transmissions; i++) {  // num_transmissions: placeholder loop bound
        sync_time -= clock();
        comm.Barrier();                            // wait until the whole group arrives
        sync_time += clock();                      // accumulates time spent in the barrier
        comm.Bcast(/* ...actual broadcast arguments... */);
    }
    total_time = clock() - total_time;

    // Total time
    double t_time = double(total_time) / CLOCKS_PER_SEC;
    // Synchronization time
    double s_time = double(sync_time) / CLOCKS_PER_SEC;
    // Actual data transmission time
    double d_time = t_time - s_time;

I know that I have added an otherwise useless barrier call, but do you think this can work the way I expect and at least give some idea of the synchronization time? (A wall-clock variant of this loop appears after the quoted thread below.)

Brian, I am also working on switching to m4.large instances and will check whether this helps.

Regards,
Kostas

On Mon, Oct 23, 2017 at 10:20 AM, Barrett, Brian <bbarr...@amazon.com> wrote:
> Gilles suggested your best next course of action: time the MPI_Bcast and
> MPI_Barrier calls and see if there is a non-linear scaling effect as you
> increase the group size.
>
> You mention that you are using m3.large instances; while this is not the
> list for in-depth discussion of EC2 instances (the AWS Forums are better
> for that), I will note that unless you are tied to m3 for organizational
> or reserved-instance reasons, you will probably be happier on another
> instance type. m3 was one of the last instance families released that
> does not support Enhanced Networking. There is significantly more jitter
> and latency in the m3 network stack compared to platforms that support
> Enhanced Networking (including the m4 platform). If networking costs are
> causing your scaling problems, the first step will be migrating instance
> types.
>
> Brian
>
> > On Oct 23, 2017, at 4:19 AM, Gilles Gouaillardet
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> > Konstantinos,
> >
> > A simple way is to rewrite MPI_Bcast() and insert a timer and a
> > PMPI_Barrier() call before invoking the real PMPI_Bcast() (a minimal
> > sketch of this interposition appears after the thread). Time spent in
> > PMPI_Barrier() can be seen as time NOT spent on actual data
> > transmission, and since all tasks are synchronized upon exit, time
> > spent in PMPI_Bcast() can be seen as time spent on actual data
> > transmission. This is not perfect, but it is a pretty good
> > approximation. You can add extra timers so you end up with an idea of
> > how much time is spent in PMPI_Barrier() vs PMPI_Bcast().
> >
> > Cheers,
> >
> > Gilles
> >
> > On Mon, Oct 23, 2017 at 4:16 PM, Konstantinos Konstantinidis
> > <kostas1...@gmail.com> wrote:
> >> In any case, do you think that the time NOT spent on actual data
> >> transmission can impact the total time of the broadcast, especially
> >> when there are so many groups that communicate (please refer to the
> >> numbers I gave before if you want to get an idea)?
> >>
> >> Also, is there any way to quantify this impact, i.e. to measure the
> >> time not spent on actual data transmission?
> >>
> >> Kostas
> >>
> >> On Fri, Oct 20, 2017 at 10:32 PM, Jeff Hammond <jeff.scie...@gmail.com>
> >> wrote:
> >>>
> >>> Broadcast is collective but not necessarily synchronous in the sense
> >>> you imply. If you broadcast a message whose size is under the eager
> >>> limit, the root may return before any non-root processes enter the
> >>> function. Data transfer may happen prior to processes entering the
> >>> function. Only rendezvous forces synchronization between any two
> >>> processes, but there may still be asynchrony between different levels
> >>> of the broadcast tree. (A small program illustrating this appears
> >>> after the thread.)
> >>>
> >>> Jeff
> >>>
> >>> On Fri, Oct 20, 2017 at 3:27 PM Konstantinos Konstantinidis
> >>> <kostas1...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am running some tests on Amazon EC2, and they require a lot of
> >>>> communication among m3.large instances.
> >>>>
> >>>> I would like to give you an idea of what kind of communication takes
> >>>> place. There are 40 m3.large instances. Now, 28672 groups of 5
> >>>> instances are formed in a specific manner (let's skip the details).
> >>>> Within each group, each instance broadcasts some unsigned char data
> >>>> to the other 4 instances in the group. So within each group, exactly
> >>>> 5 broadcasts take place.
> >>>>
> >>>> The problem is that if I increase the size of the group from 5 to 10,
> >>>> there is a significant drop in the transmission rate, which, based on
> >>>> some theoretical results, is not reasonable.
> >>>>
> >>>> I want to check whether one of the reasons this happens is the time
> >>>> needed for the instances to synchronize when they call MPI_Bcast(),
> >>>> since it is a collective function. As far as I know, all of the
> >>>> machines in the broadcast need to call it and then synchronize until
> >>>> the actual data transfer starts. Is there any way to measure this
> >>>> synchronization time?
> >>>>
> >>>> The code is in C++, and the MPI installation is described in the
> >>>> attached file.
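P.S. One caveat with my sketch above: clock() measures per-process CPU time, so any time the MPI library spends blocked (rather than spinning) inside the barrier may be undercounted. MPI_Wtime() returns wall-clock seconds and is the usual choice for timing MPI calls. Here is a minimal sketch of the same loop using MPI_Wtime() and the C API, where buf, count, root, and num_transmissions are hypothetical placeholders for the real broadcast arguments and loop bound:

    #include <mpi.h>
    #include <stdio.h>

    // buf, count, root, num_transmissions: placeholders for the real arguments.
    void timed_bcasts(MPI_Comm comm, unsigned char *buf, int count, int root,
                      int num_transmissions)
    {
        double total_time = MPI_Wtime();
        double sync_time  = 0.0;

        for (int i = 0; i < num_transmissions; i++) {
            double t = MPI_Wtime();
            MPI_Barrier(comm);                    // wait for the whole group
            sync_time += MPI_Wtime() - t;         // accumulate waiting time
            MPI_Bcast(buf, count, MPI_UNSIGNED_CHAR, root, comm);
        }

        double t_time = MPI_Wtime() - total_time; // total time
        double s_time = sync_time;                // synchronization time
        double d_time = t_time - s_time;          // approx. transmission time
        printf("total %.6f s, sync %.6f s, data %.6f s\n", t_time, s_time, d_time);
    }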
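And a minimal sketch, under the same caveats, of the PMPI interposition Gilles describes above. Defining MPI_Bcast() in the application (or in an object file linked before the MPI library) shadows the library's version, and the wrapper forwards to the real implementation through the standard PMPI_ profiling entry points; in Open MPI the C++ bindings call down into the C functions, so comm.Bcast() should go through the wrapper as well. The two accumulators are hypothetical names; this is meant to be compiled as C (or kept under extern "C" in C++):

    #include <mpi.h>

    /* Accumulated over all broadcasts; print (or reduce) them at the end of the run. */
    static double barrier_time = 0.0;   /* ~ synchronization time */
    static double bcast_time   = 0.0;   /* ~ actual transmission time */

    int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
                  int root, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        PMPI_Barrier(comm);             /* time NOT spent on data transmission */
        double t1 = MPI_Wtime();
        int rc = PMPI_Bcast(buffer, count, datatype, root, comm);
        double t2 = MPI_Wtime();

        barrier_time += t1 - t0;
        bcast_time   += t2 - t1;
        return rc;
    }

With this linked in, the existing code measures itself without modification; just before MPI_Finalize(), each rank can report its barrier_time and bcast_time.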
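Finally, a small self-contained program illustrating Jeff's point about the eager limit. For a message small enough to be sent eagerly (the threshold is implementation-dependent; in Open MPI it is a per-BTL MCA parameter such as btl_tcp_eager_limit), the root may leave MPI_Bcast() almost immediately even though the deliberately delayed non-root ranks have not yet entered it. The 64-byte buffer and 2-second delay are arbitrary values chosen for the demonstration:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        unsigned char buf[64] = {0};    /* small: likely under the eager limit */
        if (rank != 0)
            sleep(2);                   /* non-roots arrive late on purpose */

        double t = MPI_Wtime();
        MPI_Bcast(buf, 64, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
        printf("rank %d spent %.3f s in MPI_Bcast\n", rank, MPI_Wtime() - t);

        MPI_Finalize();
        return 0;
    }

If the broadcast is eager, rank 0 typically reports near-zero time despite the other ranks sleeping; repeating the experiment with a buffer well above the eager limit should show rank 0 waiting on the rendezvous instead.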
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users