On Dec 20, 2005, at 3:19 AM, Carsten Kutzner wrote:
I don't see how you deduce that adding barriers increases the
congestion? It increases the latency for the all-to-all, but for me ...
When I do an all-to-all a lot of times, I see that the time for a
single all-to-all varies a lot. My time measurement:
do 100 times
{
    MPI_Barrier
    start <- MPI_Wtime
    ALLTOALL
    MPI_Barrier
    end <- MPI_Wtime      /* record (end - start) for this run */
}
This way of computing the time for collective operations is not
considered the best approach. Even for point-to-point communications,
if you time them like that you will find a huge standard deviation.
Way too many things are involved in any communication, and they
usually have a big effect on the duration. For collectives the effect
of this approach on the standard deviation is even more drastic. A
better way is to split the loop into two loops:
do 10 times (index i)
{
    MPI_Barrier
    start <- MPI_Wtime
    do 10 times
    {
        ALLTOALL
    }
    end <- MPI_Wtime
    time[i] = (end - start) / 10    /* average over the inner 10 calls */
    MPI_Barrier
}
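In C with MPI this could look roughly as follows; run_alltoall() and
the loop counts are placeholders standing in for whatever all-to-all
variant is being tested, not part of the original code:

    #include <mpi.h>

    #define NOUTER 10
    #define NINNER 10

    void run_alltoall(MPI_Comm comm);   /* the all-to-all under test (placeholder) */

    /* Fills times[] with one averaged measurement per outer iteration. */
    void time_alltoall(MPI_Comm comm, double times[NOUTER])
    {
        for (int i = 0; i < NOUTER; i++) {
            MPI_Barrier(comm);                  /* line the processes up */
            double start = MPI_Wtime();
            for (int j = 0; j < NINNER; j++)
                run_alltoall(comm);
            double end = MPI_Wtime();
            times[i] = (end - start) / NINNER;  /* average per all-to-all */
            MPI_Barrier(comm);
        }
    }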
You will get results that make more sense. There is another problem
with your code. If we look at how the MPI standard defines
MPI_Barrier, we can see that the only requirement is that all nodes
belonging to the same communicator reach the barrier. It does not
mean they leave the barrier at the same time! That depends on how the
barrier is implemented. If it uses a linear approach (node 0 gets a
message from everybody else and then sends a message to everybody
else), it is clear that node 0 is the most likely to get out of the
barrier last. Therefore, when it reaches the next ALLTOALL, the
messages from the other nodes will already be there, as they are all
in the all-to-all by then. Now, since node 0 reaches the all-to-all
later, imagine the effect this has on the communications between the
other nodes. If it is late enough, there will be congestion, as all
the others will be waiting for a sendrecv with node 0.
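For illustration only (this is a sketch of the linear scheme just
described, not Open MPI's actual MPI_Barrier implementation), such a
barrier could look like this:

    #include <mpi.h>

    /* Linear barrier: rank 0 collects an empty message from every other
       rank, then releases them one by one. Rank 0 can only leave after
       its last send, so it tends to exit the barrier last. */
    void linear_barrier(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == 0) {
            for (int r = 1; r < size; r++)
                MPI_Recv(NULL, 0, MPI_BYTE, r, 0, comm, MPI_STATUS_IGNORE);
            for (int r = 1; r < size; r++)
                MPI_Send(NULL, 0, MPI_BYTE, r, 0, comm);
        } else {
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, comm);
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
        }
    }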
There are other approaches to performance measurement, but they are
more complex. The one described above gives correct results with a
fairly simple algorithm. What people usually do when measuring
performance is, after filling up the array with the individual
results and before computing the mean time, remove the best and the
worst result (the two extrema). These can be considered anomalies. If
there are several "worst" results they will still show up in the
standard deviation, since you remove only one.
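As a sketch (the helper name is made up here, and t[] is assumed to
hold the n individual timings), that post-processing could be:

    #include <math.h>

    /* Drop the single best and single worst timing, then return the mean;
       *stddev receives the standard deviation of the remaining samples. */
    double trimmed_mean(const double t[], int n, double *stddev)
    {
        int imin = 0, imax = 0;
        for (int i = 1; i < n; i++) {
            if (t[i] < t[imin]) imin = i;
            if (t[i] > t[imax]) imax = i;
        }
        double sum = 0.0, sumsq = 0.0;
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (i == imin || i == imax) continue;   /* skip the extrema */
            sum += t[i];
            sumsq += t[i] * t[i];
            kept++;
        }
        double mean = sum / kept;
        double var = sumsq / kept - mean * mean;
        *stddev = sqrt(var > 0.0 ? var : 0.0);
        return mean;
    }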
george.
For the ring-sendrecv all-to-all I get something like
...
sending 131072 bytes to 32 processes took ... 0.06433 seconds
sending 131072 bytes to 32 processes took ... 0.06866 seconds
sending 131072 bytes to 32 processes took ... 0.06233 seconds
sending 131072 bytes to 32 processes took ... 0.26683 seconds (*)
sending 131072 bytes to 32 processes took ... 0.06353 seconds
sending 131072 bytes to 32 processes took ... 0.06470 seconds
sending 131072 bytes to 32 processes took ... 0.06483 seconds
Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.068903 (0.028432) seconds. Min: 0.061708 max: 0.266832
The typical time my all-to-all takes is around 0.065 seconds, while
sometimes (*) it takes 0.2+ seconds longer. This I interpret as
congestion.
It can be congestion ...
When I add a barrier after the MPI_Sendrecv inside the alltoall, I get
many more of these congestion events:
...
sending 131072 bytes to 32 processes took ... 0.11023 seconds
sending 131072 bytes to 32 processes took ... 0.48874 seconds
sending 131072 bytes to 32 processes took ... 0.27856 seconds
sending 131072 bytes to 32 processes took ... 0.27711 seconds
sending 131072 bytes to 32 processes took ... 0.31615 seconds
sending 131072 bytes to 32 processes took ... 0.07439 seconds
sending 131072 bytes to 32 processes took ... 0.07440 seconds
sending 131072 bytes to 32 processes took ... 0.07490 seconds
sending 131072 bytes to 32 processes took ... 0.27524 seconds
sending 131072 bytes to 32 processes took ... 0.07464 seconds
Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.250027 (0.158686) seconds. Min: 0.072322 max: 0.970822
Indeed, the barrier has increased the all-to-all time from 0.065 to
0.075 seconds, but the more severe problem is the congestion, which
now occurs at nearly every step.
Anyway, the algorithm you describe with the MPI_Sendrecv acts as an
implicit barrier, as they all wait for each other at some point. What
happens if you make sure that each MPI_Sendrecv acts only between 2
nodes at any moment (make [source:destination] a unique tuple)?
I have actually already tried this, but I get worse timings than with
the ring pattern, which I don't understand. I now choose

    /* send to dest */
    dest = m[cpuid][i];
    /* receive from source */
    source = dest;

with a matrix m chosen such that each processor pair communicates in
exactly one phase. I get the timings listed below.
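(For illustration only, not the actual matrix m used above: with a
power-of-two number of processes, one common way to get such a pairing
is an XOR schedule, where in phase i every rank exchanges with exactly
one partner. cpuid, nprocs, sendbuf, recvbuf, chunk and comm are
assumed variables here, with the buffers taken to be float arrays.)

    /* Pairwise-exchange sketch: in phase i, rank cpuid talks only to
       partner cpuid ^ i, so each (source, dest) pair occurs in exactly
       one phase and source == dest for the sendrecv. */
    for (int i = 1; i < nprocs; i++) {
        int partner = cpuid ^ i;    /* plays the role of m[cpuid][i] */
        MPI_Sendrecv(sendbuf + partner * chunk, chunk, MPI_FLOAT, partner, 0,
                     recvbuf + partner * chunk, chunk, MPI_FLOAT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }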
Without barrier:
sending 131072 bytes to 32 processes took ... 0.07872 seconds
sending 131072 bytes to 32 processes took ... 0.07667 seconds
sending 131072 bytes to 32 processes took ... 0.07637 seconds
sending 131072 bytes to 32 processes took ... 0.28047 seconds
sending 131072 bytes to 32 processes took ... 0.28580 seconds
sending 131072 bytes to 32 processes took ... 0.28156 seconds
sending 131072 bytes to 32 processes took ... 0.28533 seconds
sending 131072 bytes to 32 processes took ... 0.07763 seconds
sending 131072 bytes to 32 processes took ... 0.27871 seconds
sending 131072 bytes to 32 processes took ... 0.07749 seconds
Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.186031 (0.140984) seconds. Min: 0.075035 max: 0.576157
With barrier:
sending 131072 bytes to 32 processes took ... 0.08342 seconds
sending 131072 bytes to 32 processes took ... 0.08432 seconds
sending 131072 bytes to 32 processes took ... 0.08378 seconds
sending 131072 bytes to 32 processes took ... 0.08412 seconds
sending 131072 bytes to 32 processes took ... 0.08312 seconds
sending 131072 bytes to 32 processes took ... 0.08365 seconds
sending 131072 bytes to 32 processes took ... 0.08332 seconds
sending 131072 bytes to 32 processes took ... 0.08376 seconds
sending 131072 bytes to 32 processes took ... 0.08367 seconds
sending 131072 bytes to 32 processes took ... 0.32773 seconds
Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.107121 (0.066466) seconds. Min: 0.082758 max: 0.357322
In the case of paired communication the barrier improves things. Let
me stress that both the paired and the ring communication show no
congestion for up to 16 nodes. The problem arises in the 32-CPU case.
It should not be due to the switch, since it has 48 ports and a
96 Gbit/s backplane.
Does all this mean the congestion problem cannot be solved for
Gbit Ethernet?
Carsten
---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne
"Half of what I say is meaningless; but I say it so that the other
half may reach you"
Kahlil Gibran