Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

George Bosilca Mon, 19 Dec 2005 16:57:37 -0500

Carsten,

In the Open MPI source code directory there is a collective component called tuned (ompi/mca/coll/tuned). This component is not enabled by default right now, but it usually gives better performance than the basic one. You should give it a try: go into that directory, remove the .ompi_ignore file, and rerun autogen and configure.

I don't see how you deduce that adding barriers increases the congestion. It increases the latency of the all-to-all, but to me that makes sense. For each pair of messages that you send (and these pairs are sent in parallel), you add a global synchronization on top of it. Its cost depends on the algorithm used for the barrier, but in any case it can hardly be pipelined with other communications. If you have a hardware barrier it can help, but over TCP or any other point-to-point network it won't.
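The latency argument can be made concrete with a simple latency/bandwidth (alpha-beta) cost model. This is only an illustration under stated assumptions: a dissemination barrier costing ceil(log2 p) message latencies, no overlap between the barrier and the exchange, and hypothetical names (`net_model`, `barrier_rounds`, `alltoall_time`) and parameter values that are placeholders, not measurements from this thread.

```c
#include <assert.h>

/* Hypothetical alpha-beta model parameters (placeholders, NOT measured):
 * alpha = per-message latency in seconds, beta = seconds per byte. */
typedef struct {
    double alpha;
    double beta;
} net_model;

/* Rounds of a dissemination barrier among p processes: ceil(log2 p). */
static int barrier_rounds(int p)
{
    int rounds = 0;
    for (int reach = 1; reach < p; reach *= 2)
        rounds++;
    return rounds;
}

/* Estimated time of the ordered Sendrecv all-to-all: p - 1 steps, each
 * moving one message of m bytes. The step-0 self-copy (and its barrier)
 * is ignored. With a barrier after every step, each step also pays
 * barrier_rounds(p) latencies, assumed not to overlap the exchange. */
static double alltoall_time(net_model n, int p, double m, int with_barrier)
{
    assert(p > 0 && m >= 0.0);
    double step = n.alpha + m * n.beta;
    if (with_barrier)
        step += barrier_rounds(p) * n.alpha;
    return (p - 1) * step;
}
```

In this model the barrier only adds a latency term of (p - 1) * ceil(log2 p) * alpha; with a placeholder alpha of 50 microseconds and p = 32 that is 31 * 5 * 50 microseconds = 7.75 ms. The model says nothing about congestion, which is the point being made above.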

Anyway, the algorithm you describe with MPI_Sendrecv acts as an implicit barrier, as all processes wait for each other at some point. What happens if you make sure that each MPI_Sendrecv involves only 2 nodes at any moment (make [source:destination] a unique tuple)?
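One common way to get a unique [source:destination] pairing at every step, when the number of processes is a power of two (as in the 16- and 32-process runs), is the XOR pairwise exchange: at step i, rank r both sends to and receives from rank r XOR i. This is a swapped-in technique, not necessarily the one meant above; `xor_partner`, `schedule_is_pairwise` and `MAX_RANKS` are hypothetical names, and the check runs offline without MPI.

```c
#include <assert.h>

#define MAX_RANKS 1024  /* hypothetical upper bound for the offline check */

/* Pairwise-exchange partner: at step i (0 <= i < ncpu), rank r sends to
 * AND receives from rank r XOR i. Requires ncpu to be a power of two. */
static int xor_partner(int rank, int step, int ncpu)
{
    assert(ncpu > 0 && (ncpu & (ncpu - 1)) == 0);
    assert(rank >= 0 && rank < ncpu && step >= 0 && step < ncpu);
    return rank ^ step;
}

/* Returns 1 if at every step the ranks form disjoint pairs (my partner's
 * partner is me) and, over all steps, each rank meets every rank exactly
 * once (itself at step 0, the local copy); returns 0 otherwise. */
static int schedule_is_pairwise(int ncpu)
{
    assert(ncpu > 0 && ncpu <= MAX_RANKS);
    for (int r = 0; r < ncpu; r++) {
        char seen[MAX_RANKS] = {0};
        for (int step = 0; step < ncpu; step++) {
            int p = xor_partner(r, step, ncpu);
            if (xor_partner(p, step, ncpu) != r || seen[p])
                return 0;
            seen[p] = 1;
        }
    }
    return 1;
}

/* Inside the MPI loop, partner would replace both dest and source, e.g.
 *   partner = xor_partner(cpuid, i, ncpu);
 *   MPI_Sendrecv(sendbuf + partner*sendcount, sendcount, sendtype, partner, 0,
 *                recvbuf + partner*recvcount, recvcount, recvtype, partner, 0,
 *                comm, &status);
 */
```

Because each step splits the nodes into disjoint pairs, every link carries traffic between exactly two nodes at a time, which is the property being asked about.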


  Thanks,
    george.

On Dec 19, 2005, at 7:26 AM, Carsten Kutzner wrote:

Hello,

I am desperately trying to get better all-to-all performance on Gbit
Ethernet (flow control is enabled). I have been playing around with
several all-to-all schemes and have been able to reduce congestion by
communicating in an ordered fashion.

For example, the simplest scheme looks like this:

   for (i=0; i<ncpu; i++)
   {
     /* send to dest, i steps ahead of me in the ring */
     dest   = (cpuid + i) % ncpu;
     /* receive from source, i steps behind me */
     source = (ncpu + cpuid - i) % ncpu;
     /* offsets count elements, so this assumes sendbuf/recvbuf are float * */
     MPI_Sendrecv(sendbuf + dest*sendcount,   sendcount, sendtype, dest,   0,
                  recvbuf + source*recvcount, recvcount, recvtype, source, 0,
                  comm, &status);
   }
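The schedule of this loop can be checked offline to see why every MPI_Sendrecv finds a matching partner, and also why it is not a pairwise exchange: at step i a rank sends to one neighbour and receives from a different one unless 2i is a multiple of ncpu. The helper names below (`shift_dest`, `shift_source`, `shift_step_matches`, `shift_step_is_pairwise`) are hypothetical; only the two index formulas come from the loop above.

```c
#include <assert.h>

/* Index formulas copied from the loop above. */
static int shift_dest(int cpuid, int i, int ncpu)   { return (cpuid + i) % ncpu; }
static int shift_source(int cpuid, int i, int ncpu) { return (ncpu + cpuid - i) % ncpu; }

/* 1 if at step i the rank each process receives from is exactly the rank
 * whose dest is that process, i.e. every send has a matching receive. */
static int shift_step_matches(int i, int ncpu)
{
    assert(ncpu > 0 && i >= 0 && i < ncpu);
    for (int r = 0; r < ncpu; r++)
        if (shift_dest(shift_source(r, i, ncpu), i, ncpu) != r)
            return 0;
    return 1;
}

/* 1 if at step i every rank sends to and receives from the same partner
 * (a true pairwise exchange); this holds only when 2*i % ncpu == 0. */
static int shift_step_is_pairwise(int i, int ncpu)
{
    assert(ncpu > 0 && i >= 0 && i < ncpu);
    for (int r = 0; r < ncpu; r++)
        if (shift_dest(r, i, ncpu) != shift_source(r, i, ncpu))
            return 0;
    return 1;
}
```

For ncpu = 32 only steps 0 and 16 are pairwise; at every other step the ranks form rings, so each node's progress depends on two different neighbours.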
For sendcount=32768 and sendtype=float (i.e. 131072 bytes per message), the time such
an all-to-all takes is (average over 100 runs, standard deviation in parentheses):

Scheme               Procs   Mean [s] (std dev)    Min [s]    Max [s]
SENDRECV ALLTOALL      16    0.036783 (0.008798)   0.034175   0.123684
SENDRECV ALLTOALL      32    0.082687 (0.035920)   0.071915   0.285299
MPI_Alltoall           16    0.057936 (0.073605)   0.027218   0.275988
MPI_Alltoall           32    0.137835 (0.100580)   0.055607   0.412144
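A useful reference point for these timings is the wire-speed lower bound. The sketch below assumes a payload rate of 1e9 bit/s = 125e6 bytes/s per NIC, full-duplex links, a non-blocking switch, 4-byte floats (as the 131072-byte figure implies), and ignores Ethernet/IP/TCP header overhead; `GBE_BYTES_PER_SEC`, `wire_bytes` and `alltoall_lower_bound` are hypothetical names.

```c
#include <assert.h>

/* Assumed Gigabit Ethernet payload rate: 1e9 bit/s = 125e6 bytes/s.
 * Header overhead is ignored, so the real bound is somewhat higher. */
#define GBE_BYTES_PER_SEC 125e6

/* Bytes each process must put on the wire in one all-to-all:
 * ncpu - 1 messages of msg_bytes (the message to itself stays local). */
static double wire_bytes(int ncpu, double msg_bytes)
{
    assert(ncpu > 0 && msg_bytes >= 0.0);
    return (ncpu - 1) * msg_bytes;
}

/* Lower bound on the all-to-all time if every NIC sends at full rate;
 * with full-duplex links, receiving in parallel costs no extra time. */
static double alltoall_lower_bound(int ncpu, double msg_bytes)
{
    return wire_bytes(ncpu, msg_bytes) / GBE_BYTES_PER_SEC;
}
```

For 131072-byte messages this gives 1,966,080 bytes and about 0.0157 s on 16 processes, and 4,063,232 bytes and about 0.0325 s on 32 processes. The best MPI_Alltoall runs (0.027218 s and 0.055607 s) are within a factor of about 1.7 of this bound, while the averages are between about 2.3 and 4.2 times it.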
The sendrecv all-to-all performs better for these message sizes, but
on 32 CPUs (on 32 nodes) there is still congestion. When I try to separate
the communication phases by putting an MPI_Barrier(MPI_COMM_WORLD) after
the sendrecv, the congestion gets even worse:

SENDRECV ALLTOALL on 32 PROCS, with Barrier:
32768 floats took 0.179162 (0.136885) seconds. Min: 0.091028, max: 0.729049
How can a barrier lead to more congestion???

Thanks in advance for helpful comments,
   Carsten


---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

"Half of what I say is meaningless; but I say it so that the other half may reach you"

                                  Kahlil Gibran
