Hi Graham,

sorry for the long delay, I was on Christmas holidays. I wish you a Happy
New Year!

On Fri, 23 Dec 2005, Graham E Fagg wrote:
>
> > I have also tried the tuned alltoalls and they are really great!! Only for
> > very few message sizes in the case of 4 CPUs on a node one of my alltoalls
> > performed better. Are these tuned collectives ready to be used for
> > production runs?
>
> We are actively testing them on larger systems to get better decision
> functions.. can you send me the list of which sizes they do better and
> worse for? (that way I can alter the decision functions). But the real
> question is do they exhibit the strange performance behaviour that you
> have with the other alltoall versions? (Noting that in my previous email
> to you I stated that one of the alltoalls is a sendrecv pairbased
> implementation).

(Uh, I don't think that previous email ever reached my inbox (?)) But yes,
the OMPI tuned all-to-all also shows this strange performance behaviour
(i.e. sometimes it is fast, sometimes it is delayed by 0.2 seconds or
more). For message sizes where the delays occur, I am sometimes able to do
better with an alternative all-to-all routine. It sets up the same
communication pattern as the pair-based sendrecv all-to-all, but pairs up
nodes rather than individual CPUs. The core looks like this:

   /* loop over nodes */
   for (i=0; i<nnodes; i++)
   {
     destnode   = (         nodeid + i) % nnodes;  /* send to destination node */
     sourcenode = (nnodes + nodeid - i) % nnodes;  /* receive from source node */
     /* loop over CPUs on each node */
     for (j=0; j<procs_pn; j++)  /* 1 or more processors per node */
     {
       sourcecpu = sourcenode*procs_pn + j; /* source of data */
       destcpu   = destnode  *procs_pn + j; /* destination of data */
       MPI_Irecv(recvbuf + sourcecpu*recvcount, recvcount, recvtype, sourcecpu,
                 0, comm, &recvrequests[j]);
       MPI_Isend(sendbuf + destcpu  *sendcount, sendcount, sendtype, destcpu,
                 0, comm, &sendrequests[j]);
     }
     MPI_Waitall(procs_pn, sendrequests, sendstatuses);
     MPI_Waitall(procs_pn, recvrequests, recvstatuses);
   }
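
In case it is useful for testing, here is a more self-contained sketch of
the same loop (for illustration only: the name alltoall_nodewise, the
void* buffers with extent-based byte offsets, the malloc'ed request arrays
and MPI_STATUSES_IGNORE are details filled in just for this sketch; it
assumes node-contiguous rank placement with procs_pn ranks per node):

   /* Sketch only: assumes ranks are placed node-contiguously and that
      procs_pn divides the communicator size. */
   #include <stdlib.h>
   #include <mpi.h>

   int alltoall_nodewise(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                         void *recvbuf, int recvcount, MPI_Datatype recvtype,
                         int procs_pn, MPI_Comm comm)
   {
     int rank, size, nnodes, nodeid, i, j;
     int destnode, sourcenode, destcpu, sourcecpu;
     MPI_Aint lb, sendext, recvext;
     MPI_Request *sendrequests, *recvrequests;

     MPI_Comm_rank(comm, &rank);
     MPI_Comm_size(comm, &size);
     nnodes = size / procs_pn;        /* assumes size % procs_pn == 0  */
     nodeid = rank / procs_pn;        /* assumes node-contiguous ranks */

     MPI_Type_get_extent(sendtype, &lb, &sendext);
     MPI_Type_get_extent(recvtype, &lb, &recvext);

     sendrequests = malloc(procs_pn * sizeof(MPI_Request));
     recvrequests = malloc(procs_pn * sizeof(MPI_Request));

     for (i=0; i<nnodes; i++)         /* loop over nodes */
     {
       destnode   = (         nodeid + i) % nnodes;
       sourcenode = (nnodes + nodeid - i) % nnodes;
       for (j=0; j<procs_pn; j++)     /* loop over CPUs on each node */
       {
         sourcecpu = sourcenode*procs_pn + j;
         destcpu   = destnode  *procs_pn + j;
         MPI_Irecv((char*)recvbuf + (MPI_Aint)sourcecpu*recvcount*recvext,
                   recvcount, recvtype, sourcecpu, 0, comm, &recvrequests[j]);
         MPI_Isend((char*)sendbuf + (MPI_Aint)destcpu*sendcount*sendext,
                   sendcount, sendtype, destcpu, 0, comm, &sendrequests[j]);
       }
       MPI_Waitall(procs_pn, sendrequests, MPI_STATUSES_IGNORE);
       MPI_Waitall(procs_pn, recvrequests, MPI_STATUSES_IGNORE);
     }

     free(sendrequests);
     free(recvrequests);
     return MPI_SUCCESS;
   }

With procs_pn == 1 this reduces to the plain pair-based sendrecv pattern
over the individual CPUs.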

I tested message sizes of 4, 8, 16, 32, ... 131072 bytes, sent from each
CPU to every other CPU, on 4, 8, 16, 24 and 32 nodes (each node has 1, 2
or 4 CPUs). While in general the OMPI all-to-all performs better, the
alternative routine wins for the following message sizes (in bytes):

4 CPU nodes:
128 CPUs on 32 nodes: 512, 1024
 96 CPUs on 24 nodes: 512, 1024, 2048, 4096,       16384
 64 CPUs on 16 nodes:                  4096

2 CPU nodes:
 64 CPUs on 32 nodes:      1024, 2048, 4096, 8192
 48 CPUs on 24 nodes:            2048, 4096, 8192,                      131072

1 CPU nodes:
 32 CPUs on 32 nodes:                  4096, 8192, 16384
 24 CPUs on 24 nodes:                        8192, 16384, 32768, 65536, 131072

Here is an example measurement for 128 CPUs on 32 nodes, with averages
taken over 25 runs, not counting the first one (a sketch of such a timing
loop is appended at the end of this mail). Performance problems are marked
with (!), message sizes where the alternative routine wins with (+):

OMPI tuned all-to-all:
======================
       mesg size  time in seconds
#CPUs   (floats)  average   std.dev.    min.      max.
 128           1  0.001288  0.000102    0.001077  0.001512
 128           2  0.008391  0.000400    0.007861  0.009958
 128           4  0.008403  0.000237    0.008095  0.009018
 128           8  0.008228  0.000942    0.003801  0.008810
 128          16  0.008503  0.000191    0.008233  0.008839
 128          32  0.008656  0.000271    0.008084  0.009177
 128          64  0.009085  0.000209    0.008757  0.009603
 128         128  0.251414  0.073069    0.011547  0.506703 !
 128         256  0.385515  0.127661    0.251431  0.578955 !
 128         512  0.035111  0.000872    0.033358  0.036262
 128        1024  0.046028  0.002116    0.043381  0.052602
 128        2048  0.073392  0.007745    0.066432  0.104531
 128        4096  0.165052  0.072889    0.124589  0.404213
 128        8192  0.341377  0.041815    0.309457  0.530409
 128       16384  0.507200  0.050872    0.492307  0.750956
 128       32768  1.050291  0.132867    0.954496  1.344978
 128       65536  2.213977  0.154987    1.962907  2.492560
 128      131072  4.026107  0.147103    3.800191  4.336205

alternative all-to-all:
=======================
       mesg size  time in seconds
#CPUs   (floats)  average   std.dev.    min.      max.
 128           1  0.012584  0.000724    0.011073  0.015331
 128           2  0.012506  0.000444    0.011707  0.013461
 128           4  0.012412  0.000511    0.011157  0.013413
 128           8  0.012488  0.000455    0.011767  0.013746
 128          16  0.012664  0.000416    0.011745  0.013362
 128          32  0.012878  0.000410    0.012157  0.013609
 128          64  0.013138  0.000417    0.012452  0.013826
 128         128  0.014016  0.000505    0.013195  0.014942 +
 128         256  0.015843  0.000521    0.015107  0.016725 +
 128         512  0.052240  0.079323    0.027019  0.320653 !
 128        1024  0.123884  0.121560    0.038062  0.308929 !
 128        2048  0.176877  0.125229    0.074457  0.387276 !
 128        4096  0.305030  0.121716    0.176640  0.496375 !
 128        8192  0.546405  0.108007    0.415272  0.899858 !
 128       16384  0.604844  0.056576    0.558657  0.843943 !
 128       32768  1.235298  0.097969    1.094720  1.451241 !
 128       65536  2.926902  0.312733    2.458742  3.895563 !
 128      131072  6.208087  0.472115    5.354304  7.317153 !

The alternative all-to-all has the same performance problems, but they set
in at larger message sizes ... and persist over a wider range of sizes ;(
The results for the other cases look similar.
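
For completeness, here is a sketch of how such a timing loop can be set up
(simplified, not the exact benchmark code; taking the slowest CPU's time
per run via a max-reduction is just one possible choice, and the function
name time_alltoall is made up for this sketch):

   /* Sketch of a timing loop: one warm-up run that is discarded plus 25
      counted repetitions per message size; avg/std.dev./min/max are
      computed on rank 0 from the slowest CPU's time per run. */
   #include <math.h>
   #include <stdio.h>
   #include <mpi.h>

   #define NRUNS 26   /* 1 warm-up run + 25 counted runs */

   void time_alltoall(float *sendbuf, float *recvbuf, int nfloats,
                      MPI_Comm comm)
   {
     double t[NRUNS], tmax[NRUNS];
     int run, rank, ncpus;

     MPI_Comm_rank(comm, &rank);
     MPI_Comm_size(comm, &ncpus);

     for (run=0; run<NRUNS; run++)
     {
       MPI_Barrier(comm);                 /* start all CPUs together */
       t[run] = MPI_Wtime();
       MPI_Alltoall(sendbuf, nfloats, MPI_FLOAT,    /* or the alternative */
                    recvbuf, nfloats, MPI_FLOAT, comm);
       t[run] = MPI_Wtime() - t[run];
     }

     /* take the slowest CPU's time for each run */
     MPI_Reduce(t, tmax, NRUNS, MPI_DOUBLE, MPI_MAX, 0, comm);

     if (rank == 0)
     {
       double sum=0.0, sumsq=0.0, tmin=tmax[1], tmx=tmax[1];
       for (run=1; run<NRUNS; run++)      /* skip run 0 (warm-up) */
       {
         sum   += tmax[run];
         sumsq += tmax[run]*tmax[run];
         if (tmax[run] < tmin) tmin = tmax[run];
         if (tmax[run] > tmx ) tmx  = tmax[run];
       }
       double avg = sum/(NRUNS-1);
       double sd  = sqrt(sumsq/(NRUNS-1) - avg*avg);
       printf("%5d  %10d  %f  %f  %f  %f\n",
              ncpus, nfloats, avg, sd, tmin, tmx);
     }
   }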

Ciao,
  Carsten


---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne
