Hi Graham,

sorry for the long delay, I was on Christmas holidays. I wish you a Happy New Year!
On Fri, 23 Dec 2005, Graham E Fagg wrote:

> > I have also tried the tuned alltoalls and they are really great!! Only
> > for very few message sizes in the case of 4 CPUs on a node one of my
> > alltoalls performed better. Are these tuned collectives ready to be
> > used for production runs?
>
> We are actively testing them on larger systems to get better decision
> functions.. can you send me the list of which sizes they do better and
> worse for? (that way I can alter the decision functions). But the real
> question is do they exhibit the strange performance behaviour that you
> have with the other alltoall versions? (Noting that in my previous email
> to you I stated that one of the alltoalls is a sendrecv pairbased
> implementation).

(Uh, I think that previous email did not arrive in my mailbox?)

But yes, the OMPI tuned all-to-all also shows this strange performance
behaviour (i.e. sometimes it is fast, sometimes it is delayed by 0.2
seconds or more). For message sizes where the delays occur, I can
sometimes do better with an alternative all-to-all routine. It sets up
the same communication pattern as the pairbased sendrecv all-to-all, but
it pairs nodes rather than individual CPUs. The core looks like this:

  /* loop over nodes */
  for (i = 0; i < nnodes; i++)
  {
      destnode   = (nodeid + i) % nnodes;           /* node to send to      */
      sourcenode = (nnodes + nodeid - i) % nnodes;  /* node to receive from */

      /* loop over the CPUs on each node (1 or more processes per node) */
      for (j = 0; j < procs_pn; j++)
      {
          sourcecpu = sourcenode*procs_pn + j;   /* source of data      */
          destcpu   = destnode  *procs_pn + j;   /* destination of data */

          MPI_Irecv(recvbuf + sourcecpu*recvcount, recvcount, recvtype,
                    sourcecpu, 0, comm, &recvrequests[j]);
          MPI_Isend(sendbuf + destcpu*sendcount, sendcount, sendtype,
                    destcpu, 0, comm, &sendrequests[j]);
      }
      MPI_Waitall(procs_pn, sendrequests, sendstatuses);
      MPI_Waitall(procs_pn, recvrequests, recvstatuses);
  }

I tested message sizes of 4, 8, 16, 32, ... 131072 bytes (sent from each
CPU to every other CPU) on 4, 8, 16, 24 and 32 nodes, where each node has
1, 2 or 4 CPUs. While the OMPI all-to-all generally performs better, the
alternative one is faster for the following message sizes:

4 CPU nodes:
  128 CPUs on 32 nodes:  512, 1024 byte
   96 CPUs on 24 nodes:  512, 1024, 2048, 4096, 16384 byte
   64 CPUs on 16 nodes:  4096 byte

2 CPU nodes:
   64 CPUs on 32 nodes:  1024, 2048, 4096, 8192 byte
   48 CPUs on 24 nodes:  2048, 4096, 8192, 131072 byte

1 CPU nodes:
   32 CPUs on 32 nodes:  4096, 8192, 16384 byte
   24 CPUs on 24 nodes:  8192, 16384, 32768, 65536, 131072 byte
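For completeness, the timing loop I use looks roughly like this (a
simplified sketch; the barrier placement, the variable names and the
placeholder alltoall_test() are not literally my benchmark code, they
just illustrate the procedure):

  double t0, t1, tsum = 0.0;
  int    run, nruns = 26;              /* 1 warm-up run + 25 measured runs */

  for (run = 0; run < nruns; run++)
  {
      MPI_Barrier(comm);               /* start all processes together     */
      t0 = MPI_Wtime();
      alltoall_test(sendbuf, sendcount, sendtype,   /* whichever all-to-all */
                    recvbuf, recvcount, recvtype,   /* is being measured    */
                    comm);
      t1 = MPI_Wtime();
      if (run > 0)                     /* do not count the 1st run          */
          tsum += t1 - t0;
  }
  /* average = tsum/25; std.dev., min. and max. are taken over the
     same 25 measured runs */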
Here is an example measurement for 128 CPUs on 32 nodes, averages taken
over 25 runs, not counting the 1st one. Performance problems are marked
with (!):

OMPI tuned all-to-all:
======================
        mesg size              time in seconds
#CPUs    floats     average    std.dev.   min.       max.
  128         1    0.001288   0.000102   0.001077   0.001512
  128         2    0.008391   0.000400   0.007861   0.009958
  128         4    0.008403   0.000237   0.008095   0.009018
  128         8    0.008228   0.000942   0.003801   0.008810
  128        16    0.008503   0.000191   0.008233   0.008839
  128        32    0.008656   0.000271   0.008084   0.009177
  128        64    0.009085   0.000209   0.008757   0.009603
  128       128    0.251414   0.073069   0.011547   0.506703  !
  128       256    0.385515   0.127661   0.251431   0.578955  !
  128       512    0.035111   0.000872   0.033358   0.036262
  128      1024    0.046028   0.002116   0.043381   0.052602
  128      2048    0.073392   0.007745   0.066432   0.104531
  128      4096    0.165052   0.072889   0.124589   0.404213
  128      8192    0.341377   0.041815   0.309457   0.530409
  128     16384    0.507200   0.050872   0.492307   0.750956
  128     32768    1.050291   0.132867   0.954496   1.344978
  128     65536    2.213977   0.154987   1.962907   2.492560
  128    131072    4.026107   0.147103   3.800191   4.336205

alternative all-to-all:
=======================
  128         1    0.012584   0.000724   0.011073   0.015331
  128         2    0.012506   0.000444   0.011707   0.013461
  128         4    0.012412   0.000511   0.011157   0.013413
  128         8    0.012488   0.000455   0.011767   0.013746
  128        16    0.012664   0.000416   0.011745   0.013362
  128        32    0.012878   0.000410   0.012157   0.013609
  128        64    0.013138   0.000417   0.012452   0.013826
  128       128    0.014016   0.000505   0.013195   0.014942  +
  128       256    0.015843   0.000521   0.015107   0.016725  +
  128       512    0.052240   0.079323   0.027019   0.320653  !
  128      1024    0.123884   0.121560   0.038062   0.308929  !
  128      2048    0.176877   0.125229   0.074457   0.387276  !
  128      4096    0.305030   0.121716   0.176640   0.496375  !
  128      8192    0.546405   0.108007   0.415272   0.899858  !
  128     16384    0.604844   0.056576   0.558657   0.843943  !
  128     32768    1.235298   0.097969   1.094720   1.451241  !
  128     65536    2.926902   0.312733   2.458742   3.895563  !
  128    131072    6.208087   0.472115   5.354304   7.317153  !

The alternative all-to-all has the same performance problems, but they
set in later ... and last longer ;(

The results for the other cases look similar.

Ciao,
  Carsten

---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne