On Mon, 19 Dec 2005, George Bosilca wrote:

> Carsten,
>
> In the Open MPI source code directory there is a collective component
> called tuned (ompi/mca/coll/tuned). This component is not enabled by
> default right now, but usually it gives better performance than the
> basic one. You should give it a try (go inside and remove
> the .ompi_ignore file and redo the autogen and configure).

Hi George,

thanks a lot for your reply. I will definitely try out the tuned
collectives!

> I don't see how you deduce that adding barriers increases the
> congestion? It increases the latency for the all-to-all, but for me

When I do an all-to-all a lot of times, I see that the time for a
single all-to-all varies a lot. My time measurement:

do 100 times {
    MPI_Barrier
    MPI_Wtime
    ALLTOALL
    MPI_Barrier
    MPI_Wtime
}
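In C, the measurement loop looks roughly like this (a reconstruction
for illustration, not the exact benchmark code; plain MPI_Alltoall
stands in here for whichever all-to-all variant is under test, and the
buffer layout and names are mine):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NRUNS   100
#define NFLOATS 32768   /* 131072 bytes of floats per destination */

int main(int argc, char **argv)
{
    int    rank, nprocs, run;
    float  *sbuf, *rbuf;
    double t0, t1, dt, tsum = 0.0, tmin = 1e9, tmax = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* one block of NFLOATS floats for every destination process */
    sbuf = calloc((size_t)nprocs * NFLOATS, sizeof(float));
    rbuf = calloc((size_t)nprocs * NFLOATS, sizeof(float));

    for (run = 0; run < NRUNS; run++) {
        MPI_Barrier(MPI_COMM_WORLD);        /* line everybody up        */
        t0 = MPI_Wtime();
        MPI_Alltoall(sbuf, NFLOATS, MPI_FLOAT,
                     rbuf, NFLOATS, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);        /* wait for the slowest one */
        t1 = MPI_Wtime();

        dt = t1 - t0;
        tsum += dt;
        if (dt < tmin) tmin = dt;
        if (dt > tmax) tmax = dt;
        if (rank == 0)
            printf("sending %d bytes to %d processes took %.5f seconds\n",
                   (int)(NFLOATS * sizeof(float)), nprocs, dt);
    }

    if (rank == 0)
        printf("%d floats took %f seconds. Min: %f max: %f\n",
               NFLOATS, tsum / NRUNS, tmin, tmax);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

The barriers fence each iteration so that a slow step cannot leak into
the next measurement; the closing barrier is deliberately inside the
timed interval, so one laggard process shows up in everybody's time.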
For the ring-sendrecv all-to-all I get something like

...
sending 131072 bytes to 32 processes took ... 0.06433 seconds
sending 131072 bytes to 32 processes took ... 0.06866 seconds
sending 131072 bytes to 32 processes took ... 0.06233 seconds
sending 131072 bytes to 32 processes took ... 0.26683 seconds  (*)
sending 131072 bytes to 32 processes took ... 0.06353 seconds
sending 131072 bytes to 32 processes took ... 0.06470 seconds
sending 131072 bytes to 32 processes took ... 0.06483 seconds

Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.068903 (0.028432) seconds.
Min: 0.061708  max: 0.266832

The typical time my all-to-all takes is around 0.065 seconds, while
sometimes (*) it takes 0.2+ seconds more. This I interpret as
congestion. When I add a barrier after the MPI_Sendrecv inside the
all-to-all (see the sketch further below), I get many more of these
congestion events:

...
sending 131072 bytes to 32 processes took ... 0.11023 seconds
sending 131072 bytes to 32 processes took ... 0.48874 seconds
sending 131072 bytes to 32 processes took ... 0.27856 seconds
sending 131072 bytes to 32 processes took ... 0.27711 seconds
sending 131072 bytes to 32 processes took ... 0.31615 seconds
sending 131072 bytes to 32 processes took ... 0.07439 seconds
sending 131072 bytes to 32 processes took ... 0.07440 seconds
sending 131072 bytes to 32 processes took ... 0.07490 seconds
sending 131072 bytes to 32 processes took ... 0.27524 seconds
sending 131072 bytes to 32 processes took ... 0.07464 seconds

Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.250027 (0.158686) seconds.
Min: 0.072322  max: 0.970822

Indeed, the barrier has increased the typical all-to-all time from
0.065 to 0.075 seconds, but the most severe problem is the congestion,
as it now happens at nearly every step.

> Anyway, the algorithm you describe with the MPI_Sendrecv acts as an
> implicit barrier, as they all wait for each other at some point. What
> happens if you make sure that all MPI_Sendrecv act only between 2
> nodes at each moment (make [source:destination] a unique tuple)?

I have actually already tried this, but I get worse timings than with
the ring pattern, which I don't understand. In each phase i, I now
choose

    /* send to dest */
    dest = m[cpuid][i];
    /* receive from source */
    source = dest;

with a matrix m chosen such that each processor pair communicates in
exactly one phase. I get

Without barrier:

sending 131072 bytes to 32 processes took ... 0.07872 seconds
sending 131072 bytes to 32 processes took ... 0.07667 seconds
sending 131072 bytes to 32 processes took ... 0.07637 seconds
sending 131072 bytes to 32 processes took ... 0.28047 seconds
sending 131072 bytes to 32 processes took ... 0.28580 seconds
sending 131072 bytes to 32 processes took ... 0.28156 seconds
sending 131072 bytes to 32 processes took ... 0.28533 seconds
sending 131072 bytes to 32 processes took ... 0.07763 seconds
sending 131072 bytes to 32 processes took ... 0.27871 seconds
sending 131072 bytes to 32 processes took ... 0.07749 seconds

Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.186031 (0.140984) seconds.
Min: 0.075035  max: 0.576157

With barrier:

sending 131072 bytes to 32 processes took ... 0.08342 seconds
sending 131072 bytes to 32 processes took ... 0.08432 seconds
sending 131072 bytes to 32 processes took ... 0.08378 seconds
sending 131072 bytes to 32 processes took ... 0.08412 seconds
sending 131072 bytes to 32 processes took ... 0.08312 seconds
sending 131072 bytes to 32 processes took ... 0.08365 seconds
sending 131072 bytes to 32 processes took ... 0.08332 seconds
sending 131072 bytes to 32 processes took ... 0.08376 seconds
sending 131072 bytes to 32 processes took ... 0.08367 seconds
sending 131072 bytes to 32 processes took ... 0.32773 seconds

Summary (100-run average, timer resolution 0.000001):
32768 floats took 0.107121 (0.066466) seconds.
Min: 0.082758  max: 0.357322

So in the case of paired communication the barrier actually improves
the timings. For reference, both exchange patterns are sketched in C
below.
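The ring-sendrecv pattern is essentially the following (a simplified
sketch with illustrative names and buffer layout, not my actual code):

#include <mpi.h>

/* In phase i every process sends to the rank i ahead of it and
   receives from the rank i behind it; sbuf and rbuf hold one block
   of n floats per process. */
void ring_alltoall(float *sbuf, float *rbuf, int n, MPI_Comm comm)
{
    int rank, nprocs, i;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (i = 0; i < nprocs; i++) {
        int dest   = (rank + i) % nprocs;           /* send forward        */
        int source = (rank - i + nprocs) % nprocs;  /* receive from behind */
        MPI_Sendrecv(sbuf + dest * n,   n, MPI_FLOAT, dest,   0,
                     rbuf + source * n, n, MPI_FLOAT, source, 0,
                     comm, MPI_STATUS_IGNORE);
        /* MPI_Barrier(comm);   <- the extra barrier tested above */
    }
}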
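The paired scheme differs only in how the partners are chosen. I have
not attached my matrix m, but one concrete choice with the required
property (each processor pair meets in exactly one phase) is
m[cpuid][i] = cpuid XOR i, which works whenever the number of CPUs is
a power of two, as with our 32:

#include <mpi.h>

/* In every phase the processes fall into disjoint pairs that talk
   only to each other; phase 0 is the exchange with oneself. */
void paired_alltoall(float *sbuf, float *rbuf, int n, MPI_Comm comm)
{
    int rank, nprocs, i;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);    /* assumed to be a power of two */

    for (i = 0; i < nprocs; i++) {
        int partner = rank ^ i;      /* source == dest, as above */
        MPI_Sendrecv(sbuf + partner * n, n, MPI_FLOAT, partner, 0,
                     rbuf + partner * n, n, MPI_FLOAT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        /* MPI_Barrier(comm);   <- optional barrier between phases */
    }
}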
Let me stress that both the paired and the ring communication show no
congestion for up to 16 nodes; the problem only arises in the 32-CPU
case. It should not be due to the switch, since it has 48 ports and a
96 Gbit/s backplane, so it should nominally be able to serve all ports
at full-duplex Gbit speed simultaneously.

Does all this mean the congestion problem cannot be solved for Gbit
Ethernet?

Carsten

---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne