Note that there are several ways to set the parameters; --mca on the command line is just one of them (suitable for quick interactive tests).
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables.

Best,
Paul Kapinos

On 12/19/12 11:44, Number Cruncher wrote:
Having run some more benchmarks, the new default is *really* bad for our application (2-10x slower), so I've been looking at the source to try and figure out why.

It seems that the biggest difference will occur when the all_to_all is actually sparse (e.g. our application): if most N-M process exchanges are zero in size, the 1.6.0 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will only post irecv/isend for non-zero exchanges; any zero-size exchanges are skipped. It then waits once for all requests to complete. In contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size exchanges for *every* N-M pair, and wait for each pairwise exchange. This is O(comm_size) waits, many of which are for zero-size messages. I'm not clear what optimizations there are for zero-size isend/irecv, but surely there's a great deal more latency if each pairwise exchange has to be confirmed complete before executing the next?

Relatedly, how would I direct Open MPI to use the older algorithm programmatically? I don't want the user to have to use "--mca" in their "mpiexec". Is there a C API?

Thanks,
Simon

On 16/11/12 10:15, Iliev, Hristo wrote:

Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion with increasing stride, so it works best when independent communication paths can be established between several ports of the network switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so, some is - it depends (usually on the price). That said, not all algorithms perform the same on a given type of network interconnect. For example, on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear one, which used to be the default. Algorithm 2 is the pairwise one.

You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mca-params.conf or (to give it global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes sense to activate process binding with --bind-to-core if you haven't already done so. It prevents MPI processes from being migrated to other NUMA nodes while running.

Kind regards,
Hristo

--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
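On Simon's question above about selecting the algorithm programmatically without "--mca": as far as I know the 1.6 series exposes no public C API for MCA parameters, but since Open MPI picks up OMPI_MCA_* environment variables during MPI_Init, a program can set them itself beforehand. A minimal sketch, assuming a POSIX environment (setenv) and using the parameter names quoted above:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Same effect as the "export OMPI_MCA_..." lines above: Open MPI
     * reads these variables when MPI_Init() sets up the collective
     * framework, so they must be set before it is called, on every
     * rank.  This is a workaround, not a supported API. */
    setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 1);
    setenv("OMPI_MCA_coll_tuned_alltoallv_algorithm", "1", 1);

    MPI_Init(&argc, &argv);
    /* ... application code using MPI_Alltoallv ... */
    MPI_Finalize();
    return 0;
}

The MPI_T control-variable interface, which makes this kind of tuning official, only arrived with later, MPI-3-capable releases.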
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of version 1.6.1.

* This is most noticeable for high-frequency exchanges over 1Gb ethernet, where process-to-process message sizes are fairly small (e.g. 100 kbyte) and much of the exchange matrix is sparse.
* The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm to a pairwise exchange", but I'm not clear what this means or how to switch back to the old "non-default algorithm".

I attach a test program which illustrates the sort of usage in our MPI application. I have run this as 32 processes on four nodes, over 1Gb ethernet, each node with 2x Opteron 4180 (dual hex-core); rank 0,4,8,... on node 1, rank 1,5,9,... on node 2, etc. It constructs an array of integers and an nProcess x nProcess exchange typical of part of our application. This is then exchanged several thousand times. Output from "mpicc -O3" runs is shown below.

My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also attach a plot showing network throughput on our actual mesh generation application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an hour to run. There seems to be a much greater network demand in the 1.6.1 version, despite the user code and input data being identical.

Thanks for any help you can give,
Simon
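The test program and throughput plot referred to above are attachments and are not reproduced here. For orientation only, a sparse exchange of the shape described (mostly-zero counts, ~100 kbyte to a few neighbours, repeated many times) might be sketched as below; the neighbour pattern, chunk size and iteration count are all invented for illustration and are not the attached code:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nproc, i, iter, stot = 0, rtot = 0;
    int *scounts, *rcounts, *sdispl, *rdispl, *sbuf, *rbuf;
    const int chunk = 25000;  /* ~100 kbyte of ints per neighbour (invented) */
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Sparse exchange matrix: each rank talks only to its immediate
     * neighbours; every other pairwise count stays zero.  The pairwise
     * algorithm still posts and waits on those zero-size exchanges. */
    scounts = calloc(nproc, sizeof(int));
    rcounts = calloc(nproc, sizeof(int));
    sdispl  = calloc(nproc, sizeof(int));
    rdispl  = calloc(nproc, sizeof(int));
    if (rank > 0) {
        scounts[rank - 1] = chunk;
        rcounts[rank - 1] = chunk;
    }
    if (rank < nproc - 1) {
        scounts[rank + 1] = chunk;
        rcounts[rank + 1] = chunk;
    }
    for (i = 0; i < nproc; i++) {
        sdispl[i] = stot;  stot += scounts[i];
        rdispl[i] = rtot;  rtot += rcounts[i];
    }
    sbuf = malloc(stot * sizeof(int));
    rbuf = malloc(rtot * sizeof(int));
    for (i = 0; i < stot; i++)
        sbuf[i] = rank;

    t0 = MPI_Wtime();
    for (iter = 0; iter < 1000; iter++)   /* iteration count invented */
        MPI_Alltoallv(sbuf, scounts, sdispl, MPI_INT,
                      rbuf, rcounts, rdispl, MPI_INT, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("1000 sparse MPI_Alltoallv calls: %.3f s\n", t1 - t0);

    free(sbuf);  free(rbuf);
    free(scounts); free(rcounts); free(sdispl); free(rdispl);
    MPI_Finalize();
    return 0;
}

Timing such a loop under 1.6.0 and 1.6.1 while toggling the MCA parameters quoted earlier should isolate the algorithm change from any other difference between the releases.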
--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915