Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
I've looked in more detail at the two current MPI_Alltoallv algorithms and wanted to raise a couple of ideas.

Firstly, the new default "pairwise" algorithm:

* There is no optimisation for sparse/empty messages, compared to the old basic "linear" algorithm.
* The attached "pairwise-nop" patch adds this optimisation, and on the test case I first described in this thread (1000's of small, sparse all-to-alls) it cuts runtime by approximately 30%.
* I think the upper bound on the loop counter for the pairwise exchange is off by one. As the comment notes, it starts "from 1 since local exhange [sic] is done"; but on the final iteration, step == size, both sendto and recvfrom reduce to rank, and self-exchange is already handled in earlier code.

The pairwise algorithm still kills performance on my gigabit ethernet network. My message transmission time must be small compared to latency, and the comm_size forced synchronisation steps introduce a minimum delay of (single_link_latency * comm_size), i.e. latency scales linearly with comm_size. The linear algorithm doesn't wait on each individual exchange, so its minimum latency is just that of a single transmit/receive.

Which brings me to the second idea. The problem with the existing implementation of the linear algorithm is that the irecv/isend posting pattern is identical on all processes, meaning that every process starts by waiting for process 0's sends and finishes by waiting for rank (size-1)'s. It seems preferable to at least post the send/recv requests in the same staggered order as the pairwise algorithm. The attached "linear-alltoallv" patch implements this and, on my test case, shows a modest 5% improvement. I was wondering whether it would address the concerns that led to the switch of default algorithm.
Simon

diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c    2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c    2013-01-24 15:12:13.299568308 +
@@ -70,7 +70,7 @@
     }

     /* Perform pairwise exchange starting from 1 since local exhange is done */
-    for (step = 1; step < size + 1; step++) {
+    for (step = 1; step < size; step++) {

         /* Determine sender and receiver for this step. */
         sendto  = (rank + step) % size;

diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c    2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c    2013-01-24 15:11:56.562118400 +
@@ -37,25 +37,31 @@
                              ompi_status_public_t* status )
 {
     /* post receive first, then send, then waitall... should be fast (I hope) */
-    int err, line = 0;
+    int err, line = 0, nreq = 0;
     ompi_request_t* reqs[2];
     ompi_status_public_t statuses[2];

-    /* post new irecv */
-    err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag,
-                              comm, &reqs[0]));
-    if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
-
-    /* send data to children */
-    err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag,
-                              MCA_PML_BASE_SEND_STANDARD, comm, &reqs[1]));
-    if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+    if (0 != rcount) {
+        /* post new irecv */
+        err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag,
+                                  comm, &reqs[nreq++]));
+        if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+    }

-    err = ompi_request_wait_all( 2, reqs, statuses );
-    if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+    if (0 != scount) {
+        /* send data to children */
+        err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag,
+                                  MCA_PML_BASE_SEND_STANDARD, comm, &reqs[nreq++]));
+        if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+    }

-    if (MPI_STATUS_IGNORE != status) {
-        *status = statuses[0];
+    if (0 != nreq) {
+        err = ompi_request_wait_all( nreq, reqs, statuses );
+        if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+
+        if (MPI_STATUS_IGNORE != status) {
+            *status = statuses[0];
+        }
     }

     return (MPI_SUCCESS);
@@ -68,7 +74,7 @@
     if( MPI_ERR_IN_STATUS == err ) {
         /* At least we know he error was detected during the wait_all */
         int err_index = 0;
-        if( MPI_SUCCESS != statuses[1].MPI_ERROR ) {
+        if( nreq > 1 && MPI_SUCCESS != statuses[1].MPI_ERROR ) {
             err_index = 1;
         }
         if (MPI_STATUS_IGNORE != status) {
@@ -107,25
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
I completely understand there's no "one size fits all", and I appreciate that there are workarounds to the change in behaviour. I'm only trying to make what little contribution I can by questioning the architecture of the pairwise algorithm. I know that for every user you please, there will be some who aren't happy when a default changes (Windows 8, anyone?), but I'm trying to provide some real-world data. If 90% of apps are 10% faster and 10% are 1000% slower, should the default change?

all_to_all is a really nice primitive from a developer's point of view: every process's code is symmetric and identical. Maybe I should have to worry that most of my exchange matrix is sparse; I could probably calculate an optimal exchange pattern myself. But I think this is the implementation's job, and I will continue to argue that *waiting* for each pairwise exchange is (a) unnecessary, (b) doesn't improve performance for *any* application, and (c) at worst causes huge slowdown over the previous algorithm for sparse cases.

In summary: I'm arguing that there's a problem with the pairwise implementation as it stands. It doesn't have any optimisation for sparse all_to_all and imposes unnecessary synchronisation barriers in all cases.

Simon

On 20/12/2012 14:42, Iliev, Hristo wrote:
> Simon,
> The goal of any MPI implementation is to be as fast as possible. [...]
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
Simon,

The goal of any MPI implementation is to be as fast as possible. Unfortunately there is no "one size fits all" algorithm that works on all networks and given all possible kinds of peculiarities that your specific communication scheme may have. That's why there are different algorithms and you are given the option to dynamically select them at run time without the need to recompile the code. I don't think the change of the default algorithm (note that the pairwise algorithm has been there for many years - it is not new, simply the new default one) was introduced in order to piss users off.

If you want OMPI to default to the previous algorithm:

1) Add this to the system-wide OMPI configuration file $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be $PREFIX/etc, with $PREFIX the OMPI installation directory):

   coll_tuned_use_dynamic_rules = 1
   coll_tuned_alltoallv_algorithm = 1

2) The settings from (1) can be overridden on a per-user basis by similar settings in $HOME/.openmpi/mca-params.conf.

3) The settings from (1) and (2) can be overridden on a per-job basis by exporting the MCA parameters as environment variables:

   export OMPI_MCA_coll_tuned_use_dynamic_rules=1
   export OMPI_MCA_coll_tuned_alltoallv_algorithm=1

4) Finally, the settings from (1), (2), and (3) can be overridden per MPI program launch by supplying the appropriate MCA parameters to orterun (a.k.a. mpirun and mpiexec).

There is also a largely undocumented feature of the "tuned" collective component where a dynamic rules file can be supplied. In the file a series of cases tells the library which implementation to use based on the communicator and message sizes. No idea if it works with ALLTOALLV.

Kind regards,
Hristo

(sorry for top posting - damn you, Outlook!)
--
Hristo Iliev, Ph.D.
-- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
On 19/12/12 11:08, Paul Kapinos wrote:
> Did you *really* wanna dig into code just in order to switch a default communication algorithm?

No, I didn't want to, but with a huge change in performance, I'm forced to do something! And having looked at the different algorithms, I think there's a problem with the new default whenever message sizes are small enough that connection latency will dominate. We're not all running InfiniBand, and having to wait for each pairwise exchange to complete before initiating another seems wrong if the latency in waiting for completion dominates the transmission time.

E.g. if I have 10 small pairwise exchanges to perform, isn't it better to put all 10 outbound messages on the wire and wait for the 10 matching inbound messages, in any order? The new algorithm must wait for the first exchange to complete, then the second, then the third. Unlike before, doesn't it also have to wait to acknowledge even a matching *zero-sized* request? I don't see why this temporal ordering matters.

Thanks for your help,
Simon
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
Did you *really* want to dig into the code just in order to switch a default communication algorithm? Note that there are several ways to set the parameters; --mca on the command line is just one of them (suitable for quick online tests).

http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables.

Best,
Paul Kapinos

On 12/19/12 11:44, Number Cruncher wrote:
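For concreteness, the several parameter-setting routes Paul alludes to look roughly like this (parameter names taken from Hristo's reply further down the thread; the application name is a placeholder):

```shell
# 1. Command line, for quick tests (placeholder application name):
#      mpiexec --mca coll_tuned_use_dynamic_rules 1 \
#              --mca coll_tuned_alltoallv_algorithm 1 ./app
# 2. Environment variables, picked up by mpiexec:
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
# 3. Per-user config file, $HOME/.openmpi/mca-params.conf
#    (or $OPAL_PREFIX/etc/openmpi-mca-params.conf for global effect):
#      coll_tuned_use_dynamic_rules=1
#      coll_tuned_alltoallv_algorithm=1
```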
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
Having run some more benchmarks, the new default is *really* bad for our application (2-10x slower), so I've been looking at the source to try and figure out why.

It seems that the biggest difference will occur when the all-to-all is actually sparse (e.g. our application): if most N-M process exchanges are zero in size, the 1.6.0 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually only post irecv/isend for non-zero exchanges; any zero-size exchanges are skipped. It then waits once for all requests to complete.

In contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size exchanges for *every* N-M pair, and wait for each pairwise exchange. This is O(comm_size) waits, many of which are for zero-size exchanges. I'm not clear what optimizations there are for zero-size isend/irecv, but surely there's a great deal more latency if each pairwise exchange has to be confirmed complete before executing the next?

Relatedly, how would I direct Open MPI to use the older algorithm programmatically? I don't want the user to have to use "--mca" in their "mpiexec". Is there a C API?

Thanks,
Simon

On 16/11/12 10:15, Iliev, Hristo wrote:
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
Hi Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion with increasing stride, so it works best when independent communication paths can be established between several ports of the network switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so, some is - it depends (usually on the price). That said, not all algorithms perform the same on a given type of network interconnect. For example, on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear one, which used to be the default; algorithm 2 is the pairwise one.

You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mca-params.conf or (to give it global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMA systems, so it makes sense to activate process binding with --bind-to-core if you haven't already done so. It prevents MPI processes from being migrated to other NUMA nodes while running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On Behalf Of Number Cruncher
> Sent: Thursday, November 15, 2012 5:37 PM
> To: Open MPI Users
> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
[OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of version 1.6.1.

* This is most noticeable for high-frequency exchanges over 1Gb ethernet, where process-to-process message sizes are fairly small (e.g. 100 kbyte) and much of the exchange matrix is sparse.
* The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm to a pairwise exchange", but I'm not clear what this means or how to switch back to the old "non-default algorithm".

I attach a test program which illustrates the sort of usage in our MPI application. I have run this as 32 processes on four nodes, over 1Gb ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0,4,8,... on node 1, ranks 1,5,9,... on node 2, etc.

It constructs an array of integers and an nProcess x nProcess exchange typical of part of our application. This is then exchanged several thousand times. Output from "mpicc -O3" runs is shown below.

My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also attach a plot showing network throughput on our actual mesh generation application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an hour to run. There seems to be much greater network demand in the 1.6.1 version, despite the user code and input data being identical.
Thanks for any help you can give,
Simon

For 1.6.0:

Open MPI 1.6.0
Proc 0:  50 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 198 x 100 int
Proc 1:  38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 148 x 100 int
Proc 2:  29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 109 x 100 int
Proc 3:  22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 80 x 100 int
Proc 4:  16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 58 x 100 int
Proc 5:  12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 41 x 100 int
Proc 6:  8 6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 29 x 100 int
Proc 7:  6 4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 20 x 100 int
Proc 8:  4 3 2 1 0 0 0 0 0 0 0 0 0  Total: 14 x 100 int
Proc 9:  3 2 1 0 0 0 0 0 0 0 0 0  Total: 9 x 100 int
Proc 10: 2 1 0 0 0 0 0 0 0 0 0  Total: 6 x 100 int
Proc 11: 1 0 0 0 0 0 0 0 0 0 0  Total: 4 x 100 int
Proc 12: 0 0 0 0 0 0 0 0 0 0 0  Total: 2 x 100 int
Proc 13: 0 0 0 0 0 0 0 0 0 0 0  Total: 1 x 100 int
Proc 14: 0 0 0 0 0 0 0 0 0 0 0  Total: 1 x 100 int
Proc 15: 0 0 0 0 0 0 0 0 0 0 0  Total: 0 x 100 int
Proc 16: 0 0 0 0 0 0 0 0 0 0 0  Total: 0 x 100 int
Proc 17: 0 0 0 0 0 0 0 0 0 0 0  Total: 1 x 100 int
Proc 18: 0 0 0 0 0 0 0 0 0 0 0  Total: 1 x 100 int
Proc 19: 0 0 0 0 0 0 0 0 0 0 0  Total: 2 x 100 int
Proc 20: 0 0 0 0 0 0 0 0 0 0 1  Total: 4 x 100 int
Proc 21: 0 0 0 0 0 0 0 0 0 1 2  Total: 6 x 100 int
Proc 22: 0 0 0 0 0 0 0 0 0 1 2 3  Total: 9 x 100 int
Proc 23: 0 0 0 0 0 0 0 0 0 1 2 3 4  Total: 14 x 100 int
Proc 24: 0 0 0 0 0 0 0 0 0 1 2 3 4 6  Total: 20 x 100 int
Proc 25: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8  Total: 29 x 100 int
Proc 26: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12  Total: 41 x 100 int
Proc 27: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16  Total: 58 x 100 int
Proc 28: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22  Total: 80 x 100 int
Proc 29: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22