Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
I can argue the opposite: in the most general case, each process will exchange data with all other processes, so a blocking approach as implemented in the current version makes sense. As you noticed, this leads to poor results when the exchange pattern is sparse. We took what we believed was the most common usage of the alltoallv collective, and provided a default algorithm we consider the best for it (pairwise, due to the tightly coupled structure of its communications).

However, as one of the main developers of the collective module, I'm not insensitive to your argument. I would have loved to be able to alter the behavior of the alltoallv to adapt more carefully to the collective pattern itself. Unfortunately, that is very difficult, as alltoallv is one of the few collectives where knowledge about the communication pattern is not evenly distributed among the peers (every rank knows only about the communications in which it is involved). Thus, without requiring extra communications, the only valid parameter that can affect the selection of one of the underlying implementations is the number of participants in the collective (not even the number of participants exchanging real data, but the number of participants in the communicator). Not enough to make a smarter decision.

As suggested several times already in this thread, there are quite a few MCA parameters that allow specialized behaviors for applications with communication patterns we did not consider mainstream. You should definitely take advantage of these to further optimize your applications.

George.

On Dec 21, 2012, at 13:25, Number Cruncher wrote:

> I completely understand there's no "one size fits all", and I appreciate that
> there are workarounds to the change in behaviour. I'm only trying to make
> what little contribution I can by questioning the architecture of the
> pairwise algorithm.
> I know that for every user you please, there will be some that aren't happy
> when a default changes (Windows 8, anyone?), but I'm trying to provide some
> real-world data. If 90% of apps are 10% faster and 10% are 1000% slower,
> should the default change?
>
> all_to_all is a really nice primitive from a developer's point of view. Every
> process' code is symmetric and identical. Maybe I should have to worry that
> most of the matrix exchange is sparse; I probably could calculate an optimal
> exchange pattern. But I think this is the implementation's job, and I will
> continue to argue that *waiting* for each pairwise exchange is (a)
> unnecessary, (b) doesn't improve performance for *any* application and (c) at
> worst causes huge slowdown over the last algorithm for sparse cases.
>
> In summary: I'm arguing that there's a problem with the pairwise
> implementation as it stands. It doesn't have any optimization for sparse
> all_to_all and imposes unnecessary synchronisation barriers in all cases.
>
> Simon
>
> On 20/12/2012 14:42, Iliev, Hristo wrote:
>> Simon,
>>
>> The goal of any MPI implementation is to be as fast as possible.
>> Unfortunately there is no "one size fits all" algorithm that works on all
>> networks and given all possible kinds of peculiarities that your specific
>> communication scheme may have. That's why there are different algorithms and
>> you are given the option to dynamically select them at run time without the
>> need to recompile the code. I don't think the change of the default
>> algorithm (note that the pairwise algorithm has been there for many years -
>> it is not new, simply the new default one) was introduced in order to piss
>> users off.
>> If you want OMPI to default to the previous algorithm:
>>
>> 1) Add this to the system-wide OMPI configuration file
>> $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be
>> $PREFIX/etc, where $PREFIX is the OMPI installation directory):
>>
>>   coll_tuned_use_dynamic_rules = 1
>>   coll_tuned_alltoallv_algorithm = 1
>>
>> 2) The settings from (1) can be overridden on a per-user basis by similar
>> settings in $HOME/.openmpi/mca-params.conf.
>>
>> 3) The settings from (1) and (2) can be overridden on a per-job basis by
>> exporting MCA parameters as environment variables:
>>
>>   export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>   export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>
>> 4) Finally, the settings from (1), (2), and (3) can be overridden per MPI
>> program launch by supplying the appropriate MCA parameters to orterun
>> (a.k.a. mpirun and mpiexec).
>>
>> There is also a largely undocumented feature of the "tuned" collective
>> component where a dynamic rules file can be supplied. In the file, a series
>> of cases tells the library which implementation to use based on the
>> communicator and message sizes. No idea if it works with ALLTOALLV.
>>
>> Kind regards,
>> Hristo
>>
>> (sorry for top posting - damn you, Outlook!)
>> --
>> Hristo Iliev, Ph.D. -- High Performance
Re: [OMPI users] Question about Lost Messages
Corey,

The communication pattern looks legit; it is very difficult to tell what is going wrong without code to look at. Can you provide a simple case (maybe the skeleton of your application) we can work from?

George.

On Dec 20, 2012, at 22:07, Corey Allen wrote:

> Hello,
>
> I am trying to confirm that I am using Open MPI in a correct way. I
> seem to be losing messages, but I don't like to assume there's a bug
> when I'm still new to MPI in general.
>
> I have multiple processes in a master/slaves type setup, and I am
> trying to have multiple persistent non-blocking message requests
> between them to prevent starvation. (Tech detail: 4-core Intel running
> Ubuntu 64-bit and Open MPI 1.4. Everything is local. Total processes is
> 5: one master, four slaves. The problem only surfaces on the slowest
> slave - the one with the most work.)
>
> The setup is like this:
>
> Master:
>
>   Create 3 persistent send requests, with three different buffers (in a 2D array)
>   Load data into each buffer
>   Start each send request
>   In a loop:
>     TestSome on the 3 sends
>     for each send that's completed
>       load new data into the buffer
>       restart that send
>   loop
>
> Slave:
>
>   Create 3 persistent receive requests, with three different buffers (in a 2D array)
>   Start each receive request
>   In a loop:
>     WaitAny on the 3 receives
>     Consume data from the one receive buffer from WaitAny
>     Start that receive again
>   loop
>
> Basically what I'm seeing is that the master gets a "completed" send
> request from TestSome and loads new data, restarts, etc., but the slave
> never sees that particular message. I was under the impression that
> WaitAny should return only one message but also should eventually
> return every message sent in this situation.
>
> I am operating under the assumption that even if the send request is
> completed and the buffer overwritten in the master, the receive for
> that message eventually occurs with the correct data in the slave.
> I did not think I had to advise the master that the slave was finished
> reading data out of the receive buffer before the master could reuse
> the send buffer.
>
> What it LOOKS like to me is that WaitAny is marking more than one send
> completed, so the master sends the next message, but I can't see it in
> the slave.
>
> I hope this is making sense. Any input on whether I'm doing this wrong
> or a way to see if the message is really being lost would be helpful.
> If there's a good example code of multiple simultaneous asynchronous
> messages to avoid starvation that is set up better than my approach,
> I'd like to see it.
>
> Thanks!
>
> Corey
Re: [OMPI users] Open MPI planned outage
Oops! The times that were sent were wrong. Here are the correct times:

- 3:00am-09:00am Pacific US time
- 4:00am-10:00am Mountain US time
- 5:00am-11:00am Central US time
- 6:00am-12:00pm Eastern US time
- 11:00am-05:00pm GMT

On Dec 21, 2012, at 12:44 PM, Jeff Squyres wrote:

> Our Indiana U. hosting providers will be doing some maintenance over the
> holiday break.
>
> All Open MPI services -- web, trac, SVN, etc. -- will be down on
> Wednesday, December 26th, 2012 during the following time period:
>
> - 5:00am-11:00am Pacific US time
> - 6:00am-12:00pm Mountain US time
> - 7:00am-01:00pm Central US time
> - 6:00am-02:00pm Eastern US time
> - 11:00am-05:00pm GMT
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] broadcasting basic data items in Java
Hi

> Hmmm...weird. Well, it looks like OMPI itself is okay, so the issue
> appears to be in the Java side of things. For whatever reason, your
> Java VM is refusing to allow a malloc to succeed. I suspect it has
> something to do with its setup, but I'm not enough of a Java person
> to point you to the problem.
>
> Is it possible that the program was compiled against a different
> (perhaps incompatible) version of Java?

No, I don't think so. A small Java program without MPI methods works.

linpc1 bin 122 which mpicc
/usr/local/openmpi-1.9_64_cc/bin/mpicc
linpc1 bin 123 pwd
/usr/local/openmpi-1.9_64_cc/bin
linpc1 bin 124 grep jdk *
mpijavac:my $my_compiler = "/usr/local/jdk1.7.0_07-64/bin/javac";
mpijavac.pl:my $my_compiler = "/usr/local/jdk1.7.0_07-64/bin/javac";
linpc1 bin 125 which java
/usr/local/jdk1.7.0_07-64/bin/java
linpc1 bin 126

linpc1 prog 110 javac MiniProgMain.java
linpc1 prog 111 java MiniProgMain
Message 0
Message 1
Message 2
Message 3
Message 4
linpc1 prog 112 mpiexec java MiniProgMain
Message 0
Message 1
Message 2
Message 3
Message 4
linpc1 prog 113 mpiexec -np 2 java MiniProgMain
Message 0
Message 1
Message 2
Message 3
Message 4
Message 0
Message 1
Message 2
Message 3
Message 4

A small program which allocates a buffer for a new string:

  ...
  stringBUFLEN = new String (string.substring (0, len));
  ...

linpc1 prog 115 javac MemAllocMain.java
linpc1 prog 116 java MemAllocMain
Type something ("quit" terminates program): ffghhfhh
Received input: ffghhfhh
Converted to upper case: FFGHHFHH
Type something ("quit" terminates program): quit
Received input: quit
Converted to upper case: QUIT
linpc1 prog 117 mpiexec java MemAllocMain
Type something ("quit" terminates program): fbhshnhjs
Received input: fbhshnhjs
Converted to upper case: FBHSHNHJS
Type something ("quit" terminates program): quit
Received input: quit
Converted to upper case: QUIT
linpc1 prog 118

I'm not sure if this is of any help, but the problem starts with MPI methods.
The following program calls just the Init() and Finalize() methods.

tyr java 203 mpiexec -host linpc1 java InitFinalizeMain
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during opal_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS
...

Hopefully somebody will have an idea what goes wrong on my Linux system.
Thank you very much for any help in advance.

Kind regards

Siegmar

> Just shooting in the dark here - I suspect you'll have to ask someone
> more knowledgeable on JVMs.
>
> On Dec 21, 2012, at 7:32 AM, Siegmar Gross wrote:
>
>> Hi
>>
>>> I can't speak to the other issues, but for these - it looks like
>>> something isn't right in the system. Could be an incompatibility
>>> with Suse 12.1.
>>>
>>> What the errors are saying is that malloc is failing when used at
>>> a very early stage in starting the process. Can you run even a
>>> C-based MPI "hello" program?
>>
>> Yes. I have implemented more or less the same program in C and Java.
>>
>> tyr hello_1 131 mpiexec -np 2 -host linpc0,linpc1 hello_1_mpi
>> Process 0 of 2 running on linpc0
>> Process 1 of 2 running on linpc1
>>
>> Now 1 slave tasks are sending greetings.
>>
>> Greetings from task 1:
>>   message type:  3
>>   msg length:    132 characters
>>   message:
>>     hostname:          linpc1
>>     operating system:  Linux
>>     release:           3.1.10-1.16-desktop
>>     processor:         x86_64
>>
>> tyr hello_1 132 mpiexec -np 2 -host linpc0,linpc1 java HelloMainWithBarrier
>> --
>> It looks like opal_init failed for some reason; your parallel process is
>> likely to abort.
>> There are many reasons that a parallel process can
>> fail during opal_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   mca_base_open failed
>>   --> Returned value -2 instead of OPAL_SUCCESS
>> ...
>>
>> Thank you very much for any help in advance.
>>
>> Kind regards
>>
>> Siegmar
>>
>>> On Dec 21, 2012, at 1:41 AM, Siegmar Gross wrote:
>>>
>>>> The program breaks if I use two Linux.x86_64 machines (Open Suse 12.1).
>>>>
>>>> linpc1 etc 101 mpiexec -np 2 -host linpc0,linpc1 java BcastIntArrayMain
Re: [OMPI users] broadcasting basic data items in Java
I can't speak to the other issues, but for these - it looks like something
isn't right in the system. Could be an incompatibility with Suse 12.1.

What the errors are saying is that malloc is failing when used at a very
early stage in starting the process. Can you run even a C-based MPI "hello"
program?

On Dec 21, 2012, at 1:41 AM, Siegmar Gross wrote:

> The program breaks if I use two Linux.x86_64 machines (Open Suse 12.1).
>
> linpc1 etc 101 mpiexec -np 2 -host linpc0,linpc1 java BcastIntArrayMain
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   mca_base_open failed
>   --> Returned value -2 instead of OPAL_SUCCESS
>   ...
>   ompi_mpi_init: orte_init failed
>   --> Returned "Out of resource" (-2) instead of "Success" (0)
> --
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [(null):10586] Local abort before MPI_INIT completed successfully; not able to
> aggregate error messages, and not able to guarantee that all other processes
> were killed!
> ---
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> ---
> --
> mpiexec detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[16706,1],1]
>   Exit code:    1
> --
>
> I use a valid environment on all machines. The problem occurs as well
> when I compile and run the program directly on the Linux system.
> linpc1 java 101 mpijavac BcastIntMain.java
> linpc1 java 102 mpiexec -np 2 -host linpc0,linpc1 java -cp `pwd` BcastIntMain
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   mca_base_open failed
>   --> Returned value -2 instead of OPAL_SUCCESS
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
I completely understand there's no "one size fits all", and I appreciate that
there are workarounds to the change in behaviour. I'm only trying to make what
little contribution I can by questioning the architecture of the pairwise
algorithm.

I know that for every user you please, there will be some that aren't happy
when a default changes (Windows 8, anyone?), but I'm trying to provide some
real-world data. If 90% of apps are 10% faster and 10% are 1000% slower,
should the default change?

all_to_all is a really nice primitive from a developer's point of view. Every
process' code is symmetric and identical. Maybe I should have to worry that
most of the matrix exchange is sparse; I probably could calculate an optimal
exchange pattern. But I think this is the implementation's job, and I will
continue to argue that *waiting* for each pairwise exchange is (a)
unnecessary, (b) doesn't improve performance for *any* application and (c) at
worst causes huge slowdown over the last algorithm for sparse cases.

In summary: I'm arguing that there's a problem with the pairwise
implementation as it stands. It doesn't have any optimization for sparse
all_to_all and imposes unnecessary synchronisation barriers in all cases.

Simon

On 20/12/2012 14:42, Iliev, Hristo wrote:
> Simon,
>
> The goal of any MPI implementation is to be as fast as possible.
> Unfortunately there is no "one size fits all" algorithm that works on all
> networks and given all possible kinds of peculiarities that your specific
> communication scheme may have. That's why there are different algorithms and
> you are given the option to dynamically select them at run time without the
> need to recompile the code. I don't think the change of the default
> algorithm (note that the pairwise algorithm has been there for many years -
> it is not new, simply the new default one) was introduced in order to piss
> users off.
> If you want OMPI to default to the previous algorithm:
>
> 1) Add this to the system-wide OMPI configuration file
> $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be
> $PREFIX/etc, where $PREFIX is the OMPI installation directory):
>
>   coll_tuned_use_dynamic_rules = 1
>   coll_tuned_alltoallv_algorithm = 1
>
> 2) The settings from (1) can be overridden on a per-user basis by similar
> settings in $HOME/.openmpi/mca-params.conf.
>
> 3) The settings from (1) and (2) can be overridden on a per-job basis by
> exporting MCA parameters as environment variables:
>
>   export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>   export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>
> 4) Finally, the settings from (1), (2), and (3) can be overridden per MPI
> program launch by supplying the appropriate MCA parameters to orterun
> (a.k.a. mpirun and mpiexec).
>
> There is also a largely undocumented feature of the "tuned" collective
> component where a dynamic rules file can be supplied. In the file, a series
> of cases tells the library which implementation to use based on the
> communicator and message sizes. No idea if it works with ALLTOALLV.
>
> Kind regards,
> Hristo
>
> (sorry for top posting - damn you, Outlook!)
> --
> Hristo Iliev, Ph.D. -- High Performance Computing
> RWTH Aachen University, Center for Computing and Communication
> Rechen- und Kommunikationszentrum der RWTH Aachen
> Seffenter Weg 23, D 52074 Aachen (Germany)
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Number Cruncher
> Sent: Wednesday, December 19, 2012 5:31 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>
> On 19/12/12 11:08, Paul Kapinos wrote:
>> Did you *really* want to dig into code just in order to switch a default
>> communication algorithm?
>
> No, I didn't want to, but with a huge change in performance, I'm forced to
> do something!
> And having looked at the different algorithms, I think there's a problem
> with the new default whenever message sizes are small enough that connection
> latency will dominate. We're not all running Infiniband, and having to wait
> for each pairwise exchange to complete before initiating another seems wrong
> if the latency in waiting for completion dominates the transmission time.
>
> E.g. if I have 10 small pairwise exchanges to perform, isn't it better to
> put all 10 outbound messages on the wire, and wait for 10 matching inbound
> messages, in any order? The new algorithm must wait for the first exchange
> to complete, then the second exchange, then the third. Unlike before, does
> it not have to wait to acknowledge the matching *zero-sized* request? I
> don't see why this temporal ordering matters.
>
> Thanks for your help,
> Simon
>
>> Note there are several ways to set the parameters; --mca on the command
>> line is just one of them (suitable for quick online tests).
>>
>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>
>> We 'tune' our Open MPI by setting environment variables.
>>
>> Best,
>> Paul Kapinos
>>
>> On 12/19/12 11:44, Number Cruncher wrote:
>>> Having run some more benchmarks, the new default is *really* bad
Re: [OMPI users] OpenMPI with cMake on Windows
On 18 December 2012 22:04, Stephen Conley wrote:

> Hello,
>
> I have installed CMake version 2.8.10.2 and OpenMPI version 1.6.2 on a
> 64-bit Windows 7 computer.
>
> OpenMPI is installed in "C:\program files\OpenMPI" and the path has been
> updated to include the bin subdirectory.
>
> In the CMakeLists.txt file, I have: find_package(MPI REQUIRED)
>
> When I run cmake, I receive the following error:
>
> C:\Users\steve\workspace\Dales\build>cmake ..\src -G "MinGW Makefiles"
> CMake Error at C:/Program Files (x86)/CMake
> 2.8/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:97 (message):
>   Could NOT find MPI_C (missing: MPI_C_LIBRARIES)
> Call Stack (most recent call first):
>   C:/Program Files (x86)/CMake
> 2.8/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:291
> (_FPHSA_FAILURE_MESSAGE)
>   C:/Program Files (x86)/CMake 2.8/share/cmake-2.8/Modules/FindMPI.cmake:587
> (find_package_handle_standard_args)
>   CMakeLists.txt:9 (find_package)
>
> -- Configuring incomplete, errors occurred!
>
> Any ideas as to what I am missing?

Try to google for some changes I've made to FindMPI.cmake that worked for
Windows a few years ago. They have to do with an if-test syntax in that file
that cmd.exe doesn't accept. This was CMake 2.8.4, I believe, but it's
probably still true now. I'll answer with more details in a month's time.

Regards,
[OMPI users] broadcasting basic data items in Java
Hi,

I'm still using "Open MPI: 1.9a1r27668" and Java 1.7.0_07. Today I implemented
a few programs to broadcast int, int[], double, or double[]. I can compile all
four programs without problems, which means that "Object buf" as a parameter
in "MPI.COMM_WORLD.Bcast" isn't a problem for basic datatypes. Unfortunately I
only get the expected result for arrays of a basic datatype.

Process 1 doesn't receive an int value (both processes run on Solaris 10
Sparc):

tyr java 159 mpiexec -np 2 java BcastIntMain
Process 1 running on tyr.informatik.hs-fulda.de.
  intValue: 0
Process 0 running on tyr.informatik.hs-fulda.de.
  intValue: 1234567

Process 1 receives all values from an int array:

tyr java 160 mpiexec -np 2 java BcastIntArrayMain
Process 0 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 1234567  intValues[1]: 7654321
Process 1 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 1234567  intValues[1]: 7654321

The program breaks if I use one little-endian and one big-endian machine:

tyr java 161 mpiexec -np 2 -host sunpc0,tyr java BcastIntMain
[tyr:7657] *** An error occurred in MPI_Comm_dup
[tyr:7657] *** reported by process [3150053377,1]
[tyr:7657] *** on communicator MPI_COMM_WORLD
[tyr:7657] *** MPI_ERR_INTERN: internal error
[tyr:7657] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tyr:7657] *** and potentially your MPI job)

The program works if I use two "Solaris 10 x86_64" machines:

tyr java 163 mpiexec -np 2 -host sunpc0,sunpc1 java BcastIntArrayMain
Process 1 running on sunpc1.
  intValues[0]: 1234567  intValues[1]: 7654321
Process 0 running on sunpc0.
  intValues[0]: 1234567  intValues[1]: 7654321

The program breaks if I use two Linux.x86_64 machines (Open Suse 12.1):

linpc1 etc 101 mpiexec -np 2 -host linpc0,linpc1 java BcastIntArrayMain
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.
There are many reasons that a parallel process can fail during opal_init; some
of which are due to configuration or environment problems. This failure
appears to be an internal failure; here's some additional information (which
may only be relevant to an Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS
  ...
  ompi_mpi_init: orte_init failed
  --> Returned "Out of resource" (-2) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[(null):10586] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
---
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:

  Process name: [[16706,1],1]
  Exit code:    1
--

I use a valid environment on all machines. The problem occurs as well when I
compile and run the program directly on the Linux system.

linpc1 java 101 mpijavac BcastIntMain.java
linpc1 java 102 mpiexec -np 2 -host linpc0,linpc1 java -cp `pwd` BcastIntMain
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during opal_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS

I get the same errors for the programs with double values. Does anybody have
any suggestions how to solve the problems?

Thank you very much for any help in advance.
Kind regards,

Siegmar

Attachments:
- BcastIntMain.java
- BcastIntArrayMain.java
- BcastDoubleMain.java
- BcastDoubleArrayMain.java