Re: [OMPI users] openmpi credits for eager messages
Ron's comments are probably dead on for an application like bug3. If bug3 is long running and libmpi is doing eager-protocol buffer management as I contend the standard requires, then the producers will not get far ahead of the consumer before they are forced to synchronous send under the covers anyway. From then on, producers will run no faster than their output can be absorbed. They will spend the nonproductive parts of their time blocked in either MPI_Send or MPI_Ssend. The job will not finish until the consumer finishes, because the consumer is a constant bottleneck; the slow consumer is the major drag on scalability. As long as the producers can be expected to outrun the consumer for the life of the job, you will probably find it hard to measure a difference between synchronous send and flow-controlled standard send.

Eager protocol gets more interesting when the pace of the consumer and of the producers is variable. If the consumer can absorb a message per millisecond and the production rate is close to one message per millisecond but fluctuates a bit, then eager protocol may speed the whole job significantly. The producers can never get ahead with synchronous send, even in a phase when they might be able to create a message every 1/2 millisecond; they spend half of this easy phase blocked in MPI_Ssend. If the producers now enter a compute-intensive phase where messages can only be generated once every 2 milliseconds, the consumer spends time idle. If the consumer had been able to accumulate queued messages with eager protocol when the producers were able to run faster, it could now make itself useful catching up. Both producers and consumer would come closer to 100% productive work and the job would finish sooner.
Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/05/2008 01:26:24 PM:

> > Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> > reasonable memory usage and the execution proceeds normally.
> >
> > Re scalable: One second. I know well bug3 is not scalable, and when to
> > use MPI_Isend. The point is programmers want to count on the MPI spec as
> > written, as Richard pointed out. We want to send small messages quickly
> > and efficiently, without the danger of overloading the receiver's
> > resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().
>
> Your last statement is not necessarily true. By synchronizing processes
> using MPI_Ssend(), you can potentially avoid large numbers of unexpected
> messages that need to be buffered and copied, and that also need to be
> searched every time a receive is posted. There is no guarantee that the
> protocol overhead on each message incurred with MPI_Ssend() slows down an
> application more than the buffering, copying, and searching overhead of a
> large number of unexpected messages.
>
> It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong
> micro-benchmarks, but the length of the unexpected message queue doesn't
> have to get very long before they are about the same.
>
> > Since identifying this behavior we have implemented the desired flow
> > control in our application.
>
> It would be interesting to see performance results comparing doing flow
> control in the application versus having MPI do it for you.
>
> -Ron
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openmpi credits for eager messages
> Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> reasonable memory usage and the execution proceeds normally.
>
> Re scalable: One second. I know well bug3 is not scalable, and when to
> use MPI_Isend. The point is programmers want to count on the MPI spec as
> written, as Richard pointed out. We want to send small messages quickly
> and efficiently, without the danger of overloading the receiver's
> resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().

Your last statement is not necessarily true. By synchronizing processes using MPI_Ssend(), you can potentially avoid large numbers of unexpected messages that need to be buffered and copied, and that also need to be searched every time a receive is posted. There is no guarantee that the protocol overhead on each message incurred with MPI_Ssend() slows down an application more than the buffering, copying, and searching overhead of a large number of unexpected messages.

It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong micro-benchmarks, but the length of the unexpected message queue doesn't have to get very long before they are about the same.

> Since identifying this behavior we have implemented the desired flow
> control in our application.

It would be interesting to see performance results comparing doing flow control in the application versus having MPI do it for you.

-Ron
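For the record, a sketch of the kind of application-level flow control mentioned above, assuming a simple ACK scheme; TAG_DATA, TAG_ACK, CHUNK and the function names are invented for illustration, not taken from the actual application:

```c
#include <mpi.h>

#define TAG_DATA 1
#define TAG_ACK  2
#define CHUNK    64            /* messages allowed in flight per ACK */

/* producer side: after every CHUNK standard sends, block on a zero-byte
 * ACK so the consumer's unexpected queue stays bounded */
void throttled_send(const char *buf, int len, int dest, long i)
{
    MPI_Send((void *)buf, len, MPI_CHAR, dest, TAG_DATA, MPI_COMM_WORLD);
    if ((i + 1) % CHUNK == 0)
        MPI_Recv(NULL, 0, MPI_CHAR, dest, TAG_ACK,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* consumer side: return an ACK every CHUNK receives from this source */
void acked_recv(char *buf, int len, int src, long i)
{
    MPI_Recv(buf, len, MPI_CHAR, src, TAG_DATA,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if ((i + 1) % CHUNK == 0)
        MPI_Send(NULL, 0, MPI_CHAR, src, TAG_ACK, MPI_COMM_WORLD);
}
```

The producer can run at most CHUNK messages ahead of the consumer, which is the same credit idea the eager protocol uses, just enforced in the application.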
Re: [OMPI users] mpirun, paths and xterm again
Jody,

jody wrote:
> Hi Tim
>> Your desktop is plankton, and you want to run a job on both plankton
>> and nano, and have xterms show up on nano.
> Not on nano, but on plankton, but i think this was just a typo :)

Correct.

>> It looks like you are already doing this, but to make sure, the way I
>> would use xhost is:
>> plankton$ xhost +nano_00
>> plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0
>> xterm -hold -e ../MPITest
> This gives me 2 lines of
> xterm Xt error: Can't open display: plankton:0.0
>
>> Can you try running:
>> plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv
> This yields
> DISPLAY=plankton:0.0
>> just to make sure the environment variable is being properly set.
>>
>> You might also try:
>> in terminal 1:
>> plankton$ xhost +nano_00
>>
>> in terminal 2:
>> plankton$ ssh -x nano_00
>> nano_00$ export DISPLAY="plankton:0.0"
>> nano_00$ xterm
> This experiment also gives
> xterm Xt error: Can't open display: plankton:0.0
>> This will ssh into nano, disabling ssh X forwarding, and try to launch
>> an xterm. If this does not work, then something is wrong with your X
>> setup. If it does work, it should work with Open MPI as well.
> So i guess something is wrong with my X setup. I wonder what it could be ...

So this is an X issue, not an Open MPI issue then. I do not know enough about X setup to help here...

> Doing the same with X11 forwarding works perfectly. But why is X11
> forwarding bad? Or differently asked, does Open MPI make the ssh
> connection in such a way that X11 forwarding is disabled?

What Open MPI does is use ssh to launch a daemon on a remote node, then disconnect the ssh session. This is done to prevent running out of resources at scale. We then send a message to the daemon to launch the client application. So we are not doing anything to prevent ssh X11 forwarding; it is just that by the time the application is launched, the ssh sessions are no longer around.

There is a way to force the ssh sessions to stay open. However, doing so will result in a bunch of excess debug output.
If you add "--debug-daemons" to the mpirun command line, the ssh connections should stay open.

Hope this helps,
Tim
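Putting Tim's pieces together, the full invocation would look something like the following (host names, hostfile, and program path follow the earlier messages in this thread; expect extra daemon debug output):

```shell
# on the desktop (plankton): allow nano to draw on our display,
# then keep the ssh sessions -- and with them any X forwarding -- alive
xhost +nano_00
mpirun --debug-daemons -np 4 --hostfile testhosts \
       -x DISPLAY=plankton:0.0 xterm -hold -e ./MPITest
```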
Re: [OMPI users] openmpi credits for eager messages
Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has reasonable memory usage and the execution proceeds normally.

Re scalable: One second. I know well bug3 is not scalable, and when to use MPI_Isend. The point is programmers want to count on the MPI spec as written, as Richard pointed out. We want to send small messages quickly and efficiently, without the danger of overloading the receiver's resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().

Since identifying this behavior we have implemented the desired flow control in our application.

Thanks,
fds

-----Original Message-----
From: Brightwell, Ronald [mailto:rbbr...@sandia.gov]
Sent: Monday, February 04, 2008 4:35 PM
To: Sacerdoti, Federico
Cc: Open MPI Users
Subject: Re: [OMPI users] openmpi credits for eager messages

On Mon Feb 4, 2008 14:23:13, Sacerdoti, Federico wrote:
> To keep this out of the weeds, I have attached a program called "bug3"
> that illustrates this problem on openmpi 1.2.5 using the openib BTL. In
> bug3 the process with rank 0 uses all available memory buffering
> "unexpected" messages from its neighbors.
>
> Bug3 is a test-case derived from a real, scalable application (desmond
> for molecular dynamics) that several experienced MPI developers have
> worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> openmpi silently sends them in the background and overwhelms process 0
> due to lack of flow control.

This looks like an N->1 communication pattern to me. This is the definition of not scalable.

> It may not be hard to change desmond to work around openmpi's small
> message semantics, but a programmer should reasonably be allowed to
> think a blocking send will block if the receiver cannot handle it yet.

It's actually pretty easy -- change MPI_Send() to MPI_Ssend(). It sounds like you may be confused by what the term "blocking" means in MPI...

-Ron
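For reference, the shape of the fix Ron describes is a one-line change. A stripped-down sketch of the N->1 pattern (buffer size and message count are made up here, not bug3's actual values):

```c
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    memset(buf, 0, sizeof(buf));

    if (rank == 0) {
        /* drain the senders one at a time */
        for (i = 1; i < nprocs; i++)
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, i, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        /* MPI_Ssend() does not complete until the matching receive is
         * posted, so rank 0 can never be buried in unexpected messages;
         * with MPI_Send() here, the implementation may buffer eagerly */
        MPI_Ssend(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```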
Re: [OMPI users] mpirun, paths and xterm again
Hi Tim

> Your desktop is plankton, and you want
> to run a job on both plankton and nano, and have xterms show up on nano.
Not on nano, but on plankton, but i think this was just a typo :)

> It looks like you are already doing this, but to make sure, the way I
> would use xhost is:
> plankton$ xhost +nano_00
> plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0
> xterm -hold -e ../MPITest
This gives me 2 lines of
xterm Xt error: Can't open display: plankton:0.0

> Can you try running:
> plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv
This yields
DISPLAY=plankton:0.0
OMPI_MCA_orte_precondition_transports=4a0f9ccb4c13cd0e-6255330fbb0289f9
OMPI_MCA_rds=proxy
OMPI_MCA_ras=proxy
OMPI_MCA_rmaps=proxy
OMPI_MCA_pls=proxy
OMPI_MCA_rmgr=proxy
SHELL=/bin/bash
SSH_CLIENT=130.60.49.141 59524 22
USER=jody
LD_LIBRARY_PATH=/opt/openmpi/lib
SSH_AUTH_SOCK=/tmp/ssh-enOzt24653/agent.24653
MAIL=/var/mail/jody
PATH=/opt/openmpi/bin:/usr/local/bin:/bin:/usr/bin
PWD=/home/jody
SHLVL=1
HOME=/home/jody
LOGNAME=jody
SSH_CONNECTION=130.60.49.141 59524 130.60.49.128 22
_=/opt/openmpi/bin/orted
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_mpi_paffinity_processor=0
OMPI_MCA_universe=j...@aim-plankton.unizh.ch:default-universe-10265
OMPI_MCA_ns_replica_uri=0.0.0;tcp://130.60.49.141:50310
OMPI_MCA_gpr_replica_uri=0.0.0;tcp://130.60.49.141:50310
OMPI_MCA_orte_app_num=0
OMPI_MCA_orte_base_nodename=nano_00
OMPI_MCA_ns_nds=env
OMPI_MCA_ns_nds_cellid=0
OMPI_MCA_ns_nds_jobid=1
OMPI_MCA_ns_nds_vpid=0
OMPI_MCA_ns_nds_vpid_start=0
OMPI_MCA_ns_nds_num_procs=1

> just to make sure the environment variable is being properly set.
>
> You might also try:
> in terminal 1:
> plankton$ xhost +nano_00
>
> in terminal 2:
> plankton$ ssh -x nano_00
> nano_00$ export DISPLAY="plankton:0.0"
> nano_00$ xterm
This experiment also gives
xterm Xt error: Can't open display: plankton:0.0

> This will ssh into nano, disabling ssh X forwarding, and try to launch
> an xterm. If this does not work, then something is wrong with your X
> setup. If it does work, it should work with Open MPI as well.
So i guess something is wrong with my X setup. I wonder what it could be ...

Doing the same with X11 forwarding works perfectly. But why is X11 forwarding bad? Or differently asked, does Open MPI make the ssh connection in such a way that X11 forwarding is disabled?

Thank You
Jody
Re: [OMPI users] mpirun, paths and xterm again
Hi Jody,

Just to make sure I understand: your desktop is plankton, and you want to run a job on both plankton and nano, and have xterms show up on nano.

It looks like you are already doing this, but to make sure, the way I would use xhost is:

plankton$ xhost +nano_00
plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0 xterm -hold -e ../MPITest

Can you try running:

plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv

just to make sure the environment variable is being properly set.

You might also try:

in terminal 1:
plankton$ xhost +nano_00

in terminal 2:
plankton$ ssh -x nano_00
nano_00$ export DISPLAY="plankton:0.0"
nano_00$ xterm

This will ssh into nano, disabling ssh X forwarding, and try to launch an xterm. If this does not work, then something is wrong with your X setup. If it does work, it should work with Open MPI as well.

For your second question: I'm not sure why there would be a difference in finding the shared libraries in gdb vs. with the xterm.

Tim

jody wrote:
> Hi
> Sorry to bring this subject up again - but i have a problem getting
> xterms running for all of my processes (for debugging purposes).
> There are actually two problems involved: display, and paths.
> My ssh is set up so that X forwarding is allowed, and, indeed,
>   ssh nano_00 xterm
> opens an xterm from the remote machine nano_00.
> When i run my program normally, it works ok:
>   [jody]:/mnt/data1/neander:$mpirun -np 4 --hostfile testhosts ./MPITest
>   [aim-plankton.unizh.ch]I am #0/4 global
>   [aim-plankton.unizh.ch]I am #1/4 global
>   [aim-nano_00]I am #2/4 global
>   [aim-nano_00]I am #3/4 global
> But when i try to see it in xterms
>   [jody]:/mnt/data1/neander:$mpirun -np 4 --hostfile testhosts -x DISPLAY xterm -hold -e ./MPITest
>   xterm Xt error: Can't open display: :0.0
>   xterm Xt error: Can't open display: :0.0
> (same happens if i set DISPLAY=plankton:0.0, or if i use plankton's ip
> address; and xhost is enabled for nano_00)
> The other two (the "local") xterms open, but they display the message:
>   ./MPITest: error while loading shared libraries: libmpi_cxx.so.0:
>   cannot open shared object file: No such file or directory
> (This also happens if i only have local processes)
> So my first question is: what do i do to enable nano_00 to display an
> xterm on plankton? Using normal ssh there seems to be no problem.
> Second question: why does the use of xterm "hide" the open-mpi libs?
> Interestingly: if i use xterm with gdb to start my application, it works.
> Any ideas?
> Thank you
> Jody
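On the second question, one common workaround -- assuming (a guess, not established in this thread) that the shell xterm spawns does not inherit LD_LIBRARY_PATH -- is to launch the binary through a small wrapper script that restores it. The script name and library path below are illustrative:

```shell
# run_in_xterm.sh -- restore the library path, then exec the real program
cat > run_in_xterm.sh <<'EOF'
#!/bin/sh
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
exec "$@"
EOF
chmod +x run_in_xterm.sh

# quick local check that the wrapper exports the path
./run_in_xterm.sh printenv LD_LIBRARY_PATH
```

Then: mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0 xterm -hold -e ./run_in_xterm.sh ./MPITest. The fact that gdb works suggests the environment differs between the two launch paths, which is consistent with this guess.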
Re: [OMPI users] MPI piggyback mechanism
Thank you Josh, that's interesting. I'll have a look.

--Oleg

On Feb 5, 2008 2:39 PM, Josh Hursey wrote:
> Oleg,
>
> Interesting work. You mentioned late in your email that you believe
> that adding support for piggybacking to the MPI standard would be the
> best solution. As you may know, the MPI Forum has reconvened and there
> is a working group for Fault Tolerance. This working group is
> discussing a piggybacking interface proposal for the standard, amongst
> other things. If you are interested in contributing to this
> conversation you can find the mailing list here:
> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft
>
> Best,
> Josh
>
> On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote:
>
> > Hi,
> >
> > I've been working on an MPI piggyback technique as a part of my PhD
> > work.
> >
> > Although MPI does not provide native support, there are several
> > different solutions to transmit piggyback data over every MPI
> > communication. You may find a brief overview in papers [1, 2]. This
> > includes copying the original message and the extra data to a bigger
> > buffer, sending an additional message, or changing the sendtype to a
> > dynamically created wrapper datatype that contains a pointer to the
> > original data and the piggyback data. I have tried all mechanisms and
> > they work, but considering the overhead, there is no "the best"
> > technique that outperforms the others in all scenarios. Jeff Squyres
> > had interesting comments on this subject before (in this mailing list).
> >
> > Finally, after some benchmarking, I have implemented a hybrid
> > technique that combines existing mechanisms. For small point-to-point
> > messages, datatype wrapping seems to be the least intrusive, at least
> > considering the Open MPI implementation of derived datatypes. For
> > large point-to-point messages, experiments confirmed that sending an
> > additional message is much cheaper than wrapping (and besides, the
> > intrusion is small as we are already sending a large message).
> > Moreover, the implementation may interleave the original send with an
> > asynchronous send of the piggyback data. This optimization partially
> > hides the latency of the additional send and lowers overall intrusion.
> >
> > The same criteria can be applied to collective operations, except
> > barrier and reduce operations. As the former does not transmit any
> > data and the latter transforms the data, the only solution is to send
> > additional messages.
> >
> > There is a penalty of course. Especially for collective operations
> > with very small messages the intrusion may reach 15%, and that's a
> > lot. It then decreases down to 0.1% for bigger messages, but anyway
> > it's still there. I don't know what are your requirements/expectations
> > for that issue. The only work that reported lower overheads is [3],
> > but they added native piggyback support by changing the underlying
> > MPI implementation.
> >
> > I think the best possible option is to add piggyback support to MPI
> > as a part of the standard. A growing number of runtime tools use this
> > functionality for multiple reasons, and certainly PMPI itself is not
> > enough.
> >
> > References of interest:
> >
> > [1] Shende, S., Malony, A., Morris, A., Wolf, F. "Performance
> > Profiling Overhead Compensation for MPI Programs". 12th EuroPVM-MPI
> > Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various
> > techniques and come up with datatype wrapping.
> >
> > [2] Schulz, M., "Extracting Critical Path Graphs from MPI
> > Applications". Cluster Computing 2005, IEEE International, pp. 1-10,
> > September 2005. They use datatype wrapping.
> >
> > [3] Jeffrey Vetter, "Dynamic Statistical Profiling of Communication
> > Activity in Distributed Applications". They add support for piggyback
> > at the MPI implementation level and report very low overheads (no
> > surprise).
> >
> > Regards,
> > Oleg Morajko
> >
> > On Feb 1, 2008 5:08 PM, Aurélien Bouteiller wrote:
> >
> >> I don't know of any work in that direction for now. Indeed, we plan
> >> to eventually integrate at least causal message logging in the
> >> pml-v, which also includes piggybacking. Therefore we are open for
> >> collaboration with you on this matter. Please let us know :)
> >>
> >> Aurelien
> >>
> >> Le 1 févr. 08 à 09:51, Thomas Ropars a écrit :
> >>
> >>> Hi,
> >>>
> >>> I'm currently working on optimistic message logging and I would
> >>> like to implement an optimistic message logging protocol in Open
> >>> MPI. Optimistic message logging protocols piggyback information
> >>> about dependencies between processes on the application messages
> >>> to be able to find a consistent global state after a failure.
> >>> That's why I'm interested in the problem of piggybacking
> >>> information on MPI messages.
> >>>
> >>> Is there some work on this problem at the moment?
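The datatype-wrapping variant Oleg describes can be sketched roughly as follows. The piggyback record and function name are invented here for illustration, and the receiver would have to build a matching wrapper around its own buffers:

```c
#include <mpi.h>

/* hypothetical piggyback record: a logical clock and a source id */
typedef struct { int clock; int src; } pb_t;

/* send buf plus a piggyback record in one message, by building a struct
 * datatype over absolute addresses and sending from MPI_BOTTOM */
int send_with_piggyback(void *buf, int count, MPI_Datatype dt,
                        int dest, int tag, MPI_Comm comm, pb_t *pb)
{
    int          blens[2] = { count, 1 };
    MPI_Aint     disps[2];
    MPI_Datatype types[2] = { dt, MPI_2INT };  /* pb_t is two ints */
    MPI_Datatype wrapped;
    int          rc;

    MPI_Get_address(buf, &disps[0]);
    MPI_Get_address(pb,  &disps[1]);
    MPI_Type_create_struct(2, blens, disps, types, &wrapped);
    MPI_Type_commit(&wrapped);

    /* absolute displacements, hence the MPI_BOTTOM base */
    rc = MPI_Send(MPI_BOTTOM, 1, wrapped, dest, tag, comm);

    MPI_Type_free(&wrapped);
    return rc;
}
```

The per-message type create/commit/free is part of the overhead Oleg is measuring; a real implementation would presumably cache the wrapper types.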
Re: [OMPI users] openmpi credits for eager messages
So with an Isend your program becomes valid MPI and a very nice illustration of why the MPI standard cannot limit envelopes (or send/recv descriptors) and why at some point the number of descriptors can blow the limits. It also illustrates how the management of eager messages remains workable. (Not the same as affordable or appropriate. I agree it has serious scaling issues.)

Let's assume there is managed early-arrival space for 10 messages per sender. Each MPI_Isend generates an envelope that goes to the destination. For your program to unwind properly, every envelope must be delivered to the destination. The first (blocking) MPI_Recv is looking for the tag in the last envelope, so if libmpi does not deliver all 5000 envelopes per sender, the first MPI_Recv will block forever. It is not acceptable for a valid MPI program to deadlock. If the destination cannot hold all the envelopes there is no choice but to fail the job. The standard allows this. The Forum considered it to be better to fail a job than to deadlock it.

If each sender sends its first 10 messages eagerly, the send-side tokens will be used up and the buffer space at the destination will fill up but not overflow. The senders now fall back to rendezvous for their remaining 4990 MPI_Isends. The MPI_Isends cannot block. They send envelopes as fast as the loop can run, but the user send buffers involved cannot be altered until the waits occur. Once the last sent envelope with tag 5000 arrives and matches the posted MPI_Recv, an OK_to_send goes back to the sender and the data can be moved from the still-intact send buffer to the waiting receive buffer. The MPI_Waits for the Isend requests can be done in any order, but no send buffer can be changed until the corresponding MPI_Wait returns. No system buffer is needed for message data. The MPI_Recvs being posted in reverse order (5000, 4999, ..., 11) each ship an OK_to_send, and data flows directly from send to recv buffers. Finally the MPI_Recvs for tags (10, ..., 1) get posted and pull their message data from the early-arrival space. The program has unwound correctly, and as the early-arrival space frees up, credits can be returned to the sender.

Performance discussions aside - the semantic is clean and reliable.

Thanks - Dick

PS - If anyone responds to this I hope you will state clearly whether you want to talk about:
- What does the standard require? or
- What should the standard require?

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/04/2008 06:04:22 PM:

> Richard,
>
> You're absolutely right. What a shame :) If I had spent less time
> drawing the boxes around the code I might have noticed the typo. The
> Send should be an Isend.
>
> george.
>
> On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:
>
> > Hi George
> >
> > Sorry - This is not a valid MPI program. It violates the requirement
> > that a program not depend on there being any system buffering. See
> > page 32-33 of MPI 1.1.
> >
> > Let's simplify to:
> > Task 0:
> > MPI_Recv( from 1 with tag 1)
> > MPI_Recv( from 1 with tag 0)
> >
> > Task 1:
> > MPI_Send(to 0 with tag 0)
> > MPI_Send(to 0 with tag 1)
> >
> > Without any early arrival buffer (or with eager size set to 0), task
> > 0 will hang in the first MPI_Recv and never post a recv with tag 0.
> > Task 1 will hang in the MPI_Send with tag 0 because it cannot get
> > past it until the matching recv is posted by task 0.
> >
> > If there is enough early arrival buffer for the first MPI_Send on
> > task 1 to complete and the second MPI_Send to be posted, the example
> > will run. Once both sends are posted by task 1, task 0 will harvest
> > the second send and get out of its first recv. Task 0's second recv
> > can now pick up the message from the early arrival buffer where it
> > had to go to let task 1 complete send 1 and post send 2.
> >
> > If an application wants to do this kind of order inversion it should
> > use some non-blocking operations. For example, if task 0 posted an
> > MPI_Irecv for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait
> > for the Irecv, the example is valid.
> >
> > I am not aware of any case where the standard allows a correct MPI
> > program to be deadlocked by an implementation limit. It can be
> > failed if it exceeds a limit but I do not think it is ever OK to hang.
> >
> > Dick
> >
> > Dick Treumann - MPI Team/TCEM
> > IBM Systems & Technology Group
> > Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > Tele (845) 433-7846 Fax (845) 433-8363
> >
> > users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:
> >
> > > Please allow me to slightly modify your example. It still follows the
> > > rules from the MPI standard, so I think it's a 100% standard-compliant
> > > parallel application.
> > >
> > > +--
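Dick's nonblocking rewrite of the inverted-order example, spelled out as code (a sketch; the concrete buffers are added here for illustration):

```c
#include <mpi.h>

/* task 0 pre-posts the receive for tag 1, so completion no longer
 * depends on the implementation buffering the tag-0 message */
void task0(void)
{
    int a, b;
    MPI_Request req;

    MPI_Irecv(&a, 1, MPI_INT, 1, /* tag */ 1, MPI_COMM_WORLD, &req);
    MPI_Recv (&b, 1, MPI_INT, 1, /* tag */ 0, MPI_COMM_WORLD,
              MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

/* task 1 is unchanged: sends tag 0 first, then tag 1 */
void task1(void)
{
    int x = 0, y = 1;

    MPI_Send(&x, 1, MPI_INT, 0, /* tag */ 0, MPI_COMM_WORLD);
    MPI_Send(&y, 1, MPI_INT, 0, /* tag */ 1, MPI_COMM_WORLD);
}
```

With the Irecv posted first, task 1's tag-0 send always has a matching receive coming, even with the eager size set to 0.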
Re: [OMPI users] openmpi credits for eager messages
Wow, this sparked a much more heated discussion than I was expecting. I was just commenting that the behaviour the original author (Federico Sacerdoti) mentioned would explain something I observed in one of my early trials of OpenMPI. But anyway, because it seems that quite a few people were interested, I've attached a simplified version of the test I was describing (with all the timing checks and some of the crazier output removed). Now that I go back and retest this, it turns out that it wasn't actually a segfault that was killing it, but running out of memory, as you and others have predicted.

Brian W. Barrett brbarret-at-open-mpi.org |openmpi-users/Allow| wrote:
> Now that this discussion has gone way off into the MPI standard woods :).
>
> Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)?
> There was definitely a bug in 1.2.4 that could cause exactly the behavior
> you are describing when using the shared memory BTL, due to a silly
> delayed initialization bug/optimization.

I'm still using Open MPI 1.2.4, and actually the SM BTL seems to be the hardest to break (I guess I'm dodging the bullet on that delayed initialization bug you're referring to).

> If you are using the OB1 PML (the default), you will still have the
> possibility of running the receiver out of memory if the unexpected queue
> grows without bounds. I'll withhold my opinion on what the standard says
> so that we can perhaps actually help you solve your problem and stay out
> of the weeds :). Note however, that in general unexpected messages are a
> bad idea and thousands of them from one peer to another should be avoided
> at all costs -- this is just good MPI programming practice.

Actually I was expecting to break something with this test. I just wanted to find out where it broke. Lesson learned, I wrote my more serious programs doing exactly that (no unexpected messages). I was just surprised that the default Open MPI settings allowed me to flood the system so easily, whereas MPICH/MX still finished no matter what I threw at it (albeit with terrible performance in the bad cases).

> Now, if you are using MX, you can replicate MPICH/MX's behavior (including
> the very slow part) by using the CM PML (--mca pml cm on the mpirun
> command line), which will use the MX library message matching and
> unexpected queue and therefore behave exactly like MPICH/MX.

That works exactly as you described, and it does indeed prevent memory usage from going wild due to the unexpected messages. Thanks for your help! (and to the others for the educational discussion!)

> Brian
>
> On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote:
>
>> That would make sense. I was able to break OpenMPI by having Node A wait
>> for messages from Node B. Node B is in fact sleeping while Node C
>> bombards Node A with a few thousand messages. After a while Node B wakes
>> up and sends Node A the message it's been waiting on, but Node A has
>> long since been buried and seg faults. If I decrease the number of
>> messages C is sending, it works properly. This was on OpenMPI 1.2.4
>> (using I think the SM BTL; might have been MX or TCP, but certainly not
>> infiniband). I could dig up the test and try again if anyone is
>> seriously curious.
>>
>> Trying the same test on MPICH/MX went very very slow (I don't think they
>> have any clever buffer management) but it didn't crash.
>>
>> Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
>> |openmpi-users/Allow| wrote:
>>> Hi,
>>>
>>> I am readying an openmpi 1.2.5 software stack for use with a
>>> many-thousand core cluster. I have a question about sending small
>>> messages that I hope can be answered on this list.
>>>
>>> I was under the impression that if node A wants to send a small MPI
>>> message to node B, it must have a credit to do so. The credit assures A
>>> that B has enough buffer space to accept the message.
>>> Credits are required by the mpi layer regardless of the BTL transport
>>> layer used.
>>>
>>> I have been told by a Voltaire tech that this is not so, the credits
>>> are used by the infiniband transport layer to reliably send a message,
>>> and it is not an openmpi feature.
>>>
>>> Thanks,
>>> Federico

--
--Kris
叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]

#include <mpi.h>
#include <iostream>
#include <cstdlib> //for atoi (in case someone doesn't have boost)

const int buflen=5000;

int main(int argc, char *argv[])
{
    using namespace std;
    int reps=1000;
    if(argc>1){ //optionally specify number of repeats on the command line
        reps=atoi(argv[1]);
    }
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    M
Re: [OMPI users] MPI piggyback mechanism
Oleg,

Is there an implementation in Open MPI of your techniques? Can we put our greedy nasty pawns on it?

Thanks for the link, Josh.

Aurelien

Le 5 févr. 08 à 08:39, Josh Hursey a écrit :

Oleg,

Interesting work. You mentioned late in your email that you believe that adding support for piggybacking to the MPI standard would be the best solution. As you may know, the MPI Forum has reconvened and there is a working group for Fault Tolerance. This working group is discussing a piggybacking interface proposal for the standard, amongst other things. If you are interested in contributing to this conversation you can find the mailing list here: http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft

Best,
Josh

On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote:

Hi,

I've been working on an MPI piggyback technique as a part of my PhD work.

Although MPI does not provide native support, there are several different solutions to transmit piggyback data over every MPI communication. You may find a brief overview in papers [1, 2]. This includes copying the original message and the extra data to a bigger buffer, sending an additional message, or changing the sendtype to a dynamically created wrapper datatype that contains a pointer to the original data and the piggyback data. I have tried all mechanisms and they work, but considering the overhead, there is no "the best" technique that outperforms the others in all scenarios. Jeff Squyres had interesting comments on this subject before (in this mailing list).

Finally, after some benchmarking, I have implemented a hybrid technique that combines existing mechanisms. For small point-to-point messages, datatype wrapping seems to be the least intrusive, at least considering the Open MPI implementation of derived datatypes. For large point-to-point messages, experiments confirmed that sending an additional message is much cheaper than wrapping (and besides, the intrusion is small as we are already sending a large message).
Moreover, the implementation may interleave the original send with an asynchronous send of the piggyback data. This optimization partially hides the latency of the additional send and lowers overall intrusion.

The same criteria can be applied to collective operations, except barrier and reduce operations. As the former does not transmit any data and the latter transforms the data, the only solution is to send additional messages.

There is a penalty of course. Especially for collective operations with very small messages the intrusion may reach 15%, and that's a lot. It then decreases down to 0.1% for bigger messages, but anyway it's still there. I don't know what are your requirements/expectations for that issue. The only work that reported lower overheads is [3], but they added native piggyback support by changing the underlying MPI implementation.

I think the best possible option is to add piggyback support to MPI as a part of the standard. A growing number of runtime tools use this functionality for multiple reasons, and certainly PMPI itself is not enough.

References of interest:

[1] Shende, S., Malony, A., Morris, A., Wolf, F. "Performance Profiling Overhead Compensation for MPI Programs". 12th EuroPVM-MPI Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various techniques and come up with datatype wrapping.

[2] Schulz, M., "Extracting Critical Path Graphs from MPI Applications". Cluster Computing 2005, IEEE International, pp. 1-10, September 2005. They use datatype wrapping.

[3] Jeffrey Vetter, "Dynamic Statistical Profiling of Communication Activity in Distributed Applications". They add support for piggyback at the MPI implementation level and report very low overheads (no surprise).

Regards,
Oleg Morajko

On Feb 1, 2008 5:08 PM, Aurélien Bouteiller wrote:

I don't know of any work in that direction for now. Indeed, we plan to eventually integrate at least causal message logging in the pml-v, which also includes piggybacking.
Therefore we are open for collaboration with you on this matter. Please let us know :) Aurelien

On 1 Feb 2008, at 09:51, Thomas Ropars wrote: Hi, I'm currently working on optimistic message logging and I would like to implement an optimistic message logging protocol in OpenMPI. Optimistic message logging protocols piggyback information about dependencies between processes on the application messages, in order to be able to find a consistent global state after a failure. That's why I'm interested in the problem of piggybacking information on MPI messages. Is there any work on this problem at the moment? Has anyone already implemented mechanisms in OpenMPI to piggyback data on MPI messages? Regards, Thomas

Oleg Morajko wrote: Hi, I'm developing a causality chain tracking library and need a mechanism to attach extra data to every MPI message, a so-called piggyback mechanism. As far as I know there are a few solutions to this problem, of which the two fundamental ones are the following
Re: [OMPI users] MPI piggyback mechanism
Oleg, Interesting work. You mentioned late in your email that you believe that adding support for piggybacking to the MPI standard would be the best solution. As you may know, the MPI Forum has reconvened and there is a working group for Fault Tolerance. This working group is discussing a piggybacking interface proposal for the standard, amongst other things. If you are interested in contributing to this conversation you can find the mailing list here: http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft Best, Josh

On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote: Hi, I've been working on an MPI piggyback technique as a part of my PhD work. Although MPI does not provide native support, there are several different solutions for transmitting piggyback data with every MPI communication. You may find a brief overview in papers [1, 2]. These include copying the original message and the extra data into a bigger buffer, sending an additional message, or changing the sendtype to a dynamically created wrapper datatype that contains a pointer to the original data and the piggyback data. I have tried all of these mechanisms and they work, but considering the overhead, there is no single best technique that outperforms the others in all scenarios. Jeff Squyres had interesting comments on this subject before (in this mailing list). Finally, after some benchmarking, I have implemented a hybrid technique that combines the existing mechanisms. For small point-to-point messages, datatype wrapping seems to be the least intrusive, at least considering the Open MPI implementation of derived datatypes. For large point-to-point messages, experiments confirmed that sending an additional message is much cheaper than wrapping (and besides, the intrusion is small since we are already sending a large message). Moreover, the implementation may interleave the original send with an asynchronous send of the piggyback data. This optimization partially hides the latency of the additional send and lowers the overall intrusion.
The same criteria can be applied to collective operations, except for barrier and reduce operations. As the former does not transmit any data and the latter transforms the data, the only solution there is to send additional messages. There is a penalty, of course. Especially for collective operations with very small messages, the intrusion may reach 15%, and that's a lot. It then decreases down to 0.1% for bigger messages, but it is still there. I don't know what your requirements/expectations are for that issue. The only work that reported lower overheads is [3], but they added native piggyback support by changing the underlying MPI implementation. I think the best possible option is to add piggyback support to MPI as a part of the standard. A growing number of runtime tools use this functionality for multiple reasons, and PMPI by itself is certainly not enough.

References of interest:
- [1] Shende, S., Malony, A., Morris, A., Wolf, F. "Performance Profiling Overhead Compensation for MPI Programs". 12th EuroPVM-MPI Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various techniques and settle on datatype wrapping.
- [2] Schulz, M. "Extracting Critical Path Graphs from MPI Applications". Cluster Computing 2005, IEEE International, pp. 1-10, September 2005. They use datatype wrapping.
- [3] Jeffrey Vetter. "Dynamic Statistical Profiling of Communication Activity in Distributed Applications". They add piggyback support at the MPI implementation level and report very low overheads (no surprise).

Regards, Oleg Morajko

On Feb 1, 2008 5:08 PM, Aurélien Bouteiller wrote: I don't know of any work in that direction for now. Indeed, we plan to eventually integrate at least causal message logging in the pml-v, which also includes piggybacking. Therefore we are open for collaboration with you on this matter. Please let us know :) Aurelien

Le 1 févr.
08 à 09:51, Thomas Ropars a écrit : Hi, I'm currently working on optimistic message logging and I would like to implement an optimistic message logging protocol in OpenMPI. Optimistic message logging protocols piggyback information about dependencies between processes on the application messages, in order to be able to find a consistent global state after a failure. That's why I'm interested in the problem of piggybacking information on MPI messages. Is there any work on this problem at the moment? Has anyone already implemented mechanisms in OpenMPI to piggyback data on MPI messages? Regards, Thomas

Oleg Morajko wrote: Hi, I'm developing a causality chain tracking library and need a mechanism to attach extra data to every MPI message, a so-called piggyback mechanism. As far as I know there are a few solutions to this problem, of which the two fundamental ones are the following: * Dynamic datatype wrapping - if a user calls MPI_Send with, let's say, 1024 doubles, the wrapped send call implemen
Re: [OMPI users] openmpi credits for eager messages
On Tue, Feb 05, 2008 at 08:07:59AM -0500, Richard Treumann wrote: > There is no misunderstanding of the MPI standard or the definition of > blocking in the bug3 example. Both bug3 and the example I provided are > valid MPI. > > As you say, blocking means the send buffer can be reused when MPI_Send > returns. This is exactly what bug3 is counting on. > > MPI is a reliable protocol, which means that once MPI_Send returns, the > application can assume the message will be delivered once a matching recv > is posted. There are only two ways I can think of for MPI to keep that > guarantee. > 1) Before returning from MPI_Send, copy the envelope and data to some buffer > that will be preserved until the MPI_Recv gets posted. > 2) Delay the return from MPI_Send until the MPI_Recv is posted, and then > move data from the intact send buffer to the posted receive buffer. Return > from MPI_Send. > > The requirement in the standard is that if libmpi takes option 1, the > return from MPI_Send cannot occur unless there is certainty the buffer > space exists. Libmpi cannot throw the message over the wall and fail the > job if it cannot be buffered.

As I said, Open MPI has flow control at the transport layer to prevent messages from being dropped by the network. This mechanism should allow a program like yours to work, but bug3 is another story, because it generates a huge amount of unexpected messages, and Open MPI has no mechanism to prevent unexpected messages from blowing up memory consumption. Your point is that according to the MPI spec this is not valid behaviour. I am not going to argue with that, especially as you can get the desired behaviour by setting the eager limit to zero.

> users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM: > > > On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote: > > > Bug3 is a test-case derived from a real, scalable application (desmond > > > for molecular dynamics) that several experienced MPI developers have > > > worked on.
Note the MPI_Send calls of processes N>0 are *blocking*; the > > > openmpi silently sends them in the background and overwhelms process 0 > > > due to lack of flow control. > > MPI_Send is *blocking* in the MPI sense of the word, i.e. when MPI_Send returns > > the send buffer can be reused. MPI spec section 3.4. > > -- > > Gleb. > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Gleb.
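Gleb's suggestion of "setting the eager limit to zero" is done through per-BTL MCA parameters. A hedged sketch follows; the exact parameter names depend on which BTLs are in use and on the Open MPI version, so check `ompi_info --param btl all` on your installation first:

```shell
# Force the rendezvous protocol for every message size by zeroing the
# eager limits of the BTLs in use (parameter names are version-dependent).
mpirun -np 16 \
  --mca btl_tcp_eager_limit 0 \
  --mca btl_sm_eager_limit 0 \
  ./bug3
```

With no eager sends, a producer's MPI_Send cannot complete until the consumer posts a matching receive, so the unexpected-message queue at rank 0 can no longer grow without bound.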
Re: [OMPI users] openmpi credits for eager messages
Hi Gleb,

There is no misunderstanding of the MPI standard or the definition of blocking in the bug3 example. Both bug3 and the example I provided are valid MPI. As you say, blocking means the send buffer can be reused when MPI_Send returns. This is exactly what bug3 is counting on. MPI is a reliable protocol, which means that once MPI_Send returns, the application can assume the message will be delivered once a matching recv is posted. There are only two ways I can think of for MPI to keep that guarantee. 1) Before returning from MPI_Send, copy the envelope and data to some buffer that will be preserved until the MPI_Recv gets posted. 2) Delay the return from MPI_Send until the MPI_Recv is posted, and then move data from the intact send buffer to the posted receive buffer. Return from MPI_Send. The requirement in the standard is that if libmpi takes option 1, the return from MPI_Send cannot occur unless there is certainty the buffer space exists. Libmpi cannot throw the message over the wall and fail the job if it cannot be buffered.

Dick

Dick Treumann - MPI Team/TCEM IBM Systems & Technology Group Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM: > On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote: > > Bug3 is a test-case derived from a real, scalable application (desmond > > for molecular dynamics) that several experienced MPI developers have > > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the > > openmpi silently sends them in the background and overwhelms process 0 > > due to lack of flow control. > MPI_Send is *blocking* in the MPI sense of the word, i.e. when MPI_Send returns > the send buffer can be reused. MPI spec section 3.4. > > -- > Gleb.
Re: [OMPI users] MPI piggyback mechanism
Hi, I've been working on an MPI piggyback technique as a part of my PhD work. Although MPI does not provide native support, there are several different solutions for transmitting piggyback data with every MPI communication. You may find a brief overview in papers [1, 2]. These include copying the original message and the extra data into a bigger buffer, sending an additional message, or changing the sendtype to a dynamically created wrapper datatype that contains a pointer to the original data and the piggyback data. I have tried all of these mechanisms and they work, but considering the overhead, there is no single best technique that outperforms the others in all scenarios. Jeff Squyres had interesting comments on this subject before (in this mailing list). Finally, after some benchmarking, I have implemented a hybrid technique that combines the existing mechanisms. For small point-to-point messages, datatype wrapping seems to be the least intrusive, at least considering the Open MPI implementation of derived datatypes. For large point-to-point messages, experiments confirmed that sending an additional message is much cheaper than wrapping (and besides, the intrusion is small since we are already sending a large message). Moreover, the implementation may interleave the original send with an asynchronous send of the piggyback data. This optimization partially hides the latency of the additional send and lowers the overall intrusion. The same criteria can be applied to collective operations, except for barrier and reduce operations. As the former does not transmit any data and the latter transforms the data, the only solution there is to send additional messages. There is a penalty, of course. Especially for collective operations with very small messages, the intrusion may reach 15%, and that's a lot. It then decreases down to 0.1% for bigger messages, but it is still there. I don't know what your requirements/expectations are for that issue.
The only work that reported lower overheads is [3], but they added native piggyback support by changing the underlying MPI implementation. I think the best possible option is to add piggyback support to MPI as a part of the standard. A growing number of runtime tools use this functionality for multiple reasons, and PMPI by itself is certainly not enough.

References of interest:
- [1] Shende, S., Malony, A., Morris, A., Wolf, F. "Performance Profiling Overhead Compensation for MPI Programs". 12th EuroPVM-MPI Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various techniques and settle on datatype wrapping.
- [2] Schulz, M. "Extracting Critical Path Graphs from MPI Applications". Cluster Computing 2005, IEEE International, pp. 1-10, September 2005. They use datatype wrapping.
- [3] Jeffrey Vetter. "Dynamic Statistical Profiling of Communication Activity in Distributed Applications". They add piggyback support at the MPI implementation level and report very low overheads (no surprise).

Regards, Oleg Morajko

On Feb 1, 2008 5:08 PM, Aurélien Bouteiller wrote: > I don't know of any work in that direction for now. Indeed, we plan to > eventually integrate at least causal message logging in the pml-v, > which also includes piggybacking. Therefore we are open for > collaboration with you on this matter. Please let us know :) > > Aurelien > > Le 1 févr. 08 à 09:51, Thomas Ropars a écrit : > > > Hi, > > > > I'm currently working on optimistic message logging and I would like > > to implement an optimistic message logging protocol in OpenMPI. > > Optimistic message logging protocols piggyback information about dependencies > > between processes on the application messages, in order to be able to find a > > consistent global state after a failure. That's why I'm interested in > > the problem of piggybacking information on MPI messages. > > > > Is there any work on this problem at the moment?
> > Has anyone already implemented some mechanisms in OpenMPI to piggyback > > data on MPI messages? > > > > Regards, > > > > Thomas > > > > Oleg Morajko wrote: > >> Hi, > >> > >> I'm developing a causality chain tracking library and need a > >> mechanism to attach extra data to every MPI message, a so-called > >> piggyback mechanism. > >> > >> As far as I know there are a few solutions to this problem, of which > >> the two fundamental ones are the following: > >> > >> * Dynamic datatype wrapping - if a user calls MPI_Send with, let's say, > >> 1024 doubles, the wrapped send call implementation dynamically > >> creates a derived datatype that is a structure composed of a > >> pointer to the 1024 doubles and extra fields to be piggybacked. The > >> datatype is constructed with absolute addresses to avoid copying > >> the original buffer. The receiver's side creates the equivalent > >> datatype to receive the original data and the extra data. The > >> performance of this solution depends on how good the derived > >> datatype handling is, but seems to
Re: [OMPI users] openmpi credits for eager messages
On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote: > Bug3 is a test-case derived from a real, scalable application (desmond > for molecular dynamics) that several experienced MPI developers have > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the > openmpi silently sends them in the background and overwhelms process 0 > due to lack of flow control. MPI_Send is *blocking* in the MPI sense of the word, i.e. when MPI_Send returns the send buffer can be reused. MPI spec section 3.4. -- Gleb.