Re: [OMPI users] openmpi credits for eager messages
What I missed in this whole conversation is that the pieces of text that Ron and Dick are citing are *on the same page* in the MPI spec; they're not disparate parts of the spec that accidentally overlap in discussion scope. Specifically, it says:

   Resource limitations

   Any pending communication operation consumes system resources that are
   limited. Errors may occur when lack of resources prevent the execution
   of an MPI call. A quality implementation will use a (small) fixed
   amount of resources for each pending send in the ready or synchronous
   mode and for each pending receive. However, buffer space may be
   consumed to store messages sent in standard mode, and must be consumed
   to store messages sent in buffered mode, when no matching receive is
   available. The amount of space available for buffering will be much
   smaller than program data memory on many systems. Then, it will be
   easy to write programs that overrun available buffer space.

...12 lines down on that page, on the same page, in the same section...

   Consider a situation where a producer repeatedly produces new values
   and sends them to a consumer. Assume that the producer produces new
   values faster than the consumer can consume them.

...skip 2 sentences about buffered sends...

   If standard sends are used, then the producer will be automatically
   throttled, as its send operations will block when buffer space is
   unavailable.

I find that to be unambiguous.

1. A loop of MPI_ISENDs on a producer can cause a malloc failure (can't malloc a new MPI_Request), and that's an error. Tough luck.

2. A loop of MPI_SENDs on a producer can run a slow-but-MPI-active consumer out of buffer space if all the incoming messages are queued up (e.g., in the unexpected queue). The language above is pretty clear about this: MPI_SEND on the producer is supposed to block at this point.
FWIW: Open MPI does support this mode of operation, as George and Gleb noted (by setting the eager size to 0, thereby forcing *all* sends to be synchronous -- a producer cannot "run ahead" for a while and eventually be throttled when receive buffering is exhausted), but a) it's not the default, and b) it's not documented this way.

On Feb 4, 2008, at 1:29 PM, Richard Treumann wrote:

Hi Ron -

I am well aware of the scaling problems related to the standard send requirements in MPI. It is a very difficult issue. However, here is what the standard says: MPI 1.2, page 32 lines 29-37

=== a standard send operation that cannot complete because of lack of buffer space will merely block, waiting for buffer space to become available or for a matching receive to be posted. This behavior is preferable in many situations. Consider a situation where a producer repeatedly produces new values and sends them to a consumer. Assume that the producer produces new values faster than the consumer can consume them. If buffered sends are used, then a buffer overflow will result. Additional synchronization has to be added to the program so as to prevent this from occurring. If standard sends are used, then the producer will be automatically throttled, as its send operations will block when buffer space is unavailable.

If there are people who want to argue that this is unclear or that it should be changed, the MPI Forum can and should take up the discussion. I think this particular wording is pretty clear.

The piece of MPI standard wording you quote is somewhat ambiguous:

   The amount of space available for buffering will be much smaller than
   program data memory on many systems. Then, it will be easy to write
   programs that overrun available buffer space.

But note that this wording mentions a problem that an application can create but does not say the MPI implementation can fail the job. The language I have pointed to is where the standard says what the MPI implementation must do.
The "lack of resource" statement is more about send and receive descriptors than buffer space. If I write a program with an infinite loop of MPI_IRECV postings, the standard allows that to fail.

Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/04/2008 12:24:11 PM:

> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> > [...]
>
> The MPI
Re: [OMPI users] openmpi credits for eager messages
Ron's comments are probably dead on for an application like bug3. If bug3 is long running and libmpi is doing eager protocol buffer management as I contend the standard requires, then the producers will not get far ahead of the consumer before they are forced to synchronous send under the covers anyway. From then on, producers will run no faster than their output can be absorbed. They will spend the nonproductive parts of their time blocked on either MPI_Send or MPI_Ssend. The job will not finish until the consumer finishes because the consumer is a constant bottleneck anyway. The slow consumer is the major drag on scalability. As long as the producers can be expected to outrun the consumer for the life of the job, you will probably find it hard to measure a difference between synchronous send and flow controlled standard send.

Eager protocol gets more interesting when the pace of the consumer and of the producers is variable. If the consumer can absorb a message per millisecond and the production rate is close to one message per millisecond but fluctuates a bit, then eager protocol may speed the whole job significantly. The producers can never get ahead with synchronous send, even in a phase when they might be able to create a message every 1/2 millisecond. The producers spend half of this easy phase blocked in MPI_Ssend. If the producers now enter a compute intensive phase where messages can only be generated once every 2 milliseconds, the consumer spends time idle. If the consumer had been able to accumulate queued messages with eager protocol when the producers were able to run faster, it could now make itself useful catching up. Both producers and consumer would come closer to 100% productive work and the job would finish sooner.
Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/05/2008 01:26:24 PM:

> > Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> > reasonable memory usage and the execution proceeds normally.
> >
> > Re scalable: One second. I know well bug3 is not scalable, and when to
> > use MPI_Isend. The point is programmers want to count on the MPI spec as
> > written, as Richard pointed out. We want to send small messages quickly
> > and efficiently, without the danger of overloading the receiver's
> > resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().
>
> Your last statement is not necessarily true. By synchronizing processes
> using MPI_Ssend(), you can potentially avoid large numbers of unexpected
> messages that need to be buffered and copied, and that also need to be
> searched every time a receive is posted. There is no guarantee that the
> protocol overhead on each message incurred with MPI_Ssend() slows down an
> application more than the buffering, copying, and searching overhead of a
> large number of unexpected messages.
>
> It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong
> micro-benchmarks, but the length of the unexpected message queue doesn't
> have to get very long before they are about the same.
>
> > Since identifying this behavior we have implemented the desired flow
> > control in our application.
>
> It would be interesting to see performance results comparing doing flow
> control in the application versus having MPI do it for you.
>
> -Ron
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openmpi credits for eager messages
> Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> reasonable memory usage and the execution proceeds normally.
>
> Re scalable: One second. I know well bug3 is not scalable, and when to
> use MPI_Isend. The point is programmers want to count on the MPI spec as
> written, as Richard pointed out. We want to send small messages quickly
> and efficiently, without the danger of overloading the receiver's
> resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().

Your last statement is not necessarily true. By synchronizing processes using MPI_Ssend(), you can potentially avoid large numbers of unexpected messages that need to be buffered and copied, and that also need to be searched every time a receive is posted. There is no guarantee that the protocol overhead on each message incurred with MPI_Ssend() slows down an application more than the buffering, copying, and searching overhead of a large number of unexpected messages.

It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong micro-benchmarks, but the length of the unexpected message queue doesn't have to get very long before they are about the same.

> Since identifying this behavior we have implemented the desired flow
> control in our application.

It would be interesting to see performance results comparing doing flow control in the application versus having MPI do it for you.

-Ron
Re: [OMPI users] openmpi credits for eager messages
So with an Isend your program becomes valid MPI and a very nice illustration of why the MPI standard cannot limit envelopes (or send/recv descriptors) and why at some point the number of descriptors can blow the limits. It also illustrates how the management of eager messages remains workable. (Not the same as affordable or appropriate. I agree it has serious scaling issues.)

Let's assume there is managed early arrival space for 10 messages per sender. Each MPI_Isend generates an envelope that goes to the destination. For your program to unwind properly, every envelope must be delivered to the destination. The first (blocking) MPI_Recv is looking for the tag in the last envelope, so if libmpi does not deliver all 5000 envelopes per sender, the first MPI_Recv will block forever. It is not acceptable for a valid MPI program to deadlock. If the destination cannot hold all the envelopes there is no choice but to fail the job. The standard allows this. The Forum considered it to be better to fail a job than to deadlock it.

If each sender sends its first 10 messages eagerly, the send side tokens will be used up and the buffer space at the destination will fill up but not overflow. The senders now fall back to rendezvous for their remaining 4990 MPI_Isends. The MPI_Isends cannot block. They send envelopes as fast as the loop can run, but the user send buffers involved cannot be altered until the waits occur. Once the last sent envelope with tag 5000 arrives and matches the posted MPI_Recv, an OK_to_send goes back to the sender and the data can be moved from the still intact send buffer to the waiting receive buffer. The MPI_Waits for the Isend requests can be done in any order, but no send buffer can be changed until the corresponding MPI_Wait returns. No system buffer is needed for message data. The MPI_Recvs being posted in reverse order (5000, 4999, ..., 11) each ship OK_to_send, and data flows directly from send to recv buffers. Finally the MPI_Recvs for tags (10 ...
1) get posted and pull their message data from the early arrival space. The program has unwound correctly, and as the early arrival space frees up, credits can be returned to the sender.

Performance discussions aside -- the semantic is clean and reliable.

Thanks - Dick

PS - If anyone responds to this I hope you will state clearly whether you want to talk about:
- What does the standard require? or
- What should the standard require?

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/04/2008 06:04:22 PM:

> Richard,
>
> You're absolutely right. What a shame :) If I had spent less time
> drawing the boxes around the code I might have noticed the typo. The
> Send should be an Isend.
>
> george.
>
> On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:
>
> > Hi George
> >
> > Sorry - This is not a valid MPI program. It violates the requirement
> > that a program not depend on there being any system buffering. See
> > page 32-33 of MPI 1.1.
> >
> > Let's simplify to:
> > Task 0:
> > MPI_Recv( from 1 with tag 1)
> > MPI_Recv( from 1 with tag 0)
> >
> > Task 1:
> > MPI_Send(to 0 with tag 0)
> > MPI_Send(to 0 with tag 1)
> >
> > Without any early arrival buffer (or with eager size set to 0) task
> > 0 will hang in the first MPI_Recv and never post a recv with tag 0.
> > Task 1 will hang in the MPI_Send with tag 0 because it cannot get
> > past it until the matching recv is posted by task 0.
> >
> > If there is enough early arrival buffer for the first MPI_Send on
> > task 1 to complete and the second MPI_Send to be posted, the example
> > will run. Once both sends are posted by task 1, task 0 will harvest
> > the second send and get out of its first recv. Task 0's second recv
> > can now pick up the message from the early arrival buffer where it
> > had to go to let task 1 complete send 1 and post send 2.
> >
> > If an application wants to do this kind of order inversion it should
> > use some non blocking operations. For example, if task 0 posted an
> > MPI_Irecv for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait
> > for the Irecv, the example is valid.
> >
> > I am not aware of any case where the standard allows a correct MPI
> > program to be deadlocked by an implementation limit. It can be
> > failed if it exceeds a limit but I do not think it is ever OK to hang.
> >
> > Dick
> >
> > Dick Treumann - MPI Team/TCEM
> > IBM Systems & Technology Group
> > Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > Tele (845) 433-7846 Fax (845) 433-8363
> >
> > users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:
> >
> > > Please allow me to slightly modify your example. It still follows the
> > > rules from the MPI standard, so I think it's a 100% standard compliant
> > > parallel application.
Re: [OMPI users] openmpi credits for eager messages
Wow, this sparked a much more heated discussion than I was expecting. I was just commenting that the behaviour the original author (Federico Sacerdoti) mentioned would explain something I observed in one of my early trials of OpenMPI. But anyway, because it seems that quite a few people were interested, I've attached a simplified version of the test I was describing (with all the timing checks and some of the crazier output removed). Now that I go back and retest this, it turns out that it wasn't actually a segfault that was killing it, but running out of memory, as you and others have predicted.

Brian W. Barrett brbarret-at-open-mpi.org wrote:
> Now that this discussion has gone way off into the MPI standard woods :).
>
> Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)?
> There was definitely a bug in 1.2.4 that could cause exactly the behavior
> you are describing when using the shared memory BTL, due to a silly
> delayed initialization bug/optimization.

I'm still using Open MPI 1.2.4, and actually the SM BTL seems to be the hardest to break (I guess I'm dodging the bullet on that delayed initialization bug you're referring to).

> If you are using the OB1 PML (the default), you will still have the
> possibility of running the receiver out of memory if the unexpected queue
> grows without bounds. I'll withhold my opinion on what the standard says
> so that we can perhaps actually help you solve your problem and stay out
> of the weeds :). Note however, that in general unexpected messages are a
> bad idea and thousands of them from one peer to another should be avoided
> at all costs -- this is just good MPI programming practice.

Actually I was expecting to break something with this test. I just wanted to find out where it broke. Lesson learned, I wrote my more serious programs doing exactly that (no unexpected messages).
I was just surprised that the default Open MPI settings allowed me to flood the system so easily, whereas MPICH/MX still finished no matter what I threw at it (albeit with terrible performance in the bad cases).

> Now, if you are using MX, you can replicate MPICH/MX's behavior (including
> the very slow part) by using the CM PML (--mca pml cm on the mpirun
> command line), which will use the MX library message matching and
> unexpected queue and therefore behave exactly like MPICH/MX.

That works exactly as you described, and it does indeed prevent memory usage from going wild due to the unexpected messages. Thanks for your help! (and to the others for the educational discussion!)

>
> Brian
>
> On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote:
>
>> That would make sense. I was able to break OpenMPI by having Node A wait for
>> messages from Node B. Node B is in fact sleeping while Node C bombards
>> Node A with a few thousand messages. After a while Node B wakes up and
>> sends Node A the message it's been waiting on, but Node A has long since
>> been buried and seg faults. If I decrease the number of messages C is
>> sending, it works properly. This was on OpenMPI 1.2.4 (using, I think, the
>> SM BTL -- might have been MX or TCP, but certainly not infiniband; I could
>> dig up the test and try again if anyone is seriously curious).
>>
>> Trying the same test on MPICH/MX went very very slow (I don't think they
>> have any clever buffer management) but it didn't crash.
>>
>> Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com wrote:
>>> Hi,
>>>
>>> I am readying an openmpi 1.2.5 software stack for use with a
>>> many-thousand core cluster. I have a question about sending small
>>> messages that I hope can be answered on this list.
>>>
>>> I was under the impression that if node A wants to send a small MPI
>>> message to node B, it must have a credit to do so. The credit assures A
>>> that B has enough buffer space to accept the message.
Credits are
>>> required by the mpi layer regardless of the BTL transport layer used.
>>>
>>> I have been told by a Voltaire tech that this is not so, the credits are
>>> used by the infiniband transport layer to reliably send a message, and
>>> are not an openmpi feature.
>>>
>>> Thanks,
>>> Federico

--
--Kris
叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]

#include <mpi.h>
#include <iostream>
#include <cstdlib> // for atoi (in case someone doesn't have boost)

const int buflen = 5000;

int main(int argc, char *argv[])
{
    using namespace std;
    int reps = 1000;
    if (argc > 1) { // optionally specify number of repeats on the command line
        reps = atoi(argv[1]);
    }
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
Re: [OMPI users] openmpi credits for eager messages
On Tue, Feb 05, 2008 at 08:07:59AM -0500, Richard Treumann wrote:
> There is no misunderstanding of the MPI standard or the definition of
> blocking in the bug3 example. Both bug3 and the example I provided are
> valid MPI.
>
> As you say, blocking means the send buffer can be reused when the MPI_Send
> returns. This is exactly what bug3 is counting on.
>
> MPI is a reliable protocol, which means that once MPI_Send returns, the
> application can assume the message will be delivered once a matching recv
> is posted. There are only two ways I can think of for MPI to keep that
> guarantee.
> 1) Before return from MPI_Send, copy the envelope and data to some buffer
> that will be preserved until the MPI_Recv gets posted
> 2) Delay the return from MPI_Send until the MPI_Recv is posted and then
> move data from the intact send buffer to the posted receive buffer. Return
> from MPI_Send.
>
> The requirement in the standard is that if libmpi takes option 1, the
> return from MPI_Send cannot occur unless there is certainty the buffer
> space exists. Libmpi cannot throw the message over the wall and fail the
> job if it cannot be buffered.

As I said, Open MPI has flow control on the transport layer to prevent messages from being dropped by the network. This mechanism should allow a program like yours to work, but bug3 is another story, because it generates a huge amount of unexpected messages and Open MPI has no mechanism to prevent unexpected messages from blowing up memory consumption. Your point is that according to the MPI spec this is not valid behaviour. I am not going to argue with that, especially as you can get the desired behaviour by setting the eager limit to zero.

> users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM:
>
> > On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> > > Bug3 is a test-case derived from a real, scalable application (desmond
> > > for molecular dynamics) that several experienced MPI developers have
> > > worked on.
Note the MPI_Send calls of processes N>0 are *blocking*; the
> > > openmpi silently sends them in the background and overwhelms process 0
> > > due to lack of flow control.
> >
> > MPI_Send is *blocking* in the MPI sense of the word, i.e., when MPI_Send
> > returns the send buffer can be reused. MPI spec section 3.4.
> >
> > --
> > Gleb.

--
Gleb.
Re: [OMPI users] openmpi credits for eager messages
Hi Gleb

There is no misunderstanding of the MPI standard or the definition of blocking in the bug3 example. Both bug3 and the example I provided are valid MPI.

As you say, blocking means the send buffer can be reused when the MPI_Send returns. This is exactly what bug3 is counting on.

MPI is a reliable protocol, which means that once MPI_Send returns, the application can assume the message will be delivered once a matching recv is posted. There are only two ways I can think of for MPI to keep that guarantee.

1) Before return from MPI_Send, copy the envelope and data to some buffer that will be preserved until the MPI_Recv gets posted

2) Delay the return from MPI_Send until the MPI_Recv is posted and then move data from the intact send buffer to the posted receive buffer. Return from MPI_Send.

The requirement in the standard is that if libmpi takes option 1, the return from MPI_Send cannot occur unless there is certainty the buffer space exists. Libmpi cannot throw the message over the wall and fail the job if it cannot be buffered.

Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM:

> On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> > Bug3 is a test-case derived from a real, scalable application (desmond
> > for molecular dynamics) that several experienced MPI developers have
> > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> > openmpi silently sends them in the background and overwhelms process 0
> > due to lack of flow control.
>
> MPI_Send is *blocking* in the MPI sense of the word, i.e., when MPI_Send
> returns the send buffer can be reused. MPI spec section 3.4.
>
> --
> Gleb.
Re: [OMPI users] openmpi credits for eager messages
On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> Bug3 is a test-case derived from a real, scalable application (desmond
> for molecular dynamics) that several experienced MPI developers have
> worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> openmpi silently sends them in the background and overwhelms process 0
> due to lack of flow control.

MPI_Send is *blocking* in the MPI sense of the word, i.e., when MPI_Send returns the send buffer can be reused. MPI spec section 3.4.

--
Gleb.
Re: [OMPI users] openmpi credits for eager messages
Richard,

You're absolutely right. What a shame :) If I had spent less time drawing the boxes around the code I might have noticed the typo. The Send should be an Isend.

george.

On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:

Hi George

Sorry - This is not a valid MPI program. It violates the requirement that a program not depend on there being any system buffering. See page 32-33 of MPI 1.1.

Let's simplify to:

Task 0:
MPI_Recv( from 1 with tag 1)
MPI_Recv( from 1 with tag 0)

Task 1:
MPI_Send(to 0 with tag 0)
MPI_Send(to 0 with tag 1)

Without any early arrival buffer (or with eager size set to 0) task 0 will hang in the first MPI_Recv and never post a recv with tag 0. Task 1 will hang in the MPI_Send with tag 0 because it cannot get past it until the matching recv is posted by task 0.

If there is enough early arrival buffer for the first MPI_Send on task 1 to complete and the second MPI_Send to be posted, the example will run. Once both sends are posted by task 1, task 0 will harvest the second send and get out of its first recv. Task 0's second recv can now pick up the message from the early arrival buffer where it had to go to let task 1 complete send 1 and post send 2.

If an application wants to do this kind of order inversion it should use some non blocking operations. For example, if task 0 posted an MPI_Irecv for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait for the Irecv, the example is valid.

I am not aware of any case where the standard allows a correct MPI program to be deadlocked by an implementation limit. It can be failed if it exceeds a limit but I do not think it is ever OK to hang.

Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:

> Please allow me to slightly modify your example.
It still follows the
> rules from the MPI standard, so I think it's a 100% standard compliant
> parallel application.
>
> task 0:
>   MPI_Init()
>   sleep(3000)
>   for( msg = 0; msg < 5000; msg++ ) {
>     for( peer = 0; peer < com_size; peer++ ) {
>       MPI_Recv( ..., from = peer, tag = (5000 - msg), ... );
>     }
>   }
>
> task 1 to com_size:
>   MPI_Init()
>   for( msg = 0; msg < 5000; msg++ ) {
>     MPI_Send( ..., 0, tag = msg, ... );
>   }
>
> Won't the flow control stop the application from running to
> completion? It's easy to write an application that breaks a particular
> MPI implementation. It doesn't necessarily make this implementation
> non standard compliant.
>
> george.
>
> On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:
>
> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> >
> > To me, the MPI standard is clear that a program like this:
> >
> > task 0:
> > MPI_Init
> > sleep(3000);
> > start receiving messages
> >
> > each of tasks 1 to n-1:
> > MPI_Init
> > loop 5000 times
> > MPI_Send(small message to 0)
> > end loop
> >
> > may send some small messages eagerly if there is space at task 0 but
> > must block each task 1 to n-1 before allowing task 0 to run out of
> > eager buffer space. Doing this requires a token or credit management
> > system in which each task has credits for known buffer space at task
> > 0.
Each task will send eagerly to task 0 until the sender runs out
> > of credits and then must switch to rendezvous protocol. Tasks 1 to
> > n-1 might each do 3 MPI_Sends or 300 MPI_Sends before blocking,
> > depending on how much buffer space there is at task 0 but they would
Re: [OMPI users] openmpi credits for eager messages
Hi George

Sorry - This is not a valid MPI program. It violates the requirement that a program not depend on there being any system buffering. See page 32-33 of MPI 1.1.

Let's simplify to:

Task 0:
MPI_Recv( from 1 with tag 1)
MPI_Recv( from 1 with tag 0)

Task 1:
MPI_Send(to 0 with tag 0)
MPI_Send(to 0 with tag 1)

Without any early arrival buffer (or with eager size set to 0) task 0 will hang in the first MPI_Recv and never post a recv with tag 0. Task 1 will hang in the MPI_Send with tag 0 because it cannot get past it until the matching recv is posted by task 0.

If there is enough early arrival buffer for the first MPI_Send on task 1 to complete and the second MPI_Send to be posted, the example will run. Once both sends are posted by task 1, task 0 will harvest the second send and get out of its first recv. Task 0's second recv can now pick up the message from the early arrival buffer where it had to go to let task 1 complete send 1 and post send 2.

If an application wants to do this kind of order inversion it should use some non blocking operations. For example, if task 0 posted an MPI_Irecv for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait for the Irecv, the example is valid.

I am not aware of any case where the standard allows a correct MPI program to be deadlocked by an implementation limit. It can be failed if it exceeds a limit but I do not think it is ever OK to hang.

Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:

> Please allow me to slightly modify your example. It still follows the
> rules from the MPI standard, so I think it's a 100% standard compliant
> parallel application.
> task 0:
>   MPI_Init()
>   sleep(3000)
>   for( msg = 0; msg < 5000; msg++ ) {
>     for( peer = 0; peer < com_size; peer++ ) {
>       MPI_Recv( ..., from = peer, tag = (5000 - msg), ... );
>     }
>   }
>
> task 1 to com_size:
>   MPI_Init()
>   for( msg = 0; msg < 5000; msg++ ) {
>     MPI_Send( ..., 0, tag = msg, ... );
>   }
>
> Won't the flow control stop the application from running to
> completion? It's easy to write an application that breaks a particular
> MPI implementation. It doesn't necessarily make this implementation
> non standard compliant.
>
> george.
>
> On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:
>
> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> >
> > To me, the MPI standard is clear that a program like this:
> >
> > task 0:
> > MPI_Init
> > sleep(3000);
> > start receiving messages
> >
> > each of tasks 1 to n-1:
> > MPI_Init
> > loop 5000 times
> > MPI_Send(small message to 0)
> > end loop
> >
> > may send some small messages eagerly if there is space at task 0 but
> > must block each task 1 to n-1 before allowing task 0 to run out of
> > eager buffer space. Doing this requires a token or credit management
> > system in which each task has credits for known buffer space at task
> > 0. Each task will send eagerly to task 0 until the sender runs out
> > of credits and then must switch to rendezvous protocol.
> > Tasks 1 to n-1 might each do 3 MPI_Sends or 300 MPI_Sends before
> > blocking, depending on how much buffer space there is at task 0, but
> > they would need to block in some MPI_Send before task 0 blows up.
> >
> > When task 0 wakes up and begins receiving the early arrivals, tasks
> > 1 to n-1 will unblock and resume looping. Allowing the user to shut
> > off eager protocol by setting eager size to 0 does
Re: [OMPI users] openmpi credits for eager messages
To keep this out of the weeds, I have attached a program called "bug3" that illustrates this problem on openmpi 1.2.5 using the openib BTL. In bug3, the process with rank 0 uses all available memory buffering "unexpected" messages from its neighbors. Bug3 is a test-case derived from a real, scalable application (desmond for molecular dynamics) that several experienced MPI developers have worked on. Note the MPI_Send calls of processes N>0 are *blocking*; openmpi silently sends them in the background and overwhelms process 0 due to lack of flow control. It may not be hard to change desmond to work around openmpi's small-message semantics, but a programmer should reasonably be allowed to think a blocking send will block if the receiver cannot handle it yet. Federico -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brightwell, Ronald Sent: Monday, February 04, 2008 3:30 PM To: Patrick Geoffray Cc: Open MPI Users Subject: Re: [OMPI users] openmpi credits for eager messages > > I'm looking at a network where the number of endpoints is large enough that > > everybody can't have a credit to start with, and the "offender" isn't any > > single process, but rather a combination of processes doing N-to-1 where N > > is sufficiently large. I can't just tell one process to slow down. I have > > to tell them all to slow down and do it quickly... > > When you have N->1 patterns, then the hardware flow-control will > throttle the senders, or drop packets if there is no hardware > flow-control. If you don't have HOL blocking but the receiver does not > consume for any reason (busy, sleeping, dead, whatever), then you can > still drop packets on the receiver (NIC, driver, thread) as a last > resort; this is what TCP does. The key is to have exponential backoff (or a > reasonably large resend timeout) to not continue the hammering. 
> > It costs nothing in the common case (unlike the credits approach), but > it does handle corner cases without affecting other nodes too much > (unlike hardware flow-control). Right. For a sufficiently large number of endpoints, flow control has to get pushed out of MPI and down into the network, which is why I don't necessarily want an MPI that does flow control at the user level. > > But you know all that. You are just being mean to your users because you > can :-) The sick part is that I think I envy you... You know it :) -Ron ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users bug3.c Description: bug3.c
Re: [OMPI users] openmpi credits for eager messages
> > I'm looking at a network where the number of endpoints is large enough that > > everybody can't have a credit to start with, and the "offender" isn't any > > single process, but rather a combination of processes doing N-to-1 where N > > is sufficiently large. I can't just tell one process to slow down. I have > > to tell them all to slow down and do it quickly... > > When you have N->1 patterns, then the hardware flow-control will > throttle the senders, or drop packets if there is no hardware > flow-control. If you don't have HOL blocking but the receiver does not > consume for any reason (busy, sleeping, dead, whatever), then you can > still drop packets on the receiver (NIC, driver, thread) as a last > resort; this is what TCP does. The key is to have exponential backoff (or a > reasonably large resend timeout) to not continue the hammering. > > It costs nothing in the common case (unlike the credits approach), but > it does handle corner cases without affecting other nodes too much > (unlike hardware flow-control). Right. For a sufficiently large number of endpoints, flow control has to get pushed out of MPI and down into the network, which is why I don't necessarily want an MPI that does flow control at the user level. > > But you know all that. You are just being mean to your users because you > can :-) The sick part is that I think I envy you... You know it :) -Ron
Re: [OMPI users] openmpi credits for eager messages
Brightwell, Ronald wrote: I'm looking at a network where the number of endpoints is large enough that everybody can't have a credit to start with, and the "offender" isn't any single process, but rather a combination of processes doing N-to-1 where N is sufficiently large. I can't just tell one process to slow down. I have to tell them all to slow down and do it quickly... When you have N->1 patterns, then the hardware flow-control will throttle the senders, or drop packets if there is no hardware flow-control. If you don't have HOL blocking but the receiver does not consume for any reason (busy, sleeping, dead, whatever), then you can still drop packets on the receiver (NIC, driver, thread) as a last resort; this is what TCP does. The key is to have exponential backoff (or a reasonably large resend timeout) to not continue the hammering. It costs nothing in the common case (unlike the credits approach), but it does handle corner cases without affecting other nodes too much (unlike hardware flow-control). But you know all that. You are just being mean to your users because you can :-) The sick part is that I think I envy you... Patrick
Re: [OMPI users] openmpi credits for eager messages
On Mon, Feb 04, 2008 at 02:54:46PM -0500, Richard Treumann wrote: > In my example, each sender task 1 to n-1 will have one rendezvous message > to task 0 at a time. The MPI standard suggests descriptors be small enough > and there be enough descriptor space for reasonable programs. The > standard is clear that unreasonable programs can run out of space and fail. > The standard does not try to quantify reasonableness. You are right about your example, but I was not talking specifically about it. Your example should work with Open MPI over IB/TCP because, while rank 0 sleeps without calling progress, transport-layer flow control should throttle the senders. (SM doesn't have flow control; that is why it fails.) What I was trying to say is that in MPI a process can't fully control its resource usage. -- Gleb.
Re: [OMPI users] openmpi credits for eager messages
Now that this discussion has gone way off into the MPI standard woods :). Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)? There was definitely a bug in 1.2.4 that could cause exactly the behavior you are describing when using the shared memory BTL, due to a silly delayed-initialization bug/optimization. If you are using the OB1 PML (the default), you will still have the possibility of running the receiver out of memory if the unexpected queue grows without bounds. I'll withhold my opinion on what the standard says so that we can perhaps actually help you solve your problem and stay out of the weeds :). Note, however, that in general unexpected messages are a bad idea and thousands of them from one peer to another should be avoided at all costs -- this is just good MPI programming practice. Now, if you are using MX, you can replicate MPICH/MX's behavior (including the very slow part) by using the CM PML (--mca pml cm on the mpirun command line), which will use the MX library message matching and unexpected queue and therefore behave exactly like MPICH/MX. Brian On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote: That would make sense. I was able to break OpenMPI by having Node A wait for messages from Node B. Node B is in fact sleeping while Node C bombards Node A with a few thousand messages. After a while Node B wakes up and sends Node A the message it's been waiting on, but Node A has long since been buried and seg faults. If I decrease the number of messages C is sending, it works properly. This was on OpenMPI 1.2.4 (using, I think, the SM BTL; it might have been MX or TCP, but certainly not infiniband. I could dig up the test and try again if anyone is seriously curious). Trying the same test on MPICH/MX went very, very slow (I don't think they have any clever buffer management) but it didn't crash. 
Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com wrote: Hi, I am readying an openmpi 1.2.5 software stack for use with a many-thousand core cluster. I have a question about sending small messages that I hope can be answered on this list. I was under the impression that if node A wants to send a small MPI message to node B, it must have a credit to do so. The credit assures A that B has enough buffer space to accept the message. Credits are required by the mpi layer regardless of the BTL transport layer used. I have been told by a Voltaire tech that this is not so, the credits are used by the infiniband transport layer to reliably send a message, and is not an openmpi feature. Thanks, Federico
Re: [OMPI users] openmpi credits for eager messages
Gleb In my example, each sender task 1 to n-1 will have one rendezvous message to task 0 at a time. The MPI standard suggests descriptors be small enough and there be enough descriptor space for reasonable programs. The standard is clear that unreasonable programs can run out of space and fail. The standard does not try to quantify reasonableness. This gets really interesting when we talk about hundreds of thousands of tasks. If on a general purpose MPI there are 16 tasks and task 0 cannot hold 1 envelope from each of the other 15, it is probably a poor quality MPI. If there are a million tasks and task 0 can only hold 100,000 envelopes then it is fair to argue that holding 100,000 envelopes is generous and the million task job is not being reasonable. This little example could be reasonable for small task counts and unreasonable for huge task counts. If there are 2 tasks and the single sender posts 15 MPI_ISENDs to task 0, a quality MPI should probably handle that too. If the sender tries to post a million MPI_ISENDs and either sender or receiver runs out of descriptor space after 100,000 it is again fair to fail the job and argue the program is not being reasonable. The line between reasonable and unreasonable application behavior is not a bright, sharp line. A big part of my reason for prodding this is that I think it is better to have the MPI Forum discuss changes to the standard than to have MPI implementors deciding what parts to ignore. If the MPI Forum does bless a mode that allows my example to crash, IBM MPI will support that mode and some of our users will choose to run in that mode. If their applications are "well structured" in certain specific ways they will never have a problem with early arrival overflow. If the standard is unclear then this is the time to make it clear. 
Dick Dick Treumann - MPI Team/TCEM IBM Systems & Technology Group Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 users-boun...@open-mpi.org wrote on 02/04/2008 02:03:20 PM: > On Mon, Feb 04, 2008 at 09:08:45AM -0500, Richard Treumann wrote: > > To me, the MPI standard is clear that a program like this: > > > > task 0: > > MPI_Init > > sleep(3000); > > start receiving messages > > > > each of tasks 1 to n-1: > > MPI_Init > > loop 5000 times > > MPI_Send(small message to 0) > > end loop > > > > May send some small messages eagerly if there is space at task 0 but must > > block each task 1 to n-1 before allowing task 0 to run out of eager buffer > > space. Doing this requires a token or credit management system in which > > each task has credits for known buffer space at task 0. Each task will send > > eagerly to task 0 until the sender runs out of credits and then must switch > > to rendezvous protocol. > And rendezvous messages are not free either. So this approach will only > postpone failure a little bit. > > -- > Gleb. > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openmpi credits for eager messages
> > Not to muddy the point, but if there's enough ambiguity in the Standard > > for people to ignore the progress rule, then I think (hope) there's enough > > ambiguity for people to ignore the sender throttling issue too ;) > > I understand your position, and I used to agree until I was forced to > change my mind by naive users :-) Right. That's what I meant by: "Most of the vendors aren't allowed to have this perspective". > > Poorly written MPI codes won't likely segfault or deadlock because the > progress rule was ignored. However, users will proudly tell you that you > have a memory leak if you don't limit the size of the unexpected queue > and their codes with no flow control blow up. Yep. I don't lose money when I tell these people to go fix their code. I like to think that I actually get paid to tell these people to go fix their code. > > You don't have to make it very efficient (per-sender credits > definitely do not scale), but you need to have a way to stall/slow > the sender when the unexpected queue gets too big. That's quite easy to > do without affecting the common case. Not on my network. I don't have the nice situation that the Standard refers to where one producer is overwhelming the consumer. For a reasonable number of endpoints and a known offending sender, it's pretty straightforward to do user-level credit-based flow control. I'm looking at a network where the number of endpoints is large enough that everybody can't have a credit to start with, and the "offender" isn't any single process, but rather a combination of processes doing N-to-1 where N is sufficiently large. I can't just tell one process to slow down. I have to tell them all to slow down and do it quickly... -Ron
Re: [OMPI users] openmpi credits for eager messages
Ron, Brightwell, Ronald wrote: Not to muddy the point, but if there's enough ambiguity in the Standard for people to ignore the progress rule, then I think (hope) there's enough ambiguity for people to ignore the sender throttling issue too ;) I understand your position, and I used to agree until I was forced to change my mind by naive users :-) Poorly written MPI codes won't likely segfault or deadlock because the progress rule was ignored. However, users will proudly tell you that you have a memory leak if you don't limit the size of the unexpected queue and their codes with no flow control blow up. You don't have to make it very efficient (per-sender credits definitely do not scale), but you need to have a way to stall/slow the sender when the unexpected queue gets too big. That's quite easy to do without affecting the common case. Patrick
Re: [OMPI users] openmpi credits for eager messages
On Mon, Feb 04, 2008 at 09:08:45AM -0500, Richard Treumann wrote: > To me, the MPI standard is clear that a program like this: > > task 0: > MPI_Init > sleep(3000); > start receiving messages > > each of tasks 1 to n-1: > MPI_Init > loop 5000 times >MPI_Send(small message to 0) > end loop > > May send some small messages eagerly if there is space at task 0 but must > block each task 1 to n-1 before allowing task 0 to run out of eager buffer > space. Doing this requires a token or credit management system in which > each task has credits for known buffer space at task 0. Each task will send > eagerly to task 0 until the sender runs out of credits and then must switch > to rendezvous protocol. And rendezvous messages are not free either. So this approach will only postpone failure a little bit. -- Gleb.
Re: [OMPI users] openmpi credits for eager messages
Hi Ron - I am well aware of the scaling problems related to the standard send requirements in MPI. It is a very difficult issue. However, here is what the standard says: MPI 1.2, page 32 lines 29-37 === a standard send operation that cannot complete because of lack of buffer space will merely block, waiting for buffer space to become available or for a matching receive to be posted. This behavior is preferable in many situations. Consider a situation where a producer repeatedly produces new values and sends them to a consumer. Assume that the producer produces new values faster than the consumer can consume them. If buffered sends are used, then a buffer overflow will result. Additional synchronization has to be added to the program so as to prevent this from occurring. If standard sends are used, then the producer will be automatically throttled, as its send operations will block when buffer space is unavailable. If there are people who want to argue that this is unclear or that it should be changed, the MPI Forum can and should take up the discussion. I think this particular wording is pretty clear. The piece of MPI standard wording you quote is somewhat ambiguous: The amount of space available for buffering will be much smaller than program data memory on many systems. Then, it will be easy to write programs that overrun available buffer space. But note that this wording mentions a problem that an application can create but does not say the MPI implementation can fail the job. The language I have pointed to is where the standard says what the MPI implementation must do. The "lack of resource" statement is more about send and receive descriptors than buffer space. If I write a program with an infinite loop of MPI_IRECV postings the standard allows that to fail. 
Dick Dick Treumann - MPI Team/TCEM IBM Systems & Technology Group Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 users-boun...@open-mpi.org wrote on 02/04/2008 12:24:11 PM: > > > Is what George says accurate? If so, it sounds to me like OpenMPI > > does not comply with the MPI standard on the behavior of eager > > protocol. MPICH is getting dinged in this discussion because they > > have complied with the requirements of the MPI standard. IBM MPI > > also complies with the standard. > > > > If there is any debate about whether the MPI standard does (or > > should) require the behavior I describe below then we should move > > the discussion to the MPI 2.1 Forum and get a clarification. > > [...] > > The MPI Standard also says the following about resource limitations: > > Any pending communication operation consumes system resources that are > limited. Errors may occur when lack of resources prevent the execution > of an MPI call. A quality implementation will use a (small) fixed amount > of resources for each pending send in the ready or synchronous mode and > for each pending receive. However, buffer space may be consumed to store > messages sent in standard mode, and must be consumed to store messages > sent in buffered mode, when no matching receive is available. The amount > of space available for buffering will be much smaller than program data > memory on many systems. Then, it will be easy to write programs that > overrun available buffer space. > > Since I work on MPI implementations that are expected to allow applications > to scale to tens of thousands of processes, I don't want the overhead of > a user-level flow control protocol that penalizes scalable applications in > favor of non-scalable ones. 
I also don't want an MPI implementation that > hides such non-scalable application behavior, but rather exposes it at lower > processor counts -- preferably in a way that makes the application developer > aware of the resources requirements of their code and allows them to make > the appropriate choice regarding the structure of their code, the underlying > protocols, and the amount of buffer resources. > > But I work in a place where codes are expected to scale and not just work. > Most of the vendors aren't allowed to have this perspective > > -Ron > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openmpi credits for eager messages
> Is what George says accurate? If so, it sounds to me like OpenMPI > does not comply with the MPI standard on the behavior of eager > protocol. MPICH is getting dinged in this discussion because they > have complied with the requirements of the MPI standard. IBM MPI > also complies with the standard. > > If there is any debate about whether the MPI standard does (or > should) require the behavior I describe below then we should move > the discussion to the MPI 2.1 Forum and get a clarification. > [...] The MPI Standard also says the following about resource limitations: Any pending communication operation consumes system resources that are limited. Errors may occur when lack of resources prevent the execution of an MPI call. A quality implementation will use a (small) fixed amount of resources for each pending send in the ready or synchronous mode and for each pending receive. However, buffer space may be consumed to store messages sent in standard mode, and must be consumed to store messages sent in buffered mode, when no matching receive is available. The amount of space available for buffering will be much smaller than program data memory on many systems. Then, it will be easy to write programs that overrun available buffer space. Since I work on MPI implementations that are expected to allow applications to scale to tens of thousands of processes, I don't want the overhead of a user-level flow control protocol that penalizes scalable applications in favor of non-scalable ones. I also don't want an MPI implementation that hides such non-scalable application behavior, but rather exposes it at lower processor counts -- preferably in a way that makes the application developer aware of the resources requirements of their code and allows them to make the appropriate choice regarding the structure of their code, the underlying protocols, and the amount of buffer resources. But I work in a place where codes are expected to scale and not just work. 
Most of the vendors aren't allowed to have this perspective. -Ron
Re: [OMPI users] openmpi credits for eager messages
That would make sense. I was able to break OpenMPI by having Node A wait for messages from Node B. Node B is in fact sleeping while Node C bombards Node A with a few thousand messages. After a while Node B wakes up and sends Node A the message it's been waiting on, but Node A has long since been buried and seg faults. If I decrease the number of messages C is sending, it works properly. This was on OpenMPI 1.2.4 (using, I think, the SM BTL; it might have been MX or TCP, but certainly not infiniband. I could dig up the test and try again if anyone is seriously curious). Trying the same test on MPICH/MX went very, very slow (I don't think they have any clever buffer management) but it didn't crash. Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com wrote: > Hi, > > I am readying an openmpi 1.2.5 software stack for use with a > many-thousand core cluster. I have a question about sending small > messages that I hope can be answered on this list. > > I was under the impression that if node A wants to send a small MPI > message to node B, it must have a credit to do so. The credit assures A > that B has enough buffer space to accept the message. Credits are > required by the mpi layer regardless of the BTL transport layer used. > > I have been told by a Voltaire tech that this is not so, the credits are > used by the infiniband transport layer to reliably send a message, and > is not an openmpi feature. > > Thanks, > Federico > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- --Kris 叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.]
Re: [OMPI users] openmpi credits for eager messages
The Voltaire tech was right. There is no credit management at the upper level; each BTL is allowed to do it (if needed). This doesn't mean the transport is not reliable. Most of the devices have internal flow control, and Open MPI relies on it instead of implementing our own. However, the devices that do not provide such a feature in their low-level drivers or hardware have it implemented at the BTL layer. As an example, infiniband has a flow control mechanism implemented in the BTL. george. On Feb 1, 2008, at 3:05 PM, Sacerdoti, Federico wrote: Hi, I am readying an openmpi 1.2.5 software stack for use with a many-thousand core cluster. I have a question about sending small messages that I hope can be answered on this list. I was under the impression that if node A wants to send a small MPI message to node B, it must have a credit to do so. The credit assures A that B has enough buffer space to accept the message. Credits are required by the mpi layer regardless of the BTL transport layer used. I have been told by a Voltaire tech that this is not so, the credits are used by the infiniband transport layer to reliably send a message, and is not an openmpi feature. Thanks, Federico
[OMPI users] openmpi credits for eager messages
Hi, I am readying an openmpi 1.2.5 software stack for use with a many-thousand core cluster. I have a question about sending small messages that I hope can be answered on this list. I was under the impression that if node A wants to send a small MPI message to node B, it must have a credit to do so. The credit assures A that B has enough buffer space to accept the message. Credits are required by the mpi layer regardless of the BTL transport layer used. I have been told by a Voltaire tech that this is not so, the credits are used by the infiniband transport layer to reliably send a message, and is not an openmpi feature. Thanks, Federico