To keep this out of the weeds, I have attached a program called "bug3" that illustrates this problem on Open MPI 1.2.5 using the openib BTL. In bug3, the process with rank 0 uses all available memory buffering "unexpected" messages from its neighbors.
Bug3 is a test case derived from a real, scalable application (Desmond, for molecular dynamics) that several experienced MPI developers have worked on. Note that the MPI_Send calls of processes N>0 are *blocking*; Open MPI silently sends them in the background and overwhelms process 0 due to the lack of flow control. It may not be hard to change Desmond to work around Open MPI's small-message semantics, but a programmer should reasonably be allowed to assume that a blocking send will block if the receiver cannot yet handle it.

Federico

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brightwell, Ronald
Sent: Monday, February 04, 2008 3:30 PM
To: Patrick Geoffray
Cc: Open MPI Users
Subject: Re: [OMPI users] openmpi credits for eager messages

> > I'm looking at a network where the number of endpoints is large enough that
> > everybody can't have a credit to start with, and the "offender" isn't any
> > single process, but rather a combination of processes doing N-to-1 where N
> > is sufficiently large. I can't just tell one process to slow down. I have
> > to tell them all to slow down and do it quickly...
>
> When you have N->1 patterns, then the hardware flow-control will
> throttle the senders, or drop packets if there is no hardware
> flow-control. If you don't have HOL blocking but the receiver does not
> consume for any reason (busy, sleeping, dead, whatever), then you can
> still drop packets on the receiver (NIC, driver, thread) as a last
> resort; this is what TCP does. The key is to have exponential backoff (or a
> reasonably large resend timeout) so the senders do not keep hammering.
>
> It costs nothing in the common case (unlike the credits approach), but
> it does handle corner cases without affecting other nodes too much
> (unlike hardware flow-control).

Right. For a sufficiently large number of endpoints, flow control has to get pushed out of MPI and down into the network, which is why I don't necessarily want an MPI that does flow control at the user level.

> But you know all that. You are just being mean to your users because you
> can :-) The sick part is that I think I envy you...

You know it :)

-Ron
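Since the attached bug3.c is not reproduced in this archive, the following is a minimal sketch of the N-to-1 pattern described above; the actual test case may differ in message size, message count, and how the receiver is slowed down.

/* Hypothetical sketch of the bug3 pattern: ranks > 0 issue blocking
 * MPI_Send calls to rank 0, which consumes messages far more slowly
 * than they arrive.  With eager delivery and no sender-side flow
 * control, rank 0 accumulates "unexpected" messages until it runs
 * out of memory. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MSG_SIZE 1024       /* small enough to go over the eager path */
#define NUM_MSGS 1000000    /* enough messages to exhaust memory on rank 0 */

int main(int argc, char **argv)
{
    int rank, size, i, src;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(MSG_SIZE);
    memset(buf, 0, MSG_SIZE);

    if (rank == 0) {
        /* Receiver is deliberately slow: every eager message that arrives
         * before its receive is posted is buffered as "unexpected" and
         * consumes memory on rank 0. */
        for (i = 0; i < NUM_MSGS; i++) {
            for (src = 1; src < size; src++) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            usleep(1000);   /* simulate a busy receiver */
        }
    } else {
        /* Senders issue blocking MPI_Send in a tight loop.  Because the
         * messages are sent eagerly, these calls return immediately and
         * the senders run far ahead of the receiver. */
        for (i = 0; i < NUM_MSGS; i++) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}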
Attachment: bug3.c
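One possible user-level workaround of the kind Federico alludes to (not taken from Desmond or the attached bug3.c) is a simple credit scheme: each sender stops after a fixed number of messages and waits for an acknowledgment from rank 0, which bounds the unexpected-message memory on the receiver. Using MPI_Ssend instead of MPI_Send would have a similar effect, since a synchronous send does not complete until the matching receive has started. A sketch with hypothetical sizes and counts:

/* Sketch of a user-level credit scheme: each sender may have at most
 * CREDITS messages outstanding before it must wait for an ack from
 * rank 0. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE 1024
#define NUM_MSGS 100000
#define CREDITS  16         /* messages allowed in flight per sender */

int main(int argc, char **argv)
{
    int rank, size, i, src, ack = 0;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(MSG_SIZE);
    memset(buf, 0, MSG_SIZE);

    if (rank == 0) {
        for (i = 0; i < NUM_MSGS; i++) {
            for (src = 1; src < size; src++) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                /* Return a credit to this sender after every CREDITS
                 * messages received from it. */
                if ((i + 1) % CREDITS == 0)
                    MPI_Send(&ack, 1, MPI_INT, src, 1, MPI_COMM_WORLD);
            }
        }
    } else {
        for (i = 0; i < NUM_MSGS; i++) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            /* Block until rank 0 grants more credits, so at most CREDITS
             * messages from this sender can sit unexpected on rank 0. */
            if ((i + 1) % CREDITS == 0)
                MPI_Recv(&ack, 1, MPI_INT, 0, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}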