Re: [OMPI users] openmpi credits for eager messages

2008-02-07 Thread Jeff Squyres
What I missed in this whole conversation is that the pieces of text  
that Ron and Dick are citing are *on the same page* in the MPI spec;  
they're not disparate parts of the spec that accidentally overlap in  
discussion scope.


Specifically, it says:

   Resource limitations

   Any pending communication operation consumes system resources that are
   limited. Errors may occur when lack of resources prevent the execution
   of an MPI call. A quality implementation will use a (small) fixed amount
   of resources for each pending send in the ready or synchronous mode and
   for each pending receive. However, buffer space may be consumed to store
   messages sent in standard mode, and must be consumed to store messages
   sent in buffered mode, when no matching receive is available. The amount
   of space available for buffering will be much smaller than program data
   memory on many systems. Then, it will be easy to write programs that
   overrun available buffer space.
...12 lines down on that page, on the same page, in the same section...
   Consider a situation where a producer repeatedly produces new values
   and sends them to a consumer. Assume that the producer produces new
   values faster than the consumer can consume them.
...skip 2 sentences about buffered sends...
   If standard sends are used, then the producer will be automatically
   throttled, as its send operations will block when buffer space is
   unavailable.

I find that to be unambiguous.

1. A loop of MPI_ISENDs on a producer can cause a malloc failure  
(can't malloc a new MPI_Request), and that's an error.  Tough luck.


2. A loop of MPI_SENDs on a producer can run a slow-but-MPI-active  
consumer out of buffer space if all the incoming messages are queued  
up (e.g., in the unexpected queue).  The language above is pretty  
clear about this: MPI_SEND on the producer is supposed to block at  
this point.


FWIW: Open MPI does support this mode of operation, as George and Gleb  
noted (by setting the eager size to 0, therefore forcing *all* sends  
to be synchronous -- a producer cannot "run ahead" for a while and  
eventually be throttled when receive buffering is exhausted), but a)  
it's not the default, and b) it's not documented this way.
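
For reference, the "eager size 0" mode mentioned above is selected per BTL
through the eager-limit MCA parameters; the exact parameter names below are
an assumption based on the 1.2-series openib/sm/tcp BTLs:

   mpirun --mca btl_openib_eager_limit 0 --mca btl_sm_eager_limit 0 \
          --mca btl_tcp_eager_limit 0 -np 64 ./a.out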




On Feb 4, 2008, at 1:29 PM, Richard Treumann wrote:


Hi Ron -

I am well aware of the scaling problems related to the standard send
requirements in MPI. It is a very difficult issue.


However, here is what the standard says: MPI 1.2, page 32 lines 29-37

===
a standard send operation that cannot complete because of lack of  
buffer space will merely block, waiting for buffer space to become  
available or for a matching receive to be posted. This behavior is  
preferable in many situations. Consider a situation where a producer  
repeatedly produces new values and sends them to a consumer. Assume  
that the producer produces new values faster than the consumer can  
consume them. If buffered sends are used, then a buffer overflow  
will result. Additional synchronization has to be added to the  
program so as to prevent this from occurring. If standard sends are  
used, then the producer will be
automatically throttled, as its send operations will block when  
buffer space is unavailable.



If there are people who want to argue that this is unclear or that  
it should be changed, the MPI Forum can and should take up the  
discussion. I think this particular wording is pretty clear.


The piece of MPI standard wording you quote is somewhat ambiguous:

The amount
of space available for buffering will be much smaller than program data
memory on many systems. Then, it will be easy to write programs that
overrun available buffer space.

But note that this wording mentions a problem that an application  
can create but does not say the MPI implementation can fail the job.  
The language I have pointed to is where the standard says what the  
MPI implementation must do.


The "lack of resource" statement is more about send and receive  
descriptors than buffer space. If I write a program with an infinite  
loop of MPI_IRECV postings the standard allows that to fail.



Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 12:24:11 PM:

>
> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> > [...]
>
> The MPI 

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Richard Treumann

Ron's comments are probably dead on for an application like bug3.

If bug3 is long running and libmpi is doing eager protocol buffer
management as I contend the standard requires, then the producers will not
get far ahead of the consumer before they are forced to synchronous send
under the covers anyway.  From then on, producers will run no faster than
their output can be absorbed.  They will spend the nonproductive parts of
their time blocked on either MPI_Send or MPI_Ssend.  The job will not
finish until the consumer finishes because the consumer is a constant
bottleneck anyway.  The slow consumer is the major drag on scalability. As
long as the producers can be expected to outrun the consumer for the life
of the job you will probably find it hard to measure a difference between
synchronous send and flow controlled standard send.

Eager protocol gets more interesting when the pace of the consumer and of
the producers is variable.  If the consumer can absorb a message per
millisecond and the production rate is close to one message per millisecond
but fluctuates a bit then eager protocol may speed the whole job
significantly. The producers can never get ahead with synchronous send even
in a phase when they might be able to create a message every 1/2
millisecond. The producers spend half this easy phase blocked in MPI_Ssend.
If producers now enter a compute intensive phase where messages can only be
generated once every 2 milliseconds the consumer spends time idle.  If the
consumer had been able to accumulate queued messages with eager protocol
when the producers were able to run faster it could now make itself useful
catching up.

Both producers and consumer would come closer to 100% productive work and
the job would finish sooner.
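
For concreteness, here is a hedged sketch (routine names and message sizes
are assumptions, not taken from bug3) of the producer loop under discussion:
with MPI_Send the producer may run ahead by whatever eager buffering the
consumer can hold, while swapping in MPI_Ssend ties every iteration to the
consumer's pace.

#include <mpi.h>

void produce_next(double *msg, int len);   /* hypothetical, variable-rate work */

void producer_loop(int consumer, int nmsgs, double *msg, int len)
{
    int i;
    for (i = 0; i < nmsgs; i++) {
        produce_next(msg, len);
        /* Standard send: completes eagerly while buffer space remains at
         * the consumer, and throttles the producer once it runs out. */
        MPI_Send(msg, len, MPI_DOUBLE, consumer, 0, MPI_COMM_WORLD);
        /* MPI_Ssend(msg, len, MPI_DOUBLE, consumer, 0, MPI_COMM_WORLD);
         * would instead synchronize with the consumer on every message. */
    }
}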

   Dick


Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/05/2008 01:26:24 PM:

> > Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> > reasonable memory usage and the execution proceeds normally.
> >
> > Re scalable: One second. I know well bug3 is not scalable, and when to
> > use MPI_Isend. The point is programmers want to count on the MPI spec as
> > written, as Richard pointed out. We want to send small messages quickly
> > and efficiently, without the danger of overloading the receiver's
> > resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().
>
> Your last statement is not necessarily true.  By synchronizing processes
> using MPI_Ssend(), you can potentially avoid large numbers of unexpected
> messages that need to be buffered and copied, and that also need to be
> searched every time a receive is posted.  There is no guarantee that the
> protocol overhead on each message incurred with MPI_Ssend() slows down an
> application more than the buffering, copying, and searching overhead of a
> large number of unexpected messages.
>
> It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong
> micro-benchmarks, but the length of the unexpected message queue doesn't
> have to get very long before they are about the same.
>
> >
> > Since identifying this behavior we have implemented the desired flow
> > control in our application.
>
> It would be interesting to see performance results comparing doing flow
> control in the application versus having MPI do it for you
>
> -Ron
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Brightwell, Ronald
> Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> reasonable memory usage and the execution proceeds normally.
> 
> Re scalable: One second. I know well bug3 is not scalable, and when to
> use MPI_Isend. The point is programmers want to count on the MPI spec as
> written, as Richard pointed out. We want to send small messages quickly
> and efficiently, without the danger of overloading the receiver's
> resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().

Your last statement is not necessarily true.  By synchronizing processes
using MPI_Ssend(), you can potentially avoid large numbers of unexpected
messages that need to be buffered and copied, and that also need to be
searched every time a receive is posted.  There is no guarantee that the
protocol overhead on each message incurred with MPI_Ssend() slows down an
application more than the buffering, copying, and searching overhead of a
large number of unexpected messages.

It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong
micro-benchmarks, but the length of the unexpected message queue doesn't
have to get very long before they are about the same.

> 
> Since identifying this behavior we have implemented the desired flow
> control in our application.

It would be interesting to see performance results comparing doing flow
control in the application versus having MPI do it for you

-Ron




Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Richard Treumann

So with an Isend your program becomes valid MPI and a very nice
illustration of why the MPI standard cannot limit envelopes (or send/recv
descriptors) and why at some point the number of descriptors can blow the
limits. It also illustrates how the management of eager messages remains
workable. (Not the same as affordable or appropriate. I agree it has
serious scaling issues.) Let's assume there is managed early arrival space
for 10 messages per sender.

Each MPI_Isend generates an envelope that goes to the destination. For your
program to unwind properly, every envelope must be delivered to the
destination.  The first (blocking) MPI_Recv is looking for the tag in the
last envelope so if libmpi does not deliver all 5000 envelopes per sender,
the first MPI_Recv will block forever.  It is not acceptable for a valid
MPI program to deadlock.  If the destination cannot hold all the envelopes
there is no choice but to fail the job. The standard allows this. The Forum
considered it to be better to fail a job than to deadlock it.

If each sender sends its first 10 messages eagerly the send side tokens
will be used up and the buffer space at the destination will fill up but
not overflow.  The senders now fall back to rendezvous for their remaining
4990 MPI_Isends. The MPI_Isends cannot block.  They send envelopes as fast
as the loop can run but the user send buffers involved cannot be altered
until the waits occur.  Once the last sent envelope with tag 5000 arrives
and matches the posted MPI_Recv, an OK_to_send goes back to the sender and
the data can be moved from the still intact send buffer to the waiting
receive buffer.  The MPI_Waits for the Isend requests can be done in any
order but no send buffer can be changed until the corresponding MPI_Wait
returns. No system buffer needed for message data.

The MPI_Recvs being posted in reverse order (5000, 4999 ... 11) each ship
OK_to_send and data flows directly from send to recv buffers.  Finally the
MPI_Recvs for tags (10 ... 1) get posted and pull their message data from
the early arrival space. The program has unwound correctly and as the early
arrival space frees up, credits can be returned to the sender.

Performance discussions aside - the semantic is clean and reliable.
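
For readers who want to see the shape of the code being described, here is a
minimal sketch (not the actual bug3 or George's source; counts, contents and
the exact tag values 1..5000 are assumptions chosen so the send tags match
the receiver's 5000..1 order):

#include <mpi.h>

#define NMSG 5000

int main(int argc, char *argv[])
{
    int rank, size, msg, peer;
    static int sendbuf[NMSG], recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Blocking receives with tags in reverse order: NMSG, NMSG-1, ..., 1 */
        for (msg = 0; msg < NMSG; msg++)
            for (peer = 1; peer < size; peer++)
                MPI_Recv(&recvbuf, 1, MPI_INT, peer, NMSG - msg,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        /* Nonblocking sends: none of these may block, but no send buffer
         * may be reused until the corresponding wait completes. */
        static MPI_Request req[NMSG];
        for (msg = 0; msg < NMSG; msg++) {
            sendbuf[msg] = msg;
            MPI_Isend(&sendbuf[msg], 1, MPI_INT, 0, msg + 1,
                      MPI_COMM_WORLD, &req[msg]);
        }
        MPI_Waitall(NMSG, req, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}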

  Thanks - Dick

PS - If anyone responds to this I hope you will state clearly whether you
want to talk about:

- What does the standard require?
or
- What should the standard require?

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 06:04:22 PM:

> Richard,
>
> You're absolutely right. What a shame :) If I had spent less time
> drawing the boxes around the code I might have noticed the typo. The
> Send should be an Isend.
>
>george.
>
> On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:
>
> > Hi George
> >
> > Sorry - This is not a valid MPI program. It violates the requirement
> > that a program not depend on there being any system buffering. See
> > page 32-33 of MPI 1.1
> >
> > Lets simplify to:
> > Task 0:
> > MPI_Recv( from 1 with tag 1)
> > MPI_Recv( from 1 with tag 0)
> >
> > Task 1:
> > MPI_Send(to 0 with tag 0)
> > MPI_Send(to 0 with tag 1)
> >
> > Without any early arrival buffer (or with eager size set to 0) task
> > 0 will hang in the first MPI_Recv and never post a recv with tag 0.
> > Task 1 will hang in the MPI_Send with tag 0 because it cannot get
> > past it until the matching recv is posted by task 0.
> >
> > If there is enough early arrival buffer for the first MPI_Send on
> > task 1 to complete and the second MPI_Send to be posted the example
> > will run. Once both sends are posted by task 1, task 0 will harvest
> > the second send and get out of its first recv. Task 0's second recv
> > can now pick up the message from the early arrival buffer where it
> > had to go to let task 1 complete send 1 and post send 2.
> >
> > If an application wants to do this kind of order inversion it should
> > use some non blocking operations. For example, if task 0 posted an
> > MPI_Irecv for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait
> > for the Irecv, the example is valid.
> >
> > I am not aware of any case where the standard allows a correct MPI
> > program to be deadlocked by an implementation limit. It can be
> > failed if it exceeds a limit but I do not think it is ever OK to hang.
> >
> > Dick
> >
> > Dick Treumann - MPI Team/TCEM
> > IBM Systems & Technology Group
> > Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > Tele (845) 433-7846 Fax (845) 433-8363
> >
> >
> > users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:
> >
> > > Please allow me to slightly modify your example. It still follows the
> > > rules from the MPI standard, so I think it's a 100% standard compliant
> > > parallel application.
> > >
> > > 

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread 8mj6tc902
Wow this sparked a much more heated discussion than I was expecting. I
was just commenting that the behaviour the original author (Federico
Sacerdoti) mentioned would explain something I observed in one of my
early trials of OpenMPI. But anyway, because it seems that quite a few
people were interested, I've attached a simplified version of the test I
was describing (with all the timing checks and some of the crazier
output removed).

Now that I go back and retest this it turns out that it wasn't actually
a segfault that was killing it, but running out of memory as you and
others have predicted.

Brian W. Barrett brbarret-at-open-mpi.org |openmpi-users/Allow| wrote:
> Now that this discussion has gone way off into the MPI standard woods :).
> 
> Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)? 
> There was definitely a bug in 1.2.4 that could cause exactly the behavior 
> you are describing when using the shared memory BTL, due to a silly 
> delayed initialization bug/optimization.

I'm still using Open MPI 1.2.4 and actually the SM BTL seems to be the
hardest to break (I guess I'm dodging the bullet on that delayed
initialization bug you're referring to).

> If you are using the OB1 PML (the default), you will still have the 
> possibility of running the receiver out of memory if the unexpected queue 
> grows without bounds.  I'll withhold my opinion on what the standard says 
> so that we can perhaps actually help you solve your problem and stay out 
> of the weeds :).  Note however, that in general unexpected messages are a 
> bad idea and thousands of them from one peer to another should be avoided 
> at all costs -- this is just good MPI programming practice.

Actually I was expecting to break something with this test. I just
wanted to find out where it broke. Lesson learned, I wrote my more
serious programs doing exactly that (no unexpected messages). I was just
surprised that the default Open MPI settings allowed me to flood the
system so easily whereas MPICH/MX still finished no matter what I threw
at it (albeit with terrible performance in the bad cases).

> Now, if you are using MX, you can replicate MPICH/MX's behavior (including 
> the very slow part) by using the CM PML (--mca pml cm on the mpirun 
> command line), which will use the MX library message matching and 
> unexpected queue and therefore behave exactly like MPICH/MX.

That works exactly as you described, and it does indeed prevent memory
usage from going wild due to the unexpected messages.

Thanks for your help! (and to the others for the educational discussion!)

> 
> Brian
> 
> 
> On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote:
> 
>> That would make sense. I was able to break OpenMPI by having Node A wait for
>> messages from Node B. Node B is in fact sleeping while Node C bombards
>> Node A with a few thousand messages. After a while Node B wakes up and
>> sends Node A the message it's been waiting on, but Node A has long since
>> been buried and seg faults. If I decrease the number of messages C is
>> sending, it works properly. This was on OpenMPI 1.2.4 (using I think the
>> SM BTL (might have been MX or TCP, but certainly not infiniband. I could
>> dig up the test and try again if anyone is seriously curious).
>>
>> Trying the same test on MPICH/MX went very very slow (I don't think they
>> have any clever buffer management) but it didn't crash.
>>
>> Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
>> |openmpi-users/Allow| wrote:
>>> Hi,
>>>
>>> I am readying an openmpi 1.2.5 software stack for use with a
>>> many-thousand core cluster. I have a question about sending small
>>> messages that I hope can be answered on this list.
>>>
>>> I was under the impression that if node A wants to send a small MPI
>>> message to node B, it must have a credit to do so. The credit assures A
>>> that B has enough buffer space to accept the message. Credits are
>>> required by the mpi layer regardless of the BTL transport layer used.
>>>
>>> I have been told by a Voltaire tech that this is not so, the credits are
>>> used by the infiniband transport layer to reliably send a message, and
>>> is not an openmpi feature.
>>>
>>> Thanks,
>>> Federico
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
--Kris

叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]
// Note: the header names below were stripped by the list archiver; <mpi.h>
// and <cstdlib> (for atoi) are certainly needed, the rest are reconstructed.
#include <mpi.h>
#include <iostream>
#include <cstring>
#include <cstdio>

#include <cstdlib> //for atoi (in case someone doesn't have boost)

const int buflen=5000;

int main(int argc, char *argv[]) {
  using namespace std;
  int reps=1000;
  if(argc>1){ //optionally specify number of repeats on the command line
reps=atoi(argv[1]);
  }

  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Gleb Natapov
On Tue, Feb 05, 2008 at 08:07:59AM -0500, Richard Treumann wrote:
> There is no misunderstanding of the MPI standard or the definition of
> blocking in the bug3 example.  Both bug 3 and the example I provided are
> valid MPI.
> 
> As you say, blocking means the send buffer can be reused when the MPI_Send
> returns.  This is exactly what bug3 is counting on.
> 
> MPI is a reliable protocol which means that once MPI_Send returns, the
> application can assume the message will be delivered once a matching recv
> is posted.  There are only two ways I can think of for MPI to keep that
> guarantee.
> 1) Before return from MPI_Send, copy the envelop and data to some buffer
> that will be preserved until the MPI_Recv gets posted
> 2) delay the return from MPI_Send until the MPI_Recv is posted and then
> move data from the intact send buffer to the posted receive buffer. Return
> from MPI_Send
> 
> The requirement in the standard is that if libmpi takes option 1, the
> return from MPI_Send cannot occur unless there is certainty the buffer
> space exists. Libmpi cannot throw the message over the wall and fail the
> job if it cannot be buffered.
As I said, Open MPI has flow control at the transport layer to prevent
messages from being dropped by the network. This mechanism should allow a
program like yours to work, but bug3 is another story because it generates
a huge amount of unexpected messages and Open MPI has no mechanism to keep
unexpected messages from blowing up memory consumption. Your point is that
according to the MPI spec this is not valid behaviour. I am not going to
argue with that, especially as you can get the desired behaviour by setting
the eager limit to zero.

> users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM:
> 
> > On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> > > Bug3 is a test-case derived from a real, scalable application (desmond
> > > for molecular dynamics) that several experienced MPI developers have
> > > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> > > openmpi silently sends them in the background and overwhelms process 0
> > > due to lack of flow control.
> > MPI_Send is *blocking* in the MPI sense of the word, i.e. when MPI_Send returns
> > the send buffer can be reused. MPI spec section 3.4.
> >
> > --
> >  Gleb.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Richard Treumann
Hi Gleb

There is no misunderstanding of the MPI standard or the definition of
blocking in the bug3 example.  Both bug 3 and the example I provided are
valid MPI.

As you say, blocking means the send buffer can be reused when the MPI_Send
returns.  This is exactly what bug3 is counting on.

MPI is a reliable protocol which means that once MPI_Send returns, the
application can assume the message will be delivered once a matching recv
is posted.  There are only two ways I can think of for MPI to keep that
guarantee.
1) Before return from MPI_Send, copy the envelop and data to some buffer
that will be preserved until the MPI_Recv gets posted
2) delay the return from MPI_Send until the MPI_Recv is posted, then
move data from the intact send buffer to the posted receive buffer and
return from MPI_Send

The requirement in the standard is that if libmpi takes option 1, the
return from MPI_Send cannot occur unless there is certainty the buffer
space exists. Libmpi cannot throw the message over the wall and fail the
job if it cannot be buffered.
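
Schematically (purely hypothetical helper names, not any real MPI library's
internals), the two options look like this:

int  eager_buffer_available(int dest, int len);               /* hypothetical */
void copy_to_eager_buffer(int dest, const void *buf, int len);
void wait_for_matching_recv(int dest);
void move_to_posted_recv(int dest, const void *buf, int len);

int standard_send(const void *buf, int len, int dest)
{
    if (eager_buffer_available(dest, len)) {
        /* Option 1: copy envelope and data into buffer space that is known
         * to exist, then return immediately. */
        copy_to_eager_buffer(dest, buf, len);
    } else {
        /* Option 2: rendezvous -- do not return until the matching receive
         * is posted, then move data from the still intact send buffer. */
        wait_for_matching_recv(dest);
        move_to_posted_recv(dest, buf, len);
    }
    return 0;
}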

 Dick


Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM:

> On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> > Bug3 is a test-case derived from a real, scalable application (desmond
> > for molecular dynamics) that several experienced MPI developers have
> > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> > openmpi silently sends them in the background and overwhelms process 0
> > due to lack of flow control.
> MPI_Send is *blocking* in the MPI sense of the word, i.e. when MPI_Send returns
> the send buffer can be reused. MPI spec section 3.4.
>
> --
>  Gleb.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Gleb Natapov
On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> Bug3 is a test-case derived from a real, scalable application (desmond
> for molecular dynamics) that several experienced MPI developers have
> worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> openmpi silently sends them in the background and overwhelms process 0
> due to lack of flow control.
MPI_Send is *blocking* in the MPI sense of the word, i.e. when MPI_Send returns
the send buffer can be reused. MPI spec section 3.4.

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread George Bosilca

Richard,

You're absolutely right. What a shame :) If I had spent less time
drawing the boxes around the code I might have noticed the typo. The
Send should be an Isend.


  george.

On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:


Hi George

Sorry - This is not a valid MPI program. It violates the requirement  
that a program not depend on there being any system buffering. See  
page 32-33 of MPI 1.1


Lets simplify to:
Task 0:
MPI_Recv( from 1 with tag 1)
MPI_Recv( from 1 with tag 0)

Task 1:
MPI_Send(to 0 with tag 0)
MPI_Send(to 0 with tag 1)

Without any early arrival buffer (or with eager size set to 0) task  
0 will hang in the first MPI_Recv and never post a recv with tag 0.  
Task 1 will hang in the MPI_Send with tag 0 because it cannot get  
past it until the matching recv is posted by task 0.


If there is enough early arrival buffer for the first MPI_Send on  
task 1 to complete and the second MPI_Send to be posted the example  
will run. Once both sends are posted by task 1, task 0 will harvest  
the second send and get out of its first recv. Task 0's second recv  
can now pick up the message from the early arrival buffer where it  
had to go to let task 1 complete send 1 and post send 2.


If an application wants to do this kind of order inversion it should  
use some non blocking operations. For example, if task 0 posted an  
MPI_Irecv for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait  
for the Irecv, the example is valid.


I am not aware of any case where the standard allows a correct MPI  
program to be deadlocked by an implementation limit. It can be  
failed if it exceeds a limit but I do not think it is ever OK to hang.


Dick

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:

> Please allow me to slightly modify your example. It still follows the
> rules from the MPI standard, so I think it's a 100% standard compliant
> parallel application.
>
> ++
> | task 0:|
> ++
> | MPI_Init() |
> | sleep(3000)|
> | for( msg = 0; msg < 5000; msg++ ) {|
> |   for( peer = 0; peer < com_size; peer++ ) {   |
> | MPI_Recv( ..., from = peer, tag = (5000 - msg),... );  |
> |   }|
> | }  |
> ++
>
> ++
> |   task 1 to com_size:  |
> ++
> | MPI_Init() |
> | for( msg = 0; msg < 5000; msg++ ) {|
> |   MPI_Send( ..., 0, tag = msg, ... );  |
> | }  |
> ++
>
> Isn't it the case that the flow control will stop the application from
> running to completion? It's easy to write an application that breaks a
> particular MPI implementation. It doesn't necessarily make this
> implementation non standard compliant.
>
> george.
>
> On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:
>
> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> >
> > To me, the MPI standard is clear that a program like this:
> >
> > task 0:
> > MPI_Init
> > sleep(3000);
> > start receiving messages
> >
> > each of tasks 1 to n-1:
> > MPI_Init
> > loop 5000 times
> > MPI_Send(small message to 0)
> > end loop
> >
> > May send some small messages eagerly if there is space at task 0 but
> > must block each task 1 to n-1 before allowing task 0 to run out of
> > eager buffer space. Doing this requires a token or credit management
> > system in which each task has credits for known buffer space at task
> > 0. Each task will send eagerly to task 0 until the sender runs out
> > of credits and then must switch to rendezvous protocol. Tasks 1 to
> > n-1 might each do 3 MPI_Sends or 300 MPI_Sends before blocking,
> > depending on how much buffer space there is at task 0 but they would

Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Richard Treumann

Hi George

Sorry - This is not a valid MPI program.  It violates the requirement that
a program not depend on there being any system buffering.  See page 32-33
of MPI 1.1

 Lets simplify to:
Task 0:
MPI_Recv( from 1 with tag 1)
MPI_Recv( from 1 with tag 0)

Task 1:
MPI_Send(to 0 with tag 0)
MPI_Send(to 0 with tag 1)

Without any early arrival buffer (or with eager size set to 0) task 0 will
hang in the first MPI_Recv and never post a recv with tag 0.  Task 1 will
hang in the MPI_Send with tag 0 because it cannot get past it until the
matching recv is posted by task 0.

If there is enough early arrival buffer for the first MPI_Send on task 1 to
complete and the second MPI_Send to be posted the example will run. Once
both sends are posted by task 1, task 0 will harvest the second send and
get out of its first recv. Task 0's second recv can now pick up the message
from the early arrival buffer where it had to go to let task 1 complete send
1 and post send 2.

If an application wants to do this kind of order inversion it should use
some non blocking operations.  For example, if task 0 posted an MPI_Irecv
for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait for the Irecv, the
example is valid.
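
A minimal sketch of that valid reordering (buffer types and counts are
assumptions) would be:

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, a = 0, b = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Irecv(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);  /* tag 1 */
        MPI_Recv(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* tag 0 */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);          /* tag 0 */
        MPI_Send(&b, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);          /* tag 1 */
    }

    MPI_Finalize();
    return 0;
}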

I am not aware of any case where the standard allows a correct MPI program
to be deadlocked by an implementation limit.  It can be failed if it
exceeds a limit but I do not think it is ever OK to hang.

 Dick

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:

> Please allow me to slightly modify your example. It still follows the
> rules from the MPI standard, so I think it's a 100% standard compliant
> parallel application.
>
> ++
> | task 0:|
> ++
> | MPI_Init() |
> | sleep(3000)|
> | for( msg = 0; msg < 5000; msg++ ) {|
> |   for( peer = 0; peer < com_size; peer++ ) {   |
> | MPI_Recv( ..., from = peer, tag = (5000 - msg),... );  |
> |   }|
> | }  |
> ++
>
> ++
> |   task 1 to com_size:  |
> ++
> | MPI_Init() |
> | for( msg = 0; msg < 5000; msg++ ) {|
> |   MPI_Send( ..., 0, tag = msg, ... );  |
> | }  |
> ++
>
> Isn't it the case that the flow control will stop the application from
> running to completion? It's easy to write an application that breaks a
> particular MPI implementation. It doesn't necessarily make this
> implementation non standard compliant.
>
> george.
>
> On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:
>
> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> >
> > To me, the MPI standard is clear that a program like this:
> >
> > task 0:
> > MPI_Init
> > sleep(3000);
> > start receiving messages
> >
> > each of tasks 1 to n-1:
> > MPI_Init
> > loop 5000 times
> > MPI_Send(small message to 0)
> > end loop
> >
> > May send some small messages eagerly if there is space at task 0 but
> > must block each task 1 to n-1 before allowing task 0 to run out of
> > eager buffer space. Doing this requires a token or credit management
> > system in which each task has credits for known buffer space at task
> > 0. Each task will send eagerly to task 0 until the sender runs out
> > of credits and then must switch to rendezvous protocol. Tasks 1 to
> > n-1 might each do 3 MPI_Sends or 300 MPI_Sends before blocking,
> > depending on how much buffer space there is at task 0 but they would
> > need to block in some MPI_Send before task 0 blows up.
> >
> > When task 0 wakes up and begins receiving the early arrivals, tasks
> > 1 to n-1 will unblock and resume looping.. Allowing the user to shut
> > off eager protocol by setting eager size to 0 does 

Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Sacerdoti, Federico
To keep this out of the weeds, I have attached a program called "bug3"
that illustrates this problem on openmpi 1.2.5 using the openib BTL. In
bug3, the process with rank 0 uses all available memory buffering
"unexpected" messages from its neighbors.

Bug3 is a test-case derived from a real, scalable application (desmond
for molecular dynamics) that several experienced MPI developers have
worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
openmpi silently sends them in the background and overwhelms process 0
due to lack of flow control.

It may not be hard to change desmond to work around openmpi's small
message semantics, but a programmer should reasonably be allowed to
think a blocking send will block if the receiver cannot handle it yet.

Federico

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Brightwell, Ronald
Sent: Monday, February 04, 2008 3:30 PM
To: Patrick Geoffray
Cc: Open MPI Users
Subject: Re: [OMPI users] openmpi credits for eager messages

> > I'm looking at a network where the number of endpoints is large enough
> > that everybody can't have a credit to start with, and the "offender"
> > isn't any single process, but rather a combination of processes doing
> > N-to-1 where N is sufficiently large.  I can't just tell one process to
> > slow down.  I have to tell them all to slow down and do it quickly...
> 
> When you have N->1 patterns, then the hardware flow-control will
> throttle the senders, or drop packets if there is no hardware
> flow-control. If you don't have HOL blocking but the receiver does not
> consume for any reason (busy, sleeping, dead, whatever), then you can
> still drop packets on the receiver (NIC, driver, thread) as a last
> resort; this is what TCP does. The key is to have exponential backoff (or
> a reasonably large resend timeout) so as not to continue the hammering.
> 
> It costs nothing in the common case (unlike the credits approach), but
> it does handle corner cases without affecting too much other nodes
> (unlike hardware flow-control).

Right.  For a sufficiently large number of endpoints, flow control has to get
pushed out of MPI and down into the network, which is why I don't necessarily
want an MPI that does flow control at the user-level.

> 
> But you know all that. You are just being mean to your users because
you
> can :-) The sick part is that I think I envy you...

You know it :)

-Ron


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


bug3.c
Description: bug3.c


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Brightwell, Ronald
> > I'm looking at a network where the number of endpoints is large enough that
> > everybody can't have a credit to start with, and the "offender" isn't any
> > single process, but rather a combination of processes doing N-to-1 where N
> > is sufficiently large.  I can't just tell one process to slow down.  I have
> > to tell them all to slow down and do it quickly...
> 
> When you have N->1 patterns, then the hardware flow-control will
> throttle the senders, or drop packets if there is no hardware
> flow-control. If you don't have HOL blocking but the receiver does not
> consume for any reason (busy, sleeping, dead, whatever), then you can
> still drop packets on the receiver (NIC, driver, thread) as a last
> resort; this is what TCP does. The key is to have exponential backoff (or a
> reasonably large resend timeout) so as not to continue the hammering.
> 
> It costs nothing in the common case (unlike the credits approach), but
> it does handle corner cases without affecting too much other nodes
> (unlike hardware flow-control).

Right.  For a sufficiently large number of endpoints, flow control has to get
pushed out of MPI and down into the network, which is why I don't necessarily
want an MPI that does flow control at the user-level.

> 
> But you know all that. You are just being mean to your users because you
> can :-) The sick part is that I think I envy you...

You know it :)

-Ron




Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Patrick Geoffray

Brightwell, Ronald wrote:

I'm looking at a network where the number of endpoints is large enough that
everybody can't have a credit to start with, and the "offender" isn't any
single process, but rather a combination of processes doing N-to-1 where N
is sufficiently large.  I can't just tell one process to slow down.  I have
to tell them all to slow down and do it quickly...


When you have N->1 patterns, then the hardware flow-control will
throttle the senders, or drop packets if there is no hardware
flow-control. If you don't have HOL blocking but the receiver does not
consume for any reason (busy, sleeping, dead, whatever), then you can
still drop packets on the receiver (NIC, driver, thread) as a last
resort; this is what TCP does. The key is to have exponential backoff (or a
reasonably large resend timeout) so as not to continue the hammering.
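
A generic sketch of that resend policy (illustrative names only, not Open
MPI's or anyone else's transport code) might look like:

#include <unistd.h>

typedef struct packet packet_t;
int try_send(packet_t *pkt);               /* hypothetical: nonzero on success */

int send_with_backoff(packet_t *pkt, int max_tries)
{
    unsigned timeout_us = 100;             /* initial resend timeout */
    int attempt;

    for (attempt = 0; attempt < max_tries; attempt++) {
        if (try_send(pkt))
            return 0;                      /* delivered / acknowledged */
        usleep(timeout_us);                /* wait before retrying */
        if (timeout_us < 1000000)
            timeout_us *= 2;               /* exponential backoff, capped */
    }
    return -1;                             /* give up and report an error */
}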


It costs nothing in the common case (unlike the credits approach), but 
it does handle corner cases without affecting too much other nodes 
(unlike hardware flow-control).


But you know all that. You are just being mean to your users because you 
can :-) The sick part is that I think I envy you...


Patrick


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Gleb Natapov
On Mon, Feb 04, 2008 at 02:54:46PM -0500, Richard Treumann wrote:
> In my example, each sender task 1 to n-1 will have one rendezvous message
> to task 0 at a time.  The MPI standard suggests descriptors be small enough
> and  there be enough descriptor space for reasonable programs . The
> standard is clear that unreasonable programs can run out of space and fail.
> The standard does not try to quantify reasonableness.
You are right about your example, but I was not talking specifically about it.
Your example should work with Open MPI over IB/TCP because while rank 0 sleeps
without calling progress, transport layer flow control should throttle senders.
(SM doesn't have flow control; that is why it fails.) What I was trying to say
is that in MPI a process can't fully control its resource usage.

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Brian W. Barrett

Now that this discussion has gone way off into the MPI standard woods :).

Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)? 
There was definitely a bug in 1.2.4 that could cause exactly the behavior 
you are describing when using the shared memory BTL, due to a silly 
delayed initialization bug/optimization.


If you are using the OB1 PML (the default), you will still have the 
possibility of running the receiver out of memory if the unexpected queue 
grows without bounds.  I'll withhold my opinion on what the standard says 
so that we can perhaps actually help you solve your problem and stay out 
of the weeds :).  Note however, that in general unexpected messages are a 
bad idea and thousands of them from one peer to another should be avoided 
at all costs -- this is just good MPI programming practice.


Now, if you are using MX, you can replicate MPICH/MX's behavior (including 
the very slow part) by using the CM PML (--mca pml cm on the mpirun 
command line), which will use the MX library message matching and 
unexpected queue and therefore behave exactly like MPICH/MX.
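
For example (assuming the test binary is called bug3):

   mpirun --mca pml cm -np 16 ./bug3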


Brian


On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote:


That would make sense. I was able to break OpenMPI by having Node A wait for
messages from Node B. Node B is in fact sleeping while Node C bombards
Node A with a few thousand messages. After a while Node B wakes up and
sends Node A the message it's been waiting on, but Node A has long since
been buried and seg faults. If I decrease the number of messages C is
sending, it works properly. This was on OpenMPI 1.2.4 (using I think the
SM BTL (might have been MX or TCP, but certainly not infiniband. I could
dig up the test and try again if anyone is seriously curious).

Trying the same test on MPICH/MX went very very slow (I don't think they
have any clever buffer management) but it didn't crash.

Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
|openmpi-users/Allow| wrote:

Hi,

I am readying an openmpi 1.2.5 software stack for use with a
many-thousand core cluster. I have a question about sending small
messages that I hope can be answered on this list.

I was under the impression that if node A wants to send a small MPI
message to node B, it must have a credit to do so. The credit assures A
that B has enough buffer space to accept the message. Credits are
required by the mpi layer regardless of the BTL transport layer used.

I have been told by a Voltaire tech that this is not so, the credits are
used by the infiniband transport layer to reliably send a message, and
is not an openmpi feature.

Thanks,
Federico

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Richard Treumann

Gleb

In my example, each sender task 1 to n-1 will have one rendezvous message
to task 0 at a time.  The MPI standard suggests descriptors be small enough
and there be enough descriptor space for reasonable programs.  The
standard is clear that unreasonable programs can run out of space and fail.
The standard does not try to quantify reasonableness.

This gets really interesting when we talk about hundreds of thousands of
tasks.  If on a general purpose MPI there are 16 tasks and task 0 cannot
hold 1 envelope from each of the other 15, it is probably a poor quality
MPI.  If there are a million tasks and task 0 can only hold 100,000
envelopes then it is fair to argue that holding 100,000 envelopes is generous
and the million task job is not being reasonable.  This little example
could be reasonable for small task counts and unreasonable for huge task
counts.

If there are 2 tasks and and the single sender posts 15 MPI_ISENDs to task
0, a quality MPI should probably handle that too.  If the sender tries to
post a million MPI_ISENDs and either sender or receiver run out of
descriptor space after 100,000 it is again fair to fail the job and argue
the program is not being reasonable.  The line between reasonable and
unreasonable application behavior is not a bright, sharp line.

A big part of my reason for prodding this is that I think it is better to
have the MPI Forum discuss changes to the standard than to have MPI
implementors deciding what parts to ignore.  If the MPI Forum does bless a
mode that allows my example to crash, IBM MPI will support that mode and
some of our users will choose to run in that mode.  If their applications
are "well structured" in certain specific ways they will never have a
problem with early arrival overflow.

If the standard is unclear then this is the time to make it clear.

  Dick

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 02:03:20 PM:

> On Mon, Feb 04, 2008 at 09:08:45AM -0500, Richard Treumann wrote:
> > To me, the MPI standard is clear that a program like this:
> >
> > task 0:
> > MPI_Init
> > sleep(3000);
> > start receiving messages
> >
> > each of tasks 1 to n-1:
> > MPI_Init
> > loop 5000 times
> >MPI_Send(small message to 0)
> > end loop
> >
> > May send some small messages eagerly if there is space at task 0 but must
> > block each task 1 to n-1 before allowing task 0 to run out of eager buffer
> > space.  Doing this requires a token or credit management system in which
> > each task has credits for known buffer space at task 0. Each task will send
> > eagerly to task 0 until the sender runs out of credits and then must switch
> > to rendezvous protocol.
> And rendezvous messages are not free either. So this approach will only
> postpone failure a little bit.
>
> --
>  Gleb.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Brightwell, Ronald
> > Not to muddy the point, but if there's enough ambiguity in the Standard
> > for people to ignore the progress rule, then I think (hope) there's enough
> > ambiguity for people to ignore the sender throttling issue too ;)
> 
> I understand your position, and I used to agree until I was forced to
> change my mind by naive users :-)

Right.  That's what I meant by:

  "Most of the vendors aren't allowed to have this perspective".

> 
> Poorly written MPI codes won't likely segfault or deadlock because the
> progress rule was ignored. However, users will proudly tell you that you
> have a memory leak if you don't limit the size of the unexpected queue
> and their codes with no flow control blow up.

Yep.  I don't lose money when I tell these people to go fix their code.  I like
to think that I actually get paid to tell these people to go fix their code.

> 
> You don't have to make it very efficient (per-sender credits
> definitively do not scale), but you need to have a way to stall/slow
> the sender when the unexpected queue gets too big. That's quite easy to
> do without affecting the common case.

Not on my network.  I don't have the nice situation that the Standard refers
to where one producer is overwhelming the consumer.  For a reasonable number
of endpoints and a known offending sender, it's pretty straightforward to
do a user-level credit-based flow control.

I'm looking at a network where the number of endpoints is large enough that
everybody can't have a credit to start with, and the "offender" isn't any
single process, but rather a combination of processes doing N-to-1 where N
is sufficiently large.  I can't just tell one process to slow down.  I have
to tell them all to slow down and do it quickly...
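
For reference, the per-peer scheme referred to above can be sketched as
follows (all names are hypothetical, not Open MPI internals): the sender
spends a credit for every eager message and falls back to rendezvous at
zero, and the receiver returns credits as its eager buffers drain.

struct peer { int credits; };    /* free eager buffers known at the receiver,
                                    initialized to its advertised count */

void eager_send(struct peer *p, const void *buf, int len);       /* hypothetical */
void rendezvous_send(struct peer *p, const void *buf, int len);  /* hypothetical */

void send_small(struct peer *p, const void *buf, int len)
{
    if (p->credits > 0) {
        p->credits--;                      /* consume one receiver buffer */
        eager_send(p, buf, len);
    } else {
        rendezvous_send(p, buf, len);      /* blocks until the receiver matches */
    }
}

/* Called when the receiver acknowledges that n eager buffers were drained. */
void credits_returned(struct peer *p, int n)
{
    p->credits += n;
}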

-Ron




Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Patrick Geoffray

Ron,

Brightwell, Ronald wrote:

Not to muddy the point, but if there's enough ambiguity in the Standard
for people to ignore the progress rule, then I think (hope) there's enough
ambiguity for people to ignore the sender throttling issue too ;)


I understand your position, and I used to agree until I was forced to 
change my mind by naive users :-)


Poorly written MPI codes won't likely segfault or deadlock because the 
progress rule was ignored. However, users will proudly tell you that you 
have a memory leak if you don't limit the size of the unexpected queue 
and their codes with no flow control blow up.


You don't have to make it very efficient (per-sender credits 
definitively do not scale), but you need to have a way to stall/slow
the sender when the unexpected queue gets too big. That's quite easy to 
do without affecting the common case.


Patrick


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Gleb Natapov
On Mon, Feb 04, 2008 at 09:08:45AM -0500, Richard Treumann wrote:
> To me, the MPI standard is clear that a program like this:
> 
> task 0:
> MPI_Init
> sleep(3000);
> start receiving messages
> 
> each of tasks 1 to n-1:
> MPI_Init
> loop 5000 times
>MPI_Send(small message to 0)
> end loop
> 
> May send some small messages eagerly if there is space at task 0 but must
> block each task 1 to  n-1 before allowing task 0 to run out of eager buffer
> space.  Doing this requires a token or credit management system in which
> each task has credits for known buffer space at task 0. Each task will send
> eagerly to task 0 until the sender runs out of credits and then must switch
> to rendezvous protocol.
And rendezvous messages are not free either. So this approach will only
postpone failure a little bit.

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Richard Treumann

Hi Ron -

I am well aware of the scaling problems related to the standard send
requirements in MPI.  It is a very difficult issue.

However, here is what the standard says: MPI 1.2, page 32 lines 29-37

===
a standard send operation that cannot complete because of lack of buffer
space will merely block, waiting for buffer space to become available or
for a matching receive to be posted. This behavior is preferable in many
situations. Consider a situation where a producer repeatedly produces new
values and sends them to a consumer. Assume that the producer produces new
values faster than the consumer can consume them. If buffered sends are
used, then a buffer overflow will result. Additional synchronization has to
be added to the program so as to prevent this from occurring. If standard
sends are used, then the producer will be
automatically throttled, as its send operations will block when buffer
space is unavailable.


If there are people who want to argue that this is unclear or that it
should be changed, the MPI Forum can and should take up the discussion.  I
think this particular wording is pretty clear.

The piece of MPI standard wording you quote is somewhat ambiguous:

The amount
of space available for buffering will be much smaller than program data
memory on many systems. Then, it will be easy to write programs that
overrun available buffer space.

But note that this wording mentions a problem that an application can
create but does not say the MPI implementation can fail the job.  The
language I have pointed to is where the standard says what the MPI
implementation must do.

The "lack of resource" statement is more about send and receive descriptors
than buffer space.  If I write a program with an infinite loop of MPI_IRECV
postings  the standard allows that to fail.


Dick

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 12:24:11 PM:

>
> > Is what George says accurate? If so, it sounds to me like OpenMPI
> > does not comply with the MPI standard on the behavior of eager
> > protocol. MPICH is getting dinged in this discussion because they
> > have complied with the requirements of the MPI standard. IBM MPI
> > also complies with the standard.
> >
> > If there is any debate about whether the MPI standard does (or
> > should) require the behavior I describe below then we should move
> > the discussion to the MPI 2.1 Forum and get a clarification.
> > [...]
>
> The MPI Standard also says the following about resource limitations:
>
>   Any pending communication operation consumes system resources that are
>   limited. Errors may occur when lack of resources prevent the execution
>   of an MPI call. A quality implementation will use a (small) fixed amount
>   of resources for each pending send in the ready or synchronous mode and
>   for each pending receive. However, buffer space may be consumed to store
>   messages sent in standard mode, and must be consumed to store messages
>   sent in buffered mode, when no matching receive is available. The amount
>   of space available for buffering will be much smaller than program data
>   memory on many systems. Then, it will be easy to write programs that
>   overrun available buffer space.
>
> Since I work on MPI implementations that are expected to allow applications
> to scale to tens of thousands of processes, I don't want the overhead of
> a user-level flow control protocol that penalizes scalable applications in
> favor of non-scalable ones.  I also don't want an MPI implementation that
> hides such non-scalable application behavior, but rather exposes it at lower
> processor counts -- preferably in a way that makes the application developer
> aware of the resource requirements of their code and allows them to make
> the appropriate choice regarding the structure of their code, the underlying
> protocols, and the amount of buffer resources.
>
> But I work in a place where codes are expected to scale and not just work.
> Most of the vendors aren't allowed to have this perspective.
>
> -Ron
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Ron Brightwell

> Is what George says accurate? If so, it sounds to me like OpenMPI
> does not comply with the MPI standard on the behavior of eager
> protocol. MPICH is getting dinged in this discussion because they
> have complied with the requirements of the MPI standard. IBM MPI
> also complies with the standard.
> 
> If there is any debate about whether the MPI standard does (or
> should) require the behavior I describe below then we should move
> the discussion to the MPI 2.1 Forum and get a clarification.
> [...]

The MPI Standard also says the following about resource limitations:

  Any pending communication operation consumes system resources that are
  limited. Errors may occur when lack of resources prevent the execution
  of an MPI call. A quality implementation will use a (small) fixed amount
  of resources for each pending send in the ready or synchronous mode and
  for each pending receive. However, buffer space may be consumed to store
  messages sent in standard mode, and must be consumed to store messages
  sent in buffered mode, when no matching receive is available. The amount
  of space available for buffering will be much smaller than program data
  memory on many systems. Then, it will be easy to write programs that
  overrun available buffer space.

Since I work on MPI implementations that are expected to allow applications
to scale to tens of thousands of processes, I don't want the overhead of
a user-level flow control protocol that penalizes scalable applications in
favor of non-scalable ones.  I also don't want an MPI implementation that
hides such non-scalable application behavior, but rather exposes it at lower
processor counts -- preferably in a way that makes the application developer
aware of the resource requirements of their code and allows them to make
the appropriate choice regarding the structure of their code, the underlying
protocols, and the amount of buffer resources.

But I work in a place where codes are expected to scale and not just work.
Most of the vendors aren't allowed to have this perspective.

-Ron




Re: [OMPI users] openmpi credits for eager messages

2008-02-01 Thread 8mj6tc902
That would make sense. I was able to break OpenMPI by having Node A wait for
messages from Node B. Node B is in fact sleeping while Node C bombards
Node A with a few thousand messages. After a while Node B wakes up and
sends Node A the message it's been waiting on, but Node A has long since
been buried and seg faults. If I decrease the number of messages C is
sending, it works properly. This was on OpenMPI 1.2.4 (using, I think, the
SM BTL; might have been MX or TCP, but certainly not InfiniBand). I could
dig up the test and try again if anyone is seriously curious.

Trying the same test on MPICH/MX went very very slow (I don't think they
have any clever buffer management) but it didn't crash.

Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
|openmpi-users/Allow| wrote:
> Hi,
> 
> I am readying an openmpi 1.2.5 software stack for use with a
> many-thousand core cluster. I have a question about sending small
> messages that I hope can be answered on this list. 
> 
> I was under the impression that if node A wants to send a small MPI
> message to node B, it must have a credit to do so. The credit assures A
> that B has enough buffer space to accept the message. Credits are
> required by the mpi layer regardless of the BTL transport layer used.
> 
> I have been told by a Voltaire tech that this is not so, the credits are
> used by the infiniband transport layer to reliably send a message, and
> is not an openmpi feature.
> 
> Thanks,
> Federico
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
--Kris

叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]


Re: [OMPI users] openmpi credits for eager messages

2008-02-01 Thread George Bosilca
The Voltaire tech was right. There is no credit management at the
upper level; each BTL is allowed to do it (if needed). This doesn't
mean the transport is not reliable. Most of the devices have internal
flow control, and Open MPI relies on it instead of implementing our own.
However, the devices that do not provide such a feature in their low level
drivers or hardware have it implemented at the BTL layer. As an
example, infiniband has a flow control mechanism implemented in the
BTL.


  george.

On Feb 1, 2008, at 3:05 PM, Sacerdoti, Federico wrote:


Hi,

I am readying an openmpi 1.2.5 software stack for use with a
many-thousand core cluster. I have a question about sending small
messages that I hope can be answered on this list.

I was under the impression that if node A wants to send a small MPI
message to node B, it must have a credit to do so. The credit assures A
that B has enough buffer space to accept the message. Credits are
required by the mpi layer regardless of the BTL transport layer used.

I have been told by a Voltaire tech that this is not so, the credits are
used by the infiniband transport layer to reliably send a message, and
is not an openmpi feature.

Thanks,
Federico

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






[OMPI users] openmpi credits for eager messages

2008-02-01 Thread Sacerdoti, Federico
Hi,

I am readying an openmpi 1.2.5 software stack for use with a
many-thousand core cluster. I have a question about sending small
messages that I hope can be answered on this list. 

I was under the impression that if node A wants to send a small MPI
message to node B, it must have a credit to do so. The credit assures A
that B has enough buffer space to accept the message. Credits are
required by the mpi layer regardless of the BTL transport layer used.

I have been told by a Voltaire tech that this is not so, the credits are
used by the infiniband transport layer to reliably send a message, and
is not an openmpi feature.

Thanks,
Federico