Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Richard Treumann

Ron's comments are probably dead on for an application like bug3.

If bug3 is long running and libmpi is doing eager-protocol buffer
management, as I contend the standard requires, then the producers will not
get far ahead of the consumer before they are forced into synchronous send
under the covers anyway.  From then on, producers will run no faster than
their output can be absorbed.  They will spend the nonproductive parts of
their time blocked in either MPI_Send or MPI_Ssend.  The job will not
finish until the consumer finishes, because the consumer is a constant
bottleneck anyway.  The slow consumer is the major drag on scalability. As
long as the producers can be expected to outrun the consumer for the life
of the job, you will probably find it hard to measure a difference between
synchronous send and flow-controlled standard send.

Eager protocol gets more interesting when the pace of the consumer and of
the producers is variable.  If the consumer can absorb a message per
millisecond and the production rate is close to one message per millisecond
but fluctuates a bit then eager protocol may speed the whole job
significantly. The producers can never get ahead with synchronous send even
in a phase when they might be able to create a message every 1/2
millisecond. The producers spend half this easy phase blocked in MPI_Ssend.
If producers now enter a compute intensive phase where messages can only be
generated once every 2 milliseconds the consumer spends time idle.  If the
consumer had been able to accumulate queued messages with eager protocol
when the producers were able to run faster it could now make itself useful
catching up.

Both producers and consumer would come closer to 100% productive work and
the job would finish sooner.
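
As a minimal sketch (not bug3 itself; the message count, length, and tags
are invented for illustration), the producer/consumer pattern under
discussion looks roughly like this in C:

/* Rank 0 is the slow consumer; every other rank is a producer. */
#include <mpi.h>

#define NMSGS 1000
#define LEN   64

int main(int argc, char *argv[]) {
  int rank, nprocs, i;
  double buf[LEN] = {0};

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  if (rank == 0) {                                  /* consumer */
    for (i = 0; i < NMSGS * (nprocs - 1); i++)
      MPI_Recv(buf, LEN, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  } else {                                          /* producers */
    for (i = 0; i < NMSGS; i++) {
      /* ... produce the next message into buf ... */
      MPI_Send(buf, LEN, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      /* With MPI_Ssend() here instead, a producer can never get ahead
         of the consumer; with MPI_Send() any lead it builds up is
         bounded by whatever eager buffering libmpi provides. */
    }
  }
  MPI_Finalize();
  return 0;
}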

   Dick


Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/05/2008 01:26:24 PM:

> > Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> > reasonable memory usage and the execution proceeds normally.
> >
> > Re scalable: One second. I know well bug3 is not scalable, and when to
> > use MPI_Isend. The point is programmers want to count on the MPI spec as
> > written, as Richard pointed out. We want to send small messages quickly
> > and efficiently, without the danger of overloading the receiver's
> > resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().
>
> Your last statement is not necessarily true.  By synchronizing processes
> using MPI_Ssend(), you can potentially avoid large numbers of unexpected
> messages that need to be buffered and copied, and that also need to be
> searched every time a receive is posted.  There is no guarantee that the
> protocol overhead on each message incurred with MPI_Ssend() slows down an
> application more than the buffering, copying, and searching overhead of a
> large number of unexpected messages.
>
> It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong
> micro-benchmarks, but the length of the unexpected message queue doesn't
> have to get very long before they are about the same.
>
> >
> > Since identifying this behavior we have implemented the desired flow
> > control in our application.
>
> It would be interesting to see performance results comparing doing flow
> control in the application versus having MPI do it for you.
>
> -Ron
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Brightwell, Ronald
> Re: MPI_Ssend(). This indeed fixes bug3, the process at rank 0 has
> reasonable memory usage and the execution proceeds normally.
> 
> Re scalable: One second. I know well bug3 is not scalable, and when to
> use MPI_Isend. The point is programmers want to count on the MPI spec as
> written, as Richard pointed out. We want to send small messages quickly
> and efficiently, without the danger of overloading the receiver's
> resources. We can use MPI_Ssend() but it is slow compared to MPI_Send().

Your last statement is not necessarily true.  By synchronizing processes
using MPI_Ssend(), you can potentially avoid large numbers of unexpected
messages that need to be buffered and copied, and that also need to be
searched every time a receive is posted.  There is no guarantee that the
protocol overhead on each message incurred with MPI_Ssend() slows down an
application more than the buffering, copying, and searching overhead of a
large number of unexpected messages.

It is true that MPI_Ssend() is slower than MPI_Send() for ping-pong
micro-benchmarks, but the length of the unexpected message queue doesn't
have to get very long before they are about the same.

> 
> Since identifying this behavior we have implemented the desired flow
> control in our application.

It would be interesting to see performance results comparing doing flow
control in the application versus having MPI do it for you.
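
As one illustration of what flow control in the application can look like,
here is a hedged sketch of a sender-side credit scheme (the credit count,
tags, and message sizes are invented, not taken from the application under
discussion):

/* Each producer may have at most CREDITS sends outstanding and waits for
   an acknowledgement from rank 0 before sending more, which bounds the
   number of unexpected messages the consumer can accumulate per peer. */
#include <mpi.h>
#include <stdlib.h>

#define CREDITS  10
#define NMSGS    100
#define LEN      64
#define TAG_DATA 1
#define TAG_ACK  2

int main(int argc, char *argv[]) {
  int rank, nprocs, i;
  double buf[LEN] = {0};
  MPI_Status st;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  if (rank == 0) {                            /* consumer: hands credits back */
    int *seen = calloc(nprocs, sizeof(int));
    int grant = CREDITS, src;
    for (i = 0; i < NMSGS * (nprocs - 1); i++) {
      MPI_Recv(buf, LEN, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_DATA,
               MPI_COMM_WORLD, &st);
      src = st.MPI_SOURCE;
      /* after every CREDITS messages from a source, return credits to it,
         unless that source has already sent everything it will send */
      if (++seen[src] % CREDITS == 0 && seen[src] < NMSGS)
        MPI_Send(&grant, 1, MPI_INT, src, TAG_ACK, MPI_COMM_WORLD);
    }
    free(seen);
  } else {                                    /* producer: respects its credits */
    int credits = CREDITS, grant;
    for (i = 0; i < NMSGS; i++) {
      if (credits == 0) {                     /* wait until the consumer catches up */
        MPI_Recv(&grant, 1, MPI_INT, 0, TAG_ACK, MPI_COMM_WORLD, &st);
        credits += grant;
      }
      MPI_Send(buf, LEN, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD);
      credits--;
    }
  }
  MPI_Finalize();
  return 0;
}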

-Ron




Re: [OMPI users] mpirun, paths and xterm again

2008-02-05 Thread Tim Prins

Jody,

jody wrote:

Hi Tim


Your desktop is plankton, and you want
to run a job on both plankton and nano, and have xterms show up on nano.


Not on nano, but on plankton, but I think this was just a typo :)

Correct.


It looks like you are already doing this, but to make sure, the way I
would use xhost is:
plankton$ xhost +nano_00
plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0
xterm -hold -e ../MPITest

This gives me 2 lines of
  xterm Xt error: Can't open display: plankton:0.0


Can you try running:
plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv

This yields
DISPLAY=plankton:0.0





just to make sure the environment variable is being properly set.

You might also try:
in terminal 1:
plankton$ xhost +nano_00

in terminal 2:
plankton$ ssh -x nano_00
nano_00$ export DISPLAY="plankton:0.0"
nano_00$ xterm


This experiment also gives
xterm Xt error: Can't open display: plankton:0.0


This will ssh into nano, disabling ssh X forwarding, and try to launch
an xterm. If this does not work, then something is wrong with your X
setup. If it does work, it should work with Open MPI as well.


So I guess something is wrong with my X setup.
I wonder what it could be ...


So this is an X issue, not an Open MPI issue then. I do not know enough 
about X setup to help here...




Doing the same with X11 forwarding works perfectly.
But why is X11 forwarding bad?  Or, to ask it differently,
does Open MPI make the ssh connection in such a way
that X11 forwarding is disabled?


What Open MPI does is it uses ssh to launch a daemon on a remote node, 
then it disconnects the ssh session. This is done to prevent running out 
of resources at scale. We then send a message to the daemon to launch 
the client application. So we are not doing anything to prevent ssh X11 
forwarding; it is just that by the time the application is launched, the ssh 
sessions are no longer around.


There is a way to force the ssh sessions to stay open. However, doing so 
will result in a bunch of excess debug output. If you add 
"--debug-daemons" to the mpirun command line, the ssh connections should 
stay open.


Hope this helps,

Tim


Re: [OMPI users] mpirun, paths and xterm again

2008-02-05 Thread jody
Hi Tim

> Your desktop is plankton, and you want
> to run a job on both plankton and nano, and have xterms show up on nano.

Not on nano, but on plankton, but I think this was just a typo :)

> It looks like you are already doing this, but to make sure, the way I
> would use xhost is:
> plankton$ xhost +nano_00
> plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0
> xterm -hold -e ../MPITest
This gives me 2 lines of
  xterm Xt error: Can't open display: plankton:0.0

>
> Can you try running:
> plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv
This yields
DISPLAY=plankton:0.0
OMPI_MCA_orte_precondition_transports=4a0f9ccb4c13cd0e-6255330fbb0289f9
OMPI_MCA_rds=proxy
OMPI_MCA_ras=proxy
OMPI_MCA_rmaps=proxy
OMPI_MCA_pls=proxy
OMPI_MCA_rmgr=proxy
SHELL=/bin/bash
SSH_CLIENT=130.60.49.141 59524 22
USER=jody
LD_LIBRARY_PATH=/opt/openmpi/lib
SSH_AUTH_SOCK=/tmp/ssh-enOzt24653/agent.24653
MAIL=/var/mail/jody
PATH=/opt/openmpi/bin:/usr/local/bin:/bin:/usr/bin
PWD=/home/jody
SHLVL=1
HOME=/home/jody
LOGNAME=jody
SSH_CONNECTION=130.60.49.141 59524 130.60.49.128 22
_=/opt/openmpi/bin/orted
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_mpi_paffinity_processor=0
OMPI_MCA_universe=j...@aim-plankton.unizh.ch:default-universe-10265
OMPI_MCA_ns_replica_uri=0.0.0;tcp://130.60.49.141:50310
OMPI_MCA_gpr_replica_uri=0.0.0;tcp://130.60.49.141:50310
OMPI_MCA_orte_app_num=0
OMPI_MCA_orte_base_nodename=nano_00
OMPI_MCA_ns_nds=env
OMPI_MCA_ns_nds_cellid=0
OMPI_MCA_ns_nds_jobid=1
OMPI_MCA_ns_nds_vpid=0
OMPI_MCA_ns_nds_vpid_start=0
OMPI_MCA_ns_nds_num_procs=1


>
> just to make sure the environment variable is being properly set.
>
> You might also try:
> in terminal 1:
> plankton$ xhost +nano_00
>
> in terminal 2:
> plankton$ ssh -x nano_00
> nano_00$ export DISPLAY="plankton:0.0"
> nano_00$ xterm
>
This experiment also gives
xterm Xt error: Can't open display: plankton:0.0

> This will ssh into nano, disabling ssh X forwarding, and try to launch
> an xterm. If this does not work, then something is wrong with your X
> setup. If it does work, it should work with Open MPI as well.
>
So I guess something is wrong with my X setup.
I wonder what it could be ...
Doing the same with X11 forwarding works perfectly.
But why is X11 forwarding bad?  Or, to ask it differently,
does Open MPI make the ssh connection in such a way
that X11 forwarding is disabled?

Thank you
  Jody


Re: [OMPI users] mpirun, paths and xterm again

2008-02-05 Thread Tim Prins

Hi Jody,

Just to make sure I understand. Your desktop is plankton, and you want 
to run a job on both plankton and nano, and have xterms show up on nano.


It looks like you are already doing this, but to make sure, the way I 
would use xhost is:

plankton$ xhost +nano_00
plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0 
xterm -hold -e ../MPITest


Can you try running:
plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv

just to make sure the environment variable is being properly set.

You might also try:
in terminal 1:
plankton$ xhost +nano_00

in terminal 2:
plankton$ ssh -x nano_00
nano_00$ export DISPLAY="plankton:0.0"
nano_00$ xterm

This will ssh into nano, disabling ssh X forwarding, and try to launch 
an xterm. If this does not work, then something is wrong with your X 
setup. If it does work, it should work with Open MPI as well.


For your second question: I'm not sure why there would be a difference 
in finding the shared libraries in gdb vs. with the xterm.


Tim

jody wrote:

Hi
Sorry to bring this subject up again -
but I have a problem getting xterms
running for all of my processes (for debugging purposes).
There are actually two problems involved:
display, and paths.


my ssh is set up so that X forwarding is allowed,
and, indeed,
  ssh nano_00 xterm
opens an xterm from the remote machine nano_00.

When I run my program normally, it works OK:
 [jody]:/mnt/data1/neander:$mpirun -np 4 --hostfile testhosts ./MPITest
[aim-plankton.unizh.ch]I am #0/4 global
[aim-plankton.unizh.ch]I am #1/4 global
[aim-nano_00]I am #2/4 global
[aim-nano_00]I am #3/4 global

But when I try to see it in xterms:
[jody]:/mnt/data1/neander:$mpirun -np 4 --hostfile testhosts -x
DISPLAY xterm -hold -e  ./MPITest
xterm Xt error: Can't open display: :0.0
xterm Xt error: Can't open display: :0.0

(the same happens if I set DISPLAY=plankton:0.0 or if I use plankton's
IP address;
and xhost is enabled for nano_00)

the other two (the "local") xterms open, but they display the message:
 ./MPITest: error while loading shared libraries: libmpi_cxx.so.0:
cannot open shared object file: No such file or directory
(This also happens if i only have local processes)

So my first question is: what do I do to enable nano_00 to display an xterm
on plankton? Using normal ssh there seems to be no problem.

Second question: why does the use of xterm "hide" the open-mpi libs?
Interestingly, if I use xterm with gdb to start my application, it works.

Any ideas?

Thank you
  Jody
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Richard Treumann

So with an Isend your program becomes valid MPI and a very nice
illustration of why the MPI standard cannot limit envelopes (or send/recv
descriptors) and why at some point the number of descriptors can blow past
the limits. It also illustrates how the management of eager messages remains
workable. (Not the same as affordable or appropriate; I agree it has
serious scaling issues.) Let's assume there is managed early-arrival space
for 10 messages per sender.

Each MPI_Isend generates an envelope that goes to the destination. For your
program to unwind properly, every envelope must be delivered to the
destination.  The first (blocking) MPI_Recv is looking for the tag in the
last envelope, so if libmpi does not deliver all 5000 envelopes per sender,
the first MPI_Recv will block forever.  It is not acceptable for a valid
MPI program to deadlock.  If the destination cannot hold all the envelopes,
there is no choice but to fail the job. The standard allows this. The Forum
considered it better to fail a job than to deadlock it.

If each sender sends its first 10 messages eagerly, the send-side tokens
will be used up and the buffer space at the destination will fill up but
not overflow.  The senders now fall back to rendezvous for their remaining
4990 MPI_Isends. The MPI_Isends cannot block.  They send envelopes as fast
as the loop can run, but the user send buffers involved cannot be altered
until the waits occur.  Once the last sent envelope with tag 5000 arrives
and matches the posted MPI_Recv, an OK_to_send goes back to the sender and
the data can be moved from the still-intact send buffer to the waiting
receive buffer.  The MPI_Waits for the Isend requests can be done in any
order, but no send buffer can be changed until the corresponding MPI_Wait
returns. No system buffer is needed for message data.

The MPI_Recvs, being posted in reverse order (5000, 4999, ..., 11), each ship
an OK_to_send, and data flows directly from send to recv buffers.  Finally,
the MPI_Recvs for tags (10 ... 1) get posted and pull their message data from
the early-arrival space. The program has unwound correctly and, as the
early-arrival space frees up, credits can be returned to the sender.

Performance discussions aside, the semantics are clean and reliable.
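
For concreteness, a rough sketch of the pattern being described (the 5000
sends and the reverse-order receives follow the example above; the message
length is invented, and MPI_Waitall stands in for the individual MPI_Waits):

/* Each sender posts 5000 MPI_Isends with tags 1..5000; the destination
   posts its MPI_Recvs in reverse tag order. Error checking omitted. */
#include <mpi.h>
#include <stdlib.h>

#define NMSGS 5000
#define LEN   8

int main(int argc, char *argv[]) {
  int rank, nprocs, tag, src;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  if (rank == 0) {                             /* destination */
    double rbuf[LEN];
    for (src = 1; src < nprocs; src++)
      for (tag = NMSGS; tag >= 1; tag--)       /* posted in reverse order */
        MPI_Recv(rbuf, LEN, MPI_DOUBLE, src, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  } else {                                     /* senders */
    double (*sbuf)[LEN] = calloc(NMSGS, sizeof *sbuf);
    MPI_Request *req = malloc(NMSGS * sizeof *req);
    for (tag = 1; tag <= NMSGS; tag++)
      /* none of these may block; each send buffer must stay intact
         until the corresponding wait completes */
      MPI_Isend(sbuf[tag - 1], LEN, MPI_DOUBLE, 0, tag,
                MPI_COMM_WORLD, &req[tag - 1]);
    MPI_Waitall(NMSGS, req, MPI_STATUSES_IGNORE);
    free(req);
    free(sbuf);
  }
  MPI_Finalize();
  return 0;
}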

  Thanks - Dick

PS - If anyone responds to this I hope you will state clearly whether you
want to talk about:

- What does the standard require?
or
- What should the standard require?

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/04/2008 06:04:22 PM:

> Richard,
>
> You're absolutely right. What a shame :) If I had spent less time
> drawing the boxes around the code I might have noticed the typo. The
> Send should be an Isend.
>
>george.
>
> On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:
>
> > Hi George
> >
> > Sorry - This is not a valid MPI program. It violates the requirement
> > that a program not depend on there being any system buffering. See
> > pages 32-33 of MPI 1.1.
> >
> > Let's simplify to:
> > Task 0:
> > MPI_Recv( from 1 with tag 1)
> > MPI_Recv( from 1 with tag 0)
> >
> > Task 1:
> > MPI_Send(to 0 with tag 0)
> > MPI_Send(to 0 with tag 1)
> >
> > Without any early arrival buffer (or with eager size set to 0) task
> > 0 will hang in the first MPI_Recv and never post a recv with tag 0.
> > Task 1 will hang in the MPI_Send with tag 0 because it cannot get
> > past it until the matching recv is posted by task 0.
> >
> > If there is enough early arrival buffer for the first MPI_Send on
> > task 1 to complete and the second MPI_Send to be posted the example
> > will run. Once both sends are posted by task 1, task 0 will harvest
> > the second send and get out of its first recv. Task 0's second recv
> > can now pick up the message from the early arrival buffer where it
> > had to go to let task 1 complete send 1 and post send 2.
> >
> > If an application wants to do this kind of order inversion it should
> > use some non-blocking operations. For example, if task 0 posted an
> > MPI_Irecv for tag 1, an MPI_Recv for tag 0, and lastly an MPI_Wait
> > for the Irecv, the example is valid.
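
A minimal sketch of that valid reordering (the int payloads are invented;
only the ordering of the calls matters here):

#include <mpi.h>

int main(int argc, char *argv[]) {
  int rank, x = 0, y = 0;
  MPI_Request req;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    /* post the tag-1 receive non-blocking first, then block on tag 0 */
    MPI_Irecv(&y, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
    MPI_Recv(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  } else if (rank == 1) {
    MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    MPI_Send(&y, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

This version no longer depends on any system buffering: the tag-0 send can
always be matched, regardless of the eager limit.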
> >
> > I am not aware of any case where the standard allows a correct MPI
> > program to be deadlocked by an implementation limit. It can be
> > failed if it exceeds a limit but I do not think it is ever OK to hang.
> >
> > Dick
> >
> > Dick Treumann - MPI Team/TCEM
> > IBM Systems & Technology Group
> > Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > Tele (845) 433-7846 Fax (845) 433-8363
> >
> >
> > users-boun...@open-mpi.org wrote on 02/04/2008 04:41:21 PM:
> >
> > > Please allow me to slightly modify your example. It still follows the
> > > rules from the MPI standard, so I think it's a 100% standard-compliant
> > > parallel application.
> > >
> > > 

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread 8mj6tc902
Wow, this sparked a much more heated discussion than I was expecting. I
was just commenting that the behaviour the original author (Federico
Sacerdoti) mentioned would explain something I observed in one of my
early trials of OpenMPI. But anyway, because it seems that quite a few
people were interested, I've attached a simplified version of the test I
was describing (with all the timing checks and some of the crazier
output removed).

Now that I go back and retest this, it turns out that it wasn't actually
a segfault that was killing it, but running out of memory as you and
others have predicted.

Brian W. Barrett brbarret-at-open-mpi.org wrote:
> Now that this discussion has gone way off into the MPI standard woods :).
> 
> Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)? 
> There was definitely a bug in 1.2.4 that could cause exactly the behavior 
> you are describing when using the shared memory BTL, due to a silly 
> delayed initialization bug/optimization.

I'm still using Open MPI 1.2.4 and actually the SM BTL seems to be the
hardest to break (I guess I'm dodging the bullet on that delayed
initialization bug you're referring to).

> If you are using the OB1 PML (the default), you will still have the 
> possibility of running the receiver out of memory if the unexpected queue 
> grows without bounds.  I'll withhold my opinion on what the standard says 
> so that we can perhaps actually help you solve your problem and stay out 
> of the weeds :).  Note, however, that in general unexpected messages are a 
> bad idea and thousands of them from one peer to another should be avoided 
> at all costs -- this is just good MPI programming practice.

Actually I was expecting to break something with this test. I just
wanted to find out where it broke. Lesson learned: I wrote my more
serious programs doing exactly that (no unexpected messages). I was just
surprised that the default Open MPI settings allowed me to flood the
system so easily, whereas MPICH/MX still finished no matter what I threw
at it (albeit with terrible performance in the bad cases).

> Now, if you are using MX, you can replicate MPICH/MX's behavior (including 
> the very slow part) by using the CM PML (--mca pml cm on the mpirun 
> command line), which will use the MX library message matching and 
> unexpected queue and therefore behave exactly like MPICH/MX.

That works exactly as you described, and it does indeed prevent memory
usage from going wild due to the unexpected messages.

Thanks for your help! (and to the others for the educational discussion!)

> 
> Brian
> 
> 
> On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote:
> 
>> That would make sense. I was able to break OpenMPI by having Node A wait for
>> messages from Node B. Node B is in fact sleeping while Node C bombards
>> Node A with a few thousand messages. After a while Node B wakes up and
>> sends Node A the message it's been waiting on, but Node A has long since
>> been buried and seg faults. If I decrease the number of messages C is
>> sending, it works properly. This was on OpenMPI 1.2.4 (using, I think, the
>> SM BTL; it might have been MX or TCP, but certainly not InfiniBand). I could
>> dig up the test and try again if anyone is seriously curious.
>>
>> Trying the same test on MPICH/MX went very, very slowly (I don't think they
>> have any clever buffer management) but it didn't crash.
>>
>> Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com wrote:
>>> Hi,
>>>
>>> I am readying an openmpi 1.2.5 software stack for use with a
>>> many-thousand core cluster. I have a question about sending small
>>> messages that I hope can be answered on this list.
>>>
>>> I was under the impression that if node A wants to send a small MPI
>>> message to node B, it must have a credit to do so. The credit assures A
>>> that B has enough buffer space to accept the message. Credits are
>>> required by the mpi layer regardless of the BTL transport layer used.
>>>
>>> I have been told by a Voltaire tech that this is not so, the credits are
>>> used by the infiniband transport layer to reliably send a message, and
>>> is not an openmpi feature.
>>>
>>> Thanks,
>>> Federico
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
--Kris

叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]
#include <mpi.h>      // needed for the MPI calls below
#include <iostream>   // the remaining header names were lost in the archive;
#include <cstdio>     // these are plausible reconstructions
#include <cstring>

#include <cstdlib>  //for atoi (in case someone doesn't have boost)

const int buflen=5000;

int main(int argc, char *argv[]) {
  using namespace std;
  int reps=1000;
  if(argc>1){ //optionally specify number of repeats on the command line
reps=atoi(argv[1]);
  }

  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  

Re: [OMPI users] MPI piggyback mechanism

2008-02-05 Thread Josh Hursey

Oleg,

Interesting work. You mentioned late in your email that you believe  
that adding support for piggybacking to the MPI standard would be the  
best solution. As you may know, the MPI Forum has reconvened and there  
is a working group for Fault Tolerance. This working group is  
discussing a piggybacking interface proposal for the standard, amongst  
other things. If you are interested in contributing to this  
conversation you can find the mailing list here:

 http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft

Best,
Josh

On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote:


Hi,

I've been working on MPI piggyback technique as a part of my PhD work.

Although MPI does not provide native support, there are several different
solutions to transmit piggyback data over every MPI communication. You may
find a brief overview in papers [1, 2]. These include copying the original
message and the extra data to a bigger buffer, sending an additional message,
or changing the sendtype to a dynamically created wrapper datatype that
contains a pointer to the original data and the piggyback data. I have tried
all of these mechanisms and they work, but considering the overhead, there is
no single "best" technique that outperforms the others in all scenarios. Jeff
Squyres had interesting comments on this subject before (in this mailing list).

Finally, after some benchmarking, I have implemented a hybrid technique
that combines existing mechanisms. For small point-to-point messages,
datatype wrapping seems to be the least intrusive, at least considering
the OpenMPI implementation of derived datatypes. For large point-to-point
messages, experiments confirmed that sending an additional message is much
cheaper than wrapping (and besides, the intrusion is small as we are already
sending a large message). Moreover, the implementation may interleave the
original send with an asynchronous send of the piggyback data. This
optimization partially hides the latency of the additional send and lowers
the overall intrusion. The same criteria can be applied to collective
operations, except barrier and reduce operations. As the former does not
transmit any data and the latter transforms the data, the only solution is
to send additional messages.

There is a penalty, of course. Especially for collective operations with
very small messages, the intrusion may reach 15%, and that's a lot. It then
decreases down to 0.1% for bigger messages, but it's still there. I don't
know what your requirements/expectations are for that issue. The only work
that reported lower overheads is [3], but they added native piggyback
support by changing the underlying MPI implementation.

I think the best possible option is to add piggyback support to MPI as a
part of the standard. A growing number of runtime tools use this
functionality for multiple reasons, and certainly PMPI itself is not enough.

References of interest:

   - [1] Shende, S., Malony, A., Morris, A., Wolf, F. "Performance
     Profiling Overhead Compensation for MPI Programs". 12th EuroPVM-MPI
     Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various
     techniques and come up with datatype wrapping.
   - [2] Schulz, M., "Extracting Critical Path Graphs from MPI
     Applications". Cluster Computing 2005, IEEE International, pp. 1-10,
     September 2005. They use datatype wrapping.
   - [3] Jeffrey Vetter, "Dynamic Statistical Profiling of Communication
     Activity in Distributed Applications". They add support for piggyback
     at the MPI implementation level and report very low overheads (no
     surprise).

Regards,
Oleg Morajko


On Feb 1, 2008 5:08 PM, Aurélien Bouteiller wrote:

I don't know of any work in that direction for now. Indeed, we plan to
eventually integrate at least causal message logging in the pml-v,
which also includes piggybacking. Therefore we are open for
collaboration with you on this matter. Please let us know :)

Aurelien


On 1 Feb 2008, at 09:51, Thomas Ropars wrote:

Hi,

I'm currently working on optimistic message logging and I would like to
implement an optimistic message logging protocol in OpenMPI. Optimistic
message logging protocols piggyback information about dependencies
between processes on the application messages to be able to find a
consistent global state after a failure. That's why I'm interested in
the problem of piggybacking information on MPI messages.

Is there any work on this problem at the moment?
Has anyone already implemented some mechanisms in OpenMPI to piggyback
data on MPI messages?

Regards,

Thomas

Oleg Morajko wrote:

Hi,

I'm developing a causality chain tracking library and need a mechanism
to attach extra data to every MPI message, a so-called piggyback
mechanism.

As far as I know there are a few solutions to this problem, of which
the two fundamental ones are the following:

   * Dynamic datatype wrapping - if a user calls MPI_Send with, let's say,
1024 doubles, the 

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Gleb Natapov
On Tue, Feb 05, 2008 at 08:07:59AM -0500, Richard Treumann wrote:
> There is no misunderstanding of the MPI standard or the definition of
> blocking in the bug3 example.  Both bug 3 and the example I provided are
> valid MPI.
> 
> As you say, blocking means the send buffer can be reused when the MPI_Send
> returns.  This is exactly what bug3 is counting on.
> 
> MPI is a reliable protocol which means that once MPI_Send returns, the
> application can assume the message will be delivered once a matching recv
> is posted.  There are only two ways I can think of for MPI to keep that
> guarantee.
> 1) Before return from MPI_Send, copy the envelope and data to some buffer
> that will be preserved until the MPI_Recv gets posted
> 2) Delay the return from MPI_Send until the MPI_Recv is posted, then move
> the data from the intact send buffer to the posted receive buffer and
> return from MPI_Send.
> 
> The requirement in the standard is that if libmpi takes option 1, the
> return from MPI_Send cannot occur unless there is certainty the buffer
> space exists. Libmpi cannot throw the message over the wall and fail the
> job if it cannot be buffered.
As I said, Open MPI has flow control at the transport layer to prevent messages
from being dropped by the network. This mechanism should allow a program like
yours to work, but bug3 is another story because it generates a huge
amount of unexpected messages and Open MPI has no mechanism to keep
unexpected messages from blowing up memory consumption. Your point is that
according to the MPI spec this is not valid behaviour. I am not going to
argue with that, especially as you can get the desired behaviour by setting
the eager limit to zero.

> users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM:
> 
> > On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> > > Bug3 is a test-case derived from a real, scalable application (desmond
> > > for molecular dynamics) that several experienced MPI developers have
> > > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> > > openmpi silently sends them in the background and overwhelms process 0
> > > due to lack of flow control.
> > MPI_Send is *blocking* in the MPI sense of the word, i.e., when MPI_Send
> > returns the send buffer can be reused. MPI spec section 3.4.
> >
> > --
> >  Gleb.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Richard Treumann
Hi Gleb

There is no misunderstanding of the MPI standard or the definition of
blocking in the bug3 example.  Both bug 3 and the example I provided are
valid MPI.

As you say, blocking means the send buffer can be reused when the MPI_Send
returns.  This is exactly what bug3 is counting on.

MPI is a reliable protocol which means that once MPI_Send returns, the
application can assume the message will be delivered once a matching recv
is posted.  There are only two ways I can think of for MPI to keep that
guarantee.
1) Before return from MPI_Send, copy the envelope and data to some buffer
that will be preserved until the MPI_Recv gets posted
2) Delay the return from MPI_Send until the MPI_Recv is posted, then move
the data from the intact send buffer to the posted receive buffer and
return from MPI_Send.

The requirement in the standard is that if libmpi takes option 1, the
return from MPI_Send cannot occur unless there is certainty the buffer
space exists. Libmpi cannot throw the message over the wall and fail the
job if it cannot be buffered.

 Dick


Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM:

> On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> > Bug3 is a test-case derived from a real, scalable application (desmond
> > for molecular dynamics) that several experienced MPI developers have
> > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> > openmpi silently sends them in the background and overwhelms process 0
> > due to lack of flow control.
> MPI_Send is *blocking* in the MPI sense of the word, i.e., when MPI_Send
> returns the send buffer can be reused. MPI spec section 3.4.
>
> --
>  Gleb.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] MPI piggyback mechanism

2008-02-05 Thread Oleg Morajko
Hi,

I've been working on MPI piggyback technique as a part of my PhD work.

Although MPI does not provide native support, there are several different
solutions to transmit piggyback data over every MPI communication. You may
find a brief overview in papers [1, 2]. These include copying the original
message and the extra data to a bigger buffer, sending an additional message,
or changing the sendtype to a dynamically created wrapper datatype that
contains a pointer to the original data and the piggyback data. I have tried
all of these mechanisms and they work, but considering the overhead, there is
no single "best" technique that outperforms the others in all scenarios. Jeff
Squyres had interesting comments on this subject before (in this mailing list).

Finally, after some benchmarking, I have implemented a hybrid technique
that combines existing mechanisms. For small point-to-point messages,
datatype wrapping seems to be the least intrusive, at least considering
the OpenMPI implementation of derived datatypes. For large point-to-point
messages, experiments confirmed that sending an additional message is much
cheaper than wrapping (and besides, the intrusion is small as we are already
sending a large message). Moreover, the implementation may interleave the
original send with an asynchronous send of the piggyback data. This
optimization partially hides the latency of the additional send and lowers
the overall intrusion. The same criteria can be applied to collective
operations, except barrier and reduce operations. As the former does not
transmit any data and the latter transforms the data, the only solution is
to send additional messages.

There is a penalty, of course. Especially for collective operations with very
small messages, the intrusion may reach 15%, and that's a lot. It then
decreases down to 0.1% for bigger messages, but it's still there. I
don't know what your requirements/expectations are for that issue. The only
work that reported lower overheads is [3], but they added native piggyback
support by changing the underlying MPI implementation.

I think the best possible option is to add piggyback support to MPI as a
part of the standard. A growing number of runtime tools use this
functionality for multiple reasons, and certainly PMPI itself is not enough.
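
As an illustration of the datatype-wrapping idea described above, here is a
minimal sketch (not the implementation discussed; it assumes a single int of
piggyback data and a hypothetical PB_Send wrapper around MPI_Send):

/* Hypothetical wrapper that piggybacks one int on every send by building
   a struct datatype over absolute addresses, so the user buffer is never
   copied. The receive side would build the equivalent datatype around its
   receive buffer and a location for the piggyback value. */
#include <mpi.h>

static int piggyback_value = 0;    /* the extra data carried on every send */

int PB_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag,
            MPI_Comm comm) {
  MPI_Datatype wrapper;
  MPI_Datatype types[2];
  MPI_Aint disp[2];
  int blens[2];
  int rc;

  types[0] = dtype;   blens[0] = count;
  types[1] = MPI_INT; blens[1] = 1;

  /* absolute addresses avoid copying the original buffer */
  MPI_Get_address(buf, &disp[0]);
  MPI_Get_address(&piggyback_value, &disp[1]);

  MPI_Type_create_struct(2, blens, disp, types, &wrapper);
  MPI_Type_commit(&wrapper);

  /* with absolute displacements the buffer argument is MPI_BOTTOM */
  rc = MPI_Send(MPI_BOTTOM, 1, wrapper, dest, tag, comm);

  MPI_Type_free(&wrapper);
  return rc;
}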
References of interest:

   - [1] Shende, S., Malony, A., Morris, A., Wolf, F. "Performance
     Profiling Overhead Compensation for MPI Programs". 12th EuroPVM-MPI
     Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various
     techniques and come up with datatype wrapping.
   - [2] Schulz, M., "Extracting Critical Path Graphs from MPI
     Applications". Cluster Computing 2005, IEEE International, pp. 1-10,
     September 2005. They use datatype wrapping.
   - [3] Jeffrey Vetter, "Dynamic Statistical Profiling of Communication
     Activity in Distributed Applications". They add support for piggyback
     at the MPI implementation level and report very low overheads (no surprise).

Regards,
Oleg Morajko


On Feb 1, 2008 5:08 PM, Aurélien Bouteiller  wrote:

> I don't know of any work in that direction for now. Indeed, we plan to
> eventually integrate at least causal message logging in the pml-v,
> which also includes piggybacking. Therefore we are open for
> collaboration with you on this matter. Please let us know :)
>
> Aurelien
>
>
>
> On 1 Feb 2008, at 09:51, Thomas Ropars wrote:
>
> > Hi,
> >
> > I'm currently working on optimistic message logging and I would like
> > to
> > implement an optimistic message logging protocol in OpenMPI.
> > Optimistic
> > message logging protocols piggyback information about dependencies
> > between processes on the application messages to be able to find a
> > consistent global state after a failure. That's why I'm interested in
> > the problem of piggybacking information on MPI messages.
> >
> > Is there any work on this problem at the moment?
> > Has anyone already implemented some mechanisms in OpenMPI to piggyback
> > data on MPI messages?
> >
> > Regards,
> >
> > Thomas
> >
> > Oleg Morajko wrote:
> >> Hi,
> >>
> >> I'm developing a causality chain tracking library and need a
> >> mechanism to attach extra data to every MPI message, a so-called
> >> piggyback mechanism.
> >>
> >> As far as I know there are a few solutions to this problem, of which
> >> the two fundamental ones are the following:
> >>
> >>* Dynamic datatype wrapping - if a user calls MPI_Send with, let's say,
> >>  1024 doubles, the wrapped send call implementation dynamically
> >>  creates a derived datatype that is a structure composed of a
> >>  pointer to the 1024 doubles and extra fields to be piggybacked. The
> >>  datatype is constructed with absolute addresses to avoid copying
> >>  the original buffer. The receiver's side creates the equivalent
> >>  datatype to receive the original data and the extra data. The
> >>  performance of this solution depends on how good the derived
> >>  data type 

Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Gleb Natapov
On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> Bug3 is a test-case derived from a real, scalable application (desmond
> for molecular dynamics) that several experienced MPI developers have
> worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> openmpi silently sends them in the background and overwhelms process 0
> due to lack of flow control.
MPI_Send is *blocking* in the MPI sense of the word, i.e., when MPI_Send
returns the send buffer can be reused. MPI spec section 3.4.

--
Gleb.