Have a look at the FAQ; we discuss quite a few of these kinds of issues:

- http://www.open-mpi.org/faq/?category=tuning
- http://www.open-mpi.org/faq/?category=openfabrics

More specifically, what Eugene is saying is correct -- OMPI has made tradeoffs for various complicated reasons. One of the things that we sacrificed in the common case was communication/computation overlap on OpenFabrics networks.

If you want good overlap, set the MCA parameter mpi_leave_pinned to 1 (on OpenFabrics networks). This will effectively move the bulk of the message-passing progress (but not all of it) down to the hardware. Hence, when you sleep or do real computation while looping over MPI_Test, the message is probably actually being progressed in the background. You won't see this kind of overlap with other transports such as shared memory or TCP because we don't have hardware assist in those cases.
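
In case the mechanics are not obvious (the executable name below is just a placeholder), MCA parameters like this one are usually set on the mpirun command line, e.g.

    mpirun --mca mpi_leave_pinned 1 -np 4 ./my_app

or by exporting OMPI_MCA_mpi_leave_pinned=1 in the environment before launching.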


On Nov 6, 2008, at 12:52 PM, Eugene Loh wrote:

vladimir marjanovic wrote:

> In order to overlap communication and computation I don't want to use MPI_Wait.

Right. One thing to keep in mind is that there are two ways of overlapping communication and computation. One is that you start a send (MPI_Isend), do a bunch of computation while the message is being sent, and then, after the message has been sent, call MPI_Wait just to clean up. This assumes that the MPI implementation can send a message while control of the program has been returned to you. The experts can give you the fine print, but my simple assertion is, "This doesn't usually happen."

Rather, the MPI implementation typically will send data only when your code is in some MPI call. That's why you have to call MPI_Test periodically... or some other MPI function.
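
Concretely, the second pattern looks something like the rough sketch below; do_some_computation() is just a placeholder for a chunk of your real work:

    #include <mpi.h>

    /* Placeholder for a chunk of real computation. */
    static void do_some_computation(void) { /* ... */ }

    static void send_with_overlap(double *buf, int count, int dest, MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        /* Start the send; control returns to the caller immediately. */
        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req);

        /* Interleave computation with MPI_Test so the library gets a
         * chance to progress the message between chunks of work. */
        while (!done) {
            do_some_computation();
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }

        /* Once MPI_Test reports completion the request is freed, so no
         * separate MPI_Wait is needed. */
    }
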
> For sure the message is being decomposed into chunks, and the chunk size is probably defined by an environment variable.
> Do you know how I can control the chunk size?

I don't. Try running "ompi_info -a" and looking through the parameters. For the shared-memory BTL, it's mca_btl_sm_max_frag_size. I also see something like coll_sm_fragment_size. Maybe look at the parameters that have "btl_openib_max" in their names.
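
If it helps, piping that output through grep, e.g. "ompi_info -a | grep frag" or "ompi_info -a | grep btl_openib_max" (just guesses at useful filters), is a quick way to narrow the list down.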


--
Jeff Squyres
Cisco Systems
