Hello Jeff, Eugene:

On Fri, Dec 12, 2008 at 04:47:11PM -0500, Jeff Squyres wrote:
...<snip>...
> The "P" is MPI's profiling interface.  See chapter 14 in the MPI-2.1
> doc.

Ah... Thank you, both Jeff and Eugene, for pointing that out.  I think
there is a typo in chapter 14 - the first sentence isn't a sentence -
but that's another story.

> > Based on my re-read of the MPI standard, it appears that I may have
> > slightly mis-stated my issue.  The spin is probably taking place in
> > "mpi_send".  "mpi_send", according to my understanding of the MPI
> > standard, may not exit until a matching "mpi_recv" has been initiated,
> > or completed.  At least that is the conclusion I came to.
>
> Perhaps something like this:
>
>     int MPI_Send(...) {
>         MPI_Request req;
>         int flag;
>         PMPI_Isend(..., &req);
>         do {
>             nanosleep(short);
>             PMPI_Test(&req, &flag, MPI_STATUS_IGNORE);
>         } while (!flag);
>     }
>
> That is, *you* provide MPI_Send and intercept all your app's calls to
> MPI_Send.  But you implement it by doing a non-blocking send and
> sleeping and polling MPI to know when it's done.  Of course, you don't
> have to implement this as MPI_Send -- you could always have
> your_func_prefix_send(...) instead of explicitly using the MPI
> profiling interface.  But using the profiling interface allows you to
> swap in/out different implementations of MPI_Send (etc.) at link time,
> if that's desirable to you.
>
> Looping over sleep/test is not the most efficient way of doing it, but
> it may be suitable for your purposes.

Indeed, it is very suitable.  Thank you, both Jeff and Eugene, for
pointing the way.  That solution changes the load for my job from 2.0
to 1.0, as indicated by "xload" over a 40-minute run.  That means I can
*double* the throughput of my machine.

Some gory details: I ignored the suggestion to use MPI_STATUS_IGNORE,
and that got me into some trouble, as you may not be surprised to hear.
The solution was to use MPI_Request_get_status instead of MPI_Test.

As some of my waits (both in MPI_SEND and MPI_RECV) will be very short,
and some will be up to 4 minutes, I implemented a graduated sleep time:
it starts at 1 millisecond and doubles after each sleep, up to a maximum
of 100 milliseconds.  Interestingly, when I left the sleep time at a
constant 1 millisecond, the run load went up significantly, varying over
the range 1.3 to 1.7.

I have attached my MPI_Send.c and MPI_Recv.c.  Comments welcome and
appreciated.

Regards,
Douglas.
--
Douglas Guptill                       email: douglas.gupt...@dal.ca
Research Assistant, LSC 4640          fax:   902-494-3877
Oceanography Department
Dalhousie University
Halifax, NS, B3H 4J1, Canada
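P.S.  The two attached files share the same poll-and-sleep loop.  As a
rough stand-alone sketch of that loop (the helper name wait_for_request,
the hard-coded interval bounds, and returning the error code from
PMPI_Request_get_status are illustrative choices here; the attached
files instead return the MPI_ERROR field of the status):

/* Sketch only: the doubling-sleep wait loop factored into one helper. */
#define _POSIX_C_SOURCE 199309L   /* for nanosleep(); must precede headers */
#include <time.h>
#include "mpi.h"

int wait_for_request(MPI_Request req, MPI_Status *status)
{
    int flag = 0, err = MPI_SUCCESS;
    struct timespec ts;

    ts.tv_sec  = 0;
    ts.tv_nsec = 1000;               /* initial sleep, in nanoseconds    */

    do {
        nanosleep(&ts, NULL);        /* give up the CPU while we wait    */
        ts.tv_nsec *= 2;             /* double the sleep each time ...   */
        if (ts.tv_nsec > 100000)     /* ... up to a fixed maximum        */
            ts.tv_nsec = 100000;
        err = PMPI_Request_get_status(req, &flag, status);
    } while (err == MPI_SUCCESS && !flag);

    return err;
}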
/*
 * Intercept MPI_Recv, and
 * call PMPI_Irecv, then loop over PMPI_Request_get_status and sleep
 * until the receive is done.
 *
 * Revision History:
 * 2008-12-17: copied from MPI_Send.c
 * 2008-12-18: tweaking.
 *
 * See MPI_Send.c for additional comments,
 * especially w.r.t. PMPI_Request_get_status.
 **/

/* The feature-test macro must come before any system header so that
 * <time.h> declares nanosleep(). */
#define _POSIX_C_SOURCE 199309L
#include <time.h>

#include "mpi.h"

int MPI_Recv(void *buff, int count, MPI_Datatype datatype, int from,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    int flag, nsec_start = 1000, nsec_max = 100000;  /* sleep bounds, ns */
    struct timespec ts;
    MPI_Request req;

    ts.tv_sec  = 0;
    ts.tv_nsec = nsec_start;

    /* Post the receive without blocking, then poll with a doubling
     * sleep between polls.  Note: this assumes the caller passes a
     * real MPI_Status (not MPI_STATUS_IGNORE), since we read
     * status->MPI_ERROR below. */
    PMPI_Irecv(buff, count, datatype, from, tag, comm, &req);
    do {
        nanosleep(&ts, NULL);
        ts.tv_nsec *= 2;
        ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
        PMPI_Request_get_status(req, &flag, status);
    } while (!flag);

    return status->MPI_ERROR;
}
/*
 * Intercept MPI_Send, and
 * call PMPI_Isend, then loop over PMPI_Request_get_status and sleep
 * until the send is done.
 *
 * Revision History:
 * 2008-12-12: skeleton by Jeff Squyres <jsquy...@cisco.com>
 * 2008-12-16->18: adding parameters, variable wait,
 *                 change MPI_Test to MPI_Request_get_status
 *                 Douglas Guptill <douglas.gupt...@dal.ca>
 **/

/* When we use this:
 *     PMPI_Test(&req, &flag, &status);
 * we get:
 *     dguptill@DOME:$ mpirun -np 2 mpi_send_recv_test_mine
 *     This is process 0 of 2 .
 *     This is process 1 of 2 .
 *     error: proc 0 ,mpi_send returned -1208109376
 *     error: proc 1 ,mpi_send returned -1208310080
 *     1 changed to 3
 *
 * Using MPI_Request_get_status cures the problem.
 *
 * A read of mpi21-report.pdf confirms that MPI_Request_get_status
 * is the appropriate choice, since there seems to be something
 * between the call to MPI_SEND (MPI_RECV) in my FORTRAN program
 * and MPI_Send.c (MPI_Recv.c).
 **/

/* The feature-test macro must come before any system header so that
 * <time.h> declares nanosleep(). */
#define _POSIX_C_SOURCE 199309L
#include <time.h>

#include "mpi.h"

int MPI_Send(void *buff, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)
{
    int flag, nsec_start = 1000, nsec_max = 100000;  /* sleep bounds, ns */
    struct timespec ts;
    MPI_Request req;
    MPI_Status status;

    ts.tv_sec  = 0;
    ts.tv_nsec = nsec_start;

    /* Start the send without blocking, then poll with a doubling
     * sleep between polls. */
    PMPI_Isend(buff, count, datatype, dest, tag, comm, &req);
    do {
        nanosleep(&ts, NULL);
        ts.tv_nsec *= 2;
        ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
        PMPI_Request_get_status(req, &flag, &status);
    } while (!flag);

    return status.MPI_ERROR;
}
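For anyone who wants to try the wrappers: a minimal driver along these
lines should do (this is only an illustration, not the
mpi_send_recv_test_mine program mentioned above).  Compile it together
with MPI_Send.c and MPI_Recv.c so the wrappers are linked in ahead of
the MPI library's own MPI_Send/MPI_Recv; the program itself only calls
the plain MPI names and never touches the PMPI_ layer.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            value = 42;
            /* resolved by the intercepting MPI_Send above */
            MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* pass a real status: the wrapper reads status->MPI_ERROR */
            MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
            printf("process 1 received %d\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}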