Re: [OMPI users] MPI_Recv question

Josh Hursey Sun, 19 Feb 2006 11:26:06 -0500

Abhishek,

What you are trying to do is not exactly supported by the MPI standard. If a process in an MPI communicator is killed (by a node failure, 'kill' command, segmentation fault, or other unexpected failure) and you are blocking on an MPI call, you are not always guaranteed to receive an error. So in the case you cite:

--------------------------
val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               newcomm[i], &stat[i]);
if (val != MPI_SUCCESS)
    printf("Manager: error in Recv\n");
--------------------------

You are using MPI_ANY_SOURCE and MPI_ANY_TAG, so it is reasonable for the MPI_RECV to continue blocking, since we could receive a message from another process in the communicator.

Since fault tolerance is not in the MPI standard, when a process exits unexpectedly the state of the MPI library is undefined by the standard. Some MPI implementations will not allow you to call back into them, others will allow you to with very limited functionality (you may be able to only call MPI_FINALIZE), and others will allow you to use it with no limitations.

There are implementations of MPI that allow for various degrees of process fault tolerance (many of them are active contributors to the Open MPI project). For instance, the FT-MPI style of fault tolerance (http://icl.cs.utk.edu/ftmpi/) allows an MPI program to continue execution even if one process in the communicator fails. We are working on integrating this style (and a few other styles) of fault tolerance into Open MPI.

There is another model of fault tolerance in which you would use MPI_COMM_SPAWN to dynamically create communication groups and use those communicators for a form of process fault tolerance. William Gropp and Ewing Lusk wrote a good description of this in their 2004 paper "Fault Tolerance in Message Passing Interface Programs" (http://dx.doi.org/10.1177/1094342004046045), if you are interested in pursuing this type of program.

So in short, MPI_Recv is behaving as it should in this situation, since it could be waiting for other processes in the communication group to send a message. If you need to support program continuation even in the face of single-process failures, take a look at the dynamic process manager-worker model, or you might explore FT-MPI's API for dealing with process loss in a communication group.


I hope this helps, good luck!

Josh


On Feb 16, 2006, at 10:11 AM, Abhishek Agarwal wrote:

Hello All,
I am trying to use MPI_Recv in Open MPI, but I have run into some problems with it.
I have two processes in master-slave mode. I killed the slave process, but my MPI_Recv call is still waiting for a response from the slave and never times out with an error. I am checking for MPI_SUCCESS, but it seems to wait forever, and hence the program hangs.

I am attaching the section of code that I have used in my program.


--------------------------
val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               newcomm[i], &stat[i]);
if (val != MPI_SUCCESS)
    printf("Manager: error in Recv\n");
--------------------------

Any advice?

Thanks,

Abhishek Agarwal


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/
