Abhishek,

What you are trying to do is not exactly supported by the MPI standard. If a process in an MPI communicator is killed (by a node failure, a 'kill' command, a segmentation fault, or some other unexpected failure) while you are blocked in an MPI call, you are not guaranteed to receive an error. So in the case you cite:

--------------------------
val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               newcomm[i], &stat[i]);
if (val != MPI_SUCCESS)
    printf("Manager: error in Recv\n");

--------------------------

You are using MPI_ANY_SOURCE and MPI_ANY_TAG, so it is reasonable for the MPI_RECV to continue blocking, since we could receive a message from another process in the communicator.

Since fault tolerance is not in the MPI standard, the state of the MPI library after a process exits unexpectedly is undefined by the standard. Some MPI implementations will not allow you to call back into them at all, others will allow only very limited functionality (you may be able to call nothing but MPI_FINALIZE), and others will let you continue to use the library with no limitations.

There are implementations of MPI that allow for various degrees of process fault tolerance (many of them are active contributors to the Open MPI project). For instance, the FT-MPI style of fault tolerance (http://icl.cs.utk.edu/ftmpi/) allows an MPI program to continue execution even if one process in the communicator fails. We are working on integrating this style (and a few other styles) of fault tolerance into Open MPI.

There is another model of fault tolerance in which you would use MPI_COMM_SPAWN to dynamically create communication groups and use those communicators for a form of process fault tolerance. William Gropp and Ewing Lusk wrote a good description of this in their 2004 paper "Fault Tolerance in Message Passing Interface Programs" (http://dx.doi.org/10.1177/1094342004046045), if you are interested in pursuing this type of program.


So in short, MPI_Recv is behaving as it should in this situation, since it could be waiting for other processes in the communication group to send a message. If you need the program to continue even in the face of single-process failures, take a look at the dynamic process manager-worker model, or explore FT-MPI's API for dealing with process loss in a communication group.

I hope this helps, good luck!

Josh


On Feb 16, 2006, at 10:11 AM, Abhishek Agarwal wrote:

Hello All,

I am trying to use MPI_Recv in Open MPI, but I have run into a problem with it.

I have two processes in master-slave mode. I killed the slave process, but the MPI_Recv in my master is still waiting for a response from the slave and never times out with an error. I am checking for MPI_SUCCESS, but the call seems to wait forever, so the program hangs.

I am attaching the section of code which I have used in my program.


--------------------------
val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               newcomm[i], &stat[i]);
if (val != MPI_SUCCESS)
    printf("Manager: error in Recv\n");

--------------------------

Any advice?

Thanks,

Abhishek Agarwal



----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/
