Abhishek,
What you are trying to do is not exactly supported by the MPI
standard. If a process in an MPI communicator is killed (by a node
failure, the 'kill' command, a segmentation fault, or some other
unexpected failure) while you are blocked in an MPI call, you are not
guaranteed to receive an error. So in the case you cite:
--------------------------
val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               newcomm[i], &stat[i]);
if (val != MPI_SUCCESS)
    printf("Manager: error in Recv\n");
--------------------------
You are using MPI_ANY_SOURCE and MPI_ANY_TAG, so it is reasonable for
the MPI_RECV to continue blocking, since we could receive a message
from another process in the communicator.
Since fault tolerance is not in the MPI standard, the state of the
MPI library after a process exits unexpectedly is undefined by the
standard. Some MPI implementations will not allow you to call back
into them, others allow it only with very limited functionality
(you may only be able to call MPI_FINALIZE), and others allow
you to keep using the library with no limitations.
There are implementations of MPI that allow for various degrees of
process fault tolerance (many of them are active contributors to the
Open MPI project). For instance, the FT-MPI style of fault tolerance
(http://icl.cs.utk.edu/ftmpi/) allows an MPI program to continue
execution even if one process in the communicator fails. We are
working on integrating this style (and a few other styles) of fault
tolerance into Open MPI.
There is another model of fault tolerance in which you would use
MPI_COMM_SPAWN to dynamically create communication groups and use
those communicators for a form of process fault tolerance. William
Gropp and Ewing Lusk wrote a good description of this in their 2004
paper "Fault Tolerance in Message Passing Interface Programs"
(http://dx.doi.org/10.1177/1094342004046045), if you are interested in
pursuing this type of program.
So in short, MPI_Recv is behaving as it should in this situation,
since it could be waiting for other processes in the communication
group to send a message. If you need your program to continue
even in the face of single-process failures, take a look at the
dynamic-process manager-worker model, or explore FT-MPI's
API for dealing with process loss in a communication group.
I hope this helps, good luck!
Josh
On Feb 16, 2006, at 10:11 AM, Abhishek Agarwal wrote:
Hello All,
I am trying to use MPI_Recv in Open MPI, but I have run into a
problem with it.
I have two processes in master-slave mode. I killed the slave
process, but the process calling MPI_Recv is still waiting for a
response from the slave and never times out with an error. I am
checking for MPI_SUCCESS, but the call seems to wait forever, and
hence the program hangs.
I am attaching the section of code which I have used in my program.
--------------------------
val = MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
newcomm[i], &stat[i]);
if (val != MPI_SUCCESS )
printf("Manager: error in Recv\n");
--------------------------
Any advice?
Thanks,
Abhishek Agarwal
----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/