Dear Open-MPI user list members,
One of my users has an application in which one of the MPI processes
dies, but Open MPI does not kill the rest of the application.
Since the mpirun man page states the following, I would expect mpirun to
take care of killing the application if a process exits without calling
MPI_Finalize:
   Process Termination / Signal Handling
       During the run of an MPI application, if any rank dies abnormally
       (either exiting before invoking MPI_FINALIZE, or dying as the
       result of a signal), mpirun will print out an error message and
       kill the rest of the MPI application.
The following test program demonstrates the behaviour (program hangs until
it is killed by the user or batch system):
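#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

/* Rank that exits early without calling MPI_Finalize */
#define RANK_DEATH 1

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sleep(10);
    /* One rank exits with a nonzero status before MPI_Finalize */
    if (rank == RANK_DEATH)
        exit(1);
    /* The remaining ranks keep running and finalize normally */
    sleep(10);
    MPI_Finalize();
    return 0;
}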
I have tested this with Open MPI 1.2.1 as well as the latest stable
release, 1.2.3, on Linux x86_64.
Is this a bug, or are there flags I can use to force mpirun (or orted,
or...) to kill the whole MPI program when this happens?
If one of the application processes dies from a signal (I have tested SEGV
and FPE) rather than just exiting, the whole application is indeed killed.
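
To make the difference between the two cases concrete, here is a minimal
POSIX sketch of how a parent process can tell a child that called exit()
with a nonzero status from one that was killed by a signal, using
waitpid(). This is not Open MPI code, and I am only assuming that a
launcher observes its children in a broadly similar way. The names
classify_child and enum child_end are made up for this illustration.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical labels, used only in this sketch. */
enum child_end { END_CLEAN, END_EXIT_NONZERO, END_SIGNAL, END_ERROR };

/*
 * Fork a child that either raises signal `sig` (if sig != 0) or calls
 * _exit(code), then report how the parent sees the child terminate.
 */
enum child_end classify_child(int code, int sig)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return END_ERROR;

    if (pid == 0) {
        /* Child: die from a signal if requested, otherwise just exit. */
        if (sig != 0) {
            signal(sig, SIG_DFL);   /* make sure the default action applies */
            raise(sig);
        }
        _exit(code);
    }

    /* Parent: wait for the child and inspect its termination status. */
    if (waitpid(pid, &status, 0) < 0)
        return END_ERROR;
    if (WIFSIGNALED(status))
        return END_SIGNAL;          /* e.g. SEGV or FPE */
    if (WIFEXITED(status))
        return WEXITSTATUS(status) == 0 ? END_CLEAN : END_EXIT_NONZERO;
    return END_ERROR;
}
```

Calling classify_child(1, 0) corresponds to the exit(1) case in my test
program, and classify_child(0, SIGSEGV) corresponds to the signal case
that does get the whole application killed.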
Best regards
Daniel Spångberg