Hi all,

I have a question about setting a timeout limit for MPI data transmissions. Our
users run their parallel jobs (with Open MPI) on our HPC cluster. Sometimes a
job hangs for an unknown reason. In such a case the job is still in "RUN"
status and all of its processes are running, but no output is produced
(normally a job writes to its output files periodically). We suspect that this
may be caused by a broken MPI communication link, which stalls the entire job.

For example, all processes exchange data at the end of the computation and
synchronize using the MPI_Waitall function. If one of the communication links
(e.g. Ethernet, InfiniBand) fails and the system is not able to detect it, then
all processes remain blocked in MPI_Waitall indefinitely. The whole job still
looks "running" but it doesn't do anything useful.

My question is: is it possible to set a timeout for MPI functions, so that if
the time spent in a function (e.g. MPI_Waitall) exceeds the preset limit, the
function is aborted and the whole job is terminated?
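
To make the desired behaviour concrete, here is a rough sketch of an external
watchdog, not an MPI or Open MPI feature. It starts the job, watches the
modification time of one output file, and terminates the job if that file has
not changed for a given number of seconds. The function name
run_with_stall_timeout, its parameters, and the idea of watching a single
output file are all assumptions made for illustration; under LSF one would
probably call bkill on the job instead of signalling mpirun directly.

```python
import os
import subprocess
import time


def run_with_stall_timeout(cmd, output_path, stall_limit, poll_interval=1.0):
    """Run cmd and stop it if output_path is not updated for stall_limit seconds.

    Hypothetical helper, for illustration only.
    Returns (returncode, stalled), where stalled is True if the watchdog fired.
    """
    proc = subprocess.Popen(cmd)
    last_progress = time.time()   # when we last saw the output file change
    last_mtime = None             # last observed modification time (None = no file yet)

    while True:
        rc = proc.poll()
        if rc is not None:
            # The job finished on its own.
            return rc, False

        try:
            mtime = os.path.getmtime(output_path)
        except OSError:
            mtime = None          # file does not exist (yet)

        if mtime != last_mtime:
            # Output file appeared or was updated: the job is making progress.
            last_mtime = mtime
            last_progress = time.time()
        elif time.time() - last_progress > stall_limit:
            # No progress for too long: ask the launcher to shut down,
            # then force-kill it if it does not exit in time.
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
            return proc.returncode, True

        time.sleep(poll_interval)
```

Something like run_with_stall_timeout(["mpirun", "-np", "64", "./app"],
"result.out", stall_limit=1800) would be the intended use, where ./app and
result.out are placeholder names. This only detects the symptom (no output)
from outside the job; what I am really asking for is an equivalent limit
inside the MPI calls themselves.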

In our environment, we use Open MPI v1.4.5 and v1.6.5 on Linux. The job
scheduler is LSF v8.4.

Thanks for the help,

Qi
