Hi all,

I have a question about setting a timeout limit for MPI data transmissions. Our users run their parallel jobs (with Open MPI) on our HPC cluster. Sometimes a job hangs for an unknown reason. In that case the job is still in "RUN" status and all of its processes are running, but no output is produced (normally a job writes to its output files periodically). We suspect this may be caused by a broken MPI communication link, which stalls the entire job.
For example, all processes exchange data at the end of the computation and synchronize using the MPI_Waitall function. If one of the communication links (e.g. Ethernet, InfiniBand) fails and the system is not able to detect it, then all processes remain blocked in MPI_Waitall indefinitely. The whole job still looks "running" but it doesn't do anything useful.

My question is: is it possible to set a "timeout" for MPI functions, so that if the time spent in a function (e.g. MPI_Waitall) exceeds a preset limit, the function is aborted and the whole job is terminated?

In our environment we use Open MPI v1.4.5 and v1.6.5 on Linux. The job scheduler is LSF v8.4.

Thanks for the help,
Qi