Hello, I have an MPI code that sometimes hangs: it simply stops running. It is not clear why, and it uses many large third-party libraries, so I do not want to try to fix it. The code is easy to restart, but that means I have to monitor it closely, and I'd prefer the restart to happen automatically.
Is there a way, within an MPI process, to detect the hang and abort when it happens? In pseudocode, I would like to do something like

    for (all time steps)
        step
        if (nothing has happened for x minutes)
            call mpi abort to return control to the shell
        endif
    endfor

This structure does not work, however, because once the code is stuck it can no longer do anything, including check itself.

Thank you,
Alex