Hello,
I have an MPI code which sometimes hangs: it simply stops making progress.
It is not clear why, and it uses many large third-party libraries, so I do
not want to try to fix it. The code is easy to restart, but that means I
have to monitor it closely, and I'd prefer to automate this.
Is there a way, within an MPI process, to detect the hang and abort when
it happens? In pseudocode, I would like to do something like:
    for (all time steps)
        step
        if (nothing has happened for x minutes)
            call mpi abort to return control to the shell
        endif
    endfor
This structure does not work, however: once the process is stuck inside
a step, it can no longer do anything, including run the check itself.
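
One direction that might get around this is a separate watchdog thread
that keeps running while the main thread is stuck. Below is a minimal
sketch of that idea, not a tested solution. The class name Watchdog, its
methods and the on_timeout hook are all hypothetical; in the real code the
hook would be a call to MPI_Abort on MPI_COMM_WORLD. It assumes the hung
call does not block other threads (in Python, that it releases the GIL)
and that the MPI library was initialized with a thread support level that
allows MPI calls from a second thread.

```python
import threading
import time


class Watchdog:
    """Call on_timeout once if kick() is not called for `timeout` seconds."""

    def __init__(self, timeout, on_timeout, poll=0.05):
        self.timeout = timeout
        # Hypothetical hook: in an MPI code this would wrap MPI_Abort.
        self.on_timeout = on_timeout
        self.poll = poll
        self._last = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def kick(self):
        # Called by the main loop after every completed time step.
        with self._lock:
            self._last = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Wake up every `poll` seconds until stop() is called.
        while not self._stop.wait(self.poll):
            with self._lock:
                idle = time.monotonic() - self._last
            if idle > self.timeout:
                self.on_timeout()
                return
```

The time loop would call start() once before the first step, kick() after
each step, and stop() after the last one. The timeout has to be longer
than the slowest legitimate step, or a healthy run gets aborted.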
Thank you,
Alex