Hello,

I have an MPI code that sometimes hangs: it simply stops making progress.
It is not clear why, and because it uses many large third-party libraries
I do not want to try to track down the cause. The code is easy to restart,
but for now I have to monitor it closely myself, and I would prefer to
have this handled automatically.

Is there a way, from within an MPI process, to detect such a hang and abort
when it happens?

In pseudocode, I would like to do something like

for (all time steps)
    step
    if (nothing has happened for x minutes)
        call mpi abort to return control to the shell
    endif
endfor

This structure does not work, however, because once the code is stuck
inside a step it can no longer do anything, including run the check itself.


Thank you,
Alex
