Restartable code goes like this:


read the initial/previous configuration from a file
final_step = first_step + nsteps
time_step = first_step
while ( time_step .lt. final_step )
  ... march in time ...
  time_step = time_step + 1
end while

save the final_step configuration (or phase space) to a file
[depending on the algorithm you may need to save the
previous config also, or perhaps a few more]


Then restart the job time and again, until the desired
number of time steps is completed.
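
A minimal Python sketch of the loop above, in case it helps to see it
spelled out. Everything named here is an assumption for illustration:
the JSON state file, the state_path and advance arguments, and the toy
"configuration" made of a step counter and one number. A real model
would save its full phase space in its own restart format.

```python
import json
import os


def run_segment(state_path, nsteps, advance):
    """Run nsteps time steps, resuming from state_path if it exists.

    state_path and advance are hypothetical names: the restart file and a
    function that marches the (toy) configuration forward by one step.
    """
    if os.path.exists(state_path):
        # Restart: read the previously saved configuration.
        with open(state_path) as f:
            state = json.load(f)
    else:
        # First run: start from an assumed initial configuration.
        state = {"step": 0, "value": 0.0}

    final_step = state["step"] + nsteps
    while state["step"] < final_step:
        state["value"] = advance(state["value"])  # march in time
        state["step"] += 1

    # Write to a temporary file and rename it, so a crash in the middle
    # of the write does not destroy the previous restart file.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, state_path)
    return state
```

Calling run_segment twice with nsteps=5 on the same file leaves the
saved step counter at 10, which is the "restart time and again" pattern.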

Job queue systems/resource managers allow a job to resubmit itself,
so unless a job fails it feels like a single time integration.

If a job fails in the middle, you don't lose all the work;
you just restart from the last saved configuration.
That is the only situation where you need to "monitor" the code.
Resource managers/queue systems can also email you if
the job fails, so you know when manual intervention is needed.

The time granularity per job (nsteps) is up to you.
Normally it is adjusted to the max walltime of job queues
(in a shared computer/cluster),
but in your case it can be adjusted to how often the program fails.
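
If there is no queue system to do the resubmission, a small driver
script can play that role. The sketch below is only one way to do it,
under assumptions of mine: run_until_done, cmd, segments, timeout_s and
max_retries are hypothetical names, and the time limit stands in for the
queue's max walltime. It relies on Python's subprocess.run, which kills
the child and raises TimeoutExpired when the timeout passes. Whether
killing the MPI launcher also cleans up every rank depends on the MPI
implementation, so check that on your system.

```python
import subprocess


def run_until_done(cmd, segments, timeout_s, max_retries=3):
    """Run cmd once per segment, rerunning a segment that hangs or fails.

    cmd is assumed to be a restartable program that picks up from its
    last saved configuration each time it starts.
    """
    completed = 0
    retries = 0
    while completed < segments:
        try:
            # A hang shows up as a timeout, a crash as a nonzero exit code.
            subprocess.run(cmd, timeout=timeout_s, check=True)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            retries += 1
            if retries > max_retries:
                raise RuntimeError("segment keeps failing, giving up")
            continue  # rerun the same segment from the saved configuration
        completed += 1
        retries = 0
    return completed
```

The timeout should be comfortably longer than a healthy segment takes,
so a slow but working run is not killed by mistake.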

All atmosphere/ocean/climate/weather_forecast models work
this way (that's what we mostly run here).
I guess most CFD, computational chemistry, etc., programs do too.

On 06/16/2016 05:25 PM, Alex Kaiser wrote:

I have an MPI code which sometimes hangs, simply stops running. It is
not clear why and it uses many large third party libraries so I do not
want to try to fix it. The code is easy to restart, but then it needs to
be monitored closely by me, and I'd prefer to do it automatically.

Is there a way, within an MPI process, to detect the hang and abort if so?

In pseudocode, I would like to do something like

    for (all time steps)
         if (nothing has happened for x minutes)
             call mpi abort to return control to the shell

This structure does not work, however, because the code can no longer
do anything, including check itself, when it is stuck.

