Dear Dr. Castain,

I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other
info which would be helpful? Partial output follows.

Thanks,
Alex

-bash-4.1$ ompi_info

Package: Open MPI l...@soho.es.its.nyu.edu Distribution
Open MPI: 1.6.5
...
C compiler family name: GNU
C compiler version: 4.8.2


On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Alex
>
> You know all this, but just in case ...
>
> Restartable code goes like this:
>
> *****************************
> .start
>
> read the initial/previous configuration from a file
> ...
> final_step = first_step + nsteps
> time_step = first_step
> do while ( time_step .lt. final_step )
>    ... march in time ...
>    time_step = time_step + 1
> end do
>
> save the final_step configuration (or phase space) to a file
> [depending on the algorithm you may need to save the
> previous config also, or perhaps a few more]
>
> .end
> ************************************************
>
> Then restart the job time and again, until the desired
> number of time steps is completed.
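>
> In C, the same pattern might look like the sketch below (the
> checkpoint file name, its format, and nsteps are placeholders,
> nothing specific to your code):
>
>     /* Restartable time loop: read the last checkpoint if it
>        exists, march nsteps further, write a new checkpoint. */
>     #include <stdio.h>
>
>     int main(void)
>     {
>         const int nsteps = 1000;  /* time granularity per job */
>         int first_step = 0;
>         double state = 0.0;       /* stands in for the full
>                                      configuration/phase space */
>
>         /* .start: read the initial/previous configuration */
>         FILE *fp = fopen("checkpoint.dat", "r");
>         if (fp != NULL) {
>             fscanf(fp, "%d %lf", &first_step, &state);
>             fclose(fp);
>         }
>
>         int final_step = first_step + nsteps;
>         for (int step = first_step; step < final_step; step++) {
>             state += 1.0;         /* ... march in time ... */
>         }
>
>         /* .end: save the final_step configuration to a file */
>         fp = fopen("checkpoint.dat", "w");
>         fprintf(fp, "%d %.17g\n", final_step, state);
>         fclose(fp);
>         return 0;
>     }
>
> Writing the checkpoint to a temporary file and renaming it
> afterwards avoids a truncated checkpoint if the job dies mid-write.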
>
> Job queue systems/resource managers allow a job to resubmit itself,
> so unless a job fails the whole chain feels like a single time
> integration.
>
> If a job fails in the middle you don't lose all your work:
> just restart from the previous saved configuration.
> That is the only situation where you need to "monitor" the code.
> Resource managers/queue systems can also email you when a job
> fails, so that you can intervene manually.
>
> The time granularity per job (nsteps) is up to you.
> Normally it is adjusted to the max walltime of job queues
> (in a shared computer/cluster),
> but in your case it can be adjusted to how often the program fails.
>
> All atmosphere/ocean/climate/weather-forecast models work
> this way (that's what we mostly run here).
> I guess most CFD, computational chemistry, etc. programs do too.
>
> I hope this helps,
> Gus Correa
>
>
>
> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>
>> Hello,
>>
>> I have an MPI code which sometimes hangs: it simply stops running.
>> It is not clear why, and it uses many large third-party libraries,
>> so I do not want to try to fix it. The code is easy to restart,
>> but then it needs to be monitored closely by me, and I'd prefer
>> to do that automatically.
>>
>> Is there a way, within an MPI process, to detect a hang and abort
>> if one occurs?
>>
>> In pseudocode, I would like to do something like this:
>>
>>     for (all time steps)
>>         step
>>         if (nothing has happened for x minutes)
>>             call MPI_Abort to return control to the shell
>>         endif
>>     endfor
>>
>> This structure does not work, however: once the process is stuck
>> it can no longer do anything, including check on itself.
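>>
>> One workaround I can imagine is a separate watchdog thread: the
>> main loop updates a heartbeat timestamp after each step, and the
>> watchdog calls MPI_Abort if the heartbeat goes stale, even while
>> the main thread is stuck inside a library call. A minimal sketch,
>> assuming the MPI library supports MPI_THREAD_MULTIPLE (the timeout
>> and polling interval are placeholders):
>>
>>     #include <mpi.h>
>>     #include <pthread.h>
>>     #include <time.h>
>>     #include <unistd.h>
>>
>>     static volatile time_t last_progress;   /* heartbeat        */
>>     static const int timeout_seconds = 600; /* "x minutes"      */
>>
>>     static void *watchdog(void *arg)
>>     {
>>         (void)arg;
>>         for (;;) {
>>             sleep(30);                      /* polling interval */
>>             if (time(NULL) - last_progress > timeout_seconds)
>>                 MPI_Abort(MPI_COMM_WORLD, 1); /* back to shell  */
>>         }
>>         return NULL;
>>     }
>>
>>     int main(int argc, char **argv)
>>     {
>>         int provided;
>>         MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE,
>>                         &provided);
>>
>>         last_progress = time(NULL);
>>         pthread_t tid;
>>         pthread_create(&tid, NULL, watchdog, NULL);
>>
>>         for (int step = 0; step < 1000; step++) {
>>             /* step();  the possibly-hanging work goes here */
>>             last_progress = time(NULL);     /* heartbeat        */
>>         }
>>
>>         pthread_cancel(tid);                /* stop watchdog    */
>>         MPI_Finalize();
>>         return 0;
>>     }
>>
>> Would something along these lines be reliable with Open MPI?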
>>
>>
>> Thank you,
>> Alex
>>
