Re: [OMPI users] Restart after code hangs

2016-06-17 Thread Alex Kaiser
An outside monitor should work. My outline of the monitor script (written
with advice from the sysadmin) leaves plenty of room for bugs with environment
variables and the like.
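
For concreteness, the monitor I have in mind is roughly the sketch below. It
assumes the solver touches a marker file (here heartbeat.txt) every few time
steps; the file name, the timeout, and the scancel call are placeholders for
whatever our scheduler actually needs, not lines from my real script.

#!/usr/bin/env python
# Hypothetical watchdog -- not my actual script. It assumes the solver
# touches a marker file ("heartbeat.txt") every few time steps. If the
# file goes stale the run is presumed hung, and the job is cancelled so
# that control returns to the shell / the queue.
import os
import subprocess
import sys
import time

HEARTBEAT = "heartbeat.txt"   # marker file the solver writes (assumed name)
STALE_SECONDS = 15 * 60       # no update for this long counts as a hang
POLL_SECONDS = 60             # how often to check

def watch(job_id):
    while True:
        time.sleep(POLL_SECONDS)
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT)
        except OSError:
            continue          # marker not written yet; keep waiting
        if age > STALE_SECONDS:
            # Scheduler-specific part: scancel works for SLURM; qdel for
            # PBS/SGE; or kill the mpirun PID for an interactive run.
            subprocess.call(["scancel", job_id])
            sys.exit("no heartbeat for %d s, cancelled job %s" % (age, job_id))

if __name__ == "__main__":
    watch(sys.argv[1])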

I wanted to make sure there was not a simpler or less error-prone solution.
Modifying the main routine that calls the library, or the external scripts, is
no problem; I only meant that I did not want to debug the library internals,
which are huge and complicated!
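
The change on the code side would be tiny as well: rank 0 just updates the
marker file after every step so the monitor can see progress. A rough sketch
(illustrative Python using mpi4py, not our actual driver):

# Code-side half of the marker idea (illustrative only, via mpi4py).
# Rank 0 touches the heartbeat file after each time step so an outside
# watchdog can tell whether the run is still making progress.
from mpi4py import MPI

HEARTBEAT = "heartbeat.txt"       # must match the watchdog's file name
NUMBER_OF_STEPS = 1000            # placeholder

comm = MPI.COMM_WORLD

def advance_one_step():
    pass                          # placeholder for the real solver call

def touch_heartbeat():
    if comm.Get_rank() == 0:
        with open(HEARTBEAT, "w") as f:
            f.write("alive\n")    # updating the mtime is all that matters

for step in range(NUMBER_OF_STEPS):
    advance_one_step()
    touch_heartbeat()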

Appreciate the advice. Thank you,
Alex

On Friday, June 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:

> Sadly, no - there was some possibility of using a file monitor we had for
> a while, but that isn’t in the 1.6 series. So I fear your best bet is to
> periodically output some kind of marker, and have a separate process that
> monitors to see if it is being updated. Either way would require modifying
> code and that seems to be outside the desired scope of the solution.
>
> Afraid I don’t know how to accomplish what you seek without code
> modification.
>
> On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
>
> Dear Dr. Castain,
>
> I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other
> info which would be helpful? Partial output follows.
>
> Thanks,
> Alex
>
> -bash-4.1$ ompi_info
>
> Package: Open MPI l...@soho.es.its.nyu.edu Distribution
> Open MPI: 1.6.5
> ...
> C compiler family name: GNU
> C compiler version: 4.8.2
>
>
> On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
>> Hi Alex
>>
>> You know all this, but just in case ...
>>
>> Restartable code goes like this:
>>
>> *
>> .start
>>
>> read the initial/previous configuration from a file
>> ...
>> final_step = first_step + nsteps
>> time_step = first_step
>> while ( time_step .le. final_step )
>>   ... march in time ...
>>   time_step = time_step +1
>> end
>>
>> save the final_step configuration (or phase space) to a file
>> [depending on the algorithm you may need to save the
>> previous config also, or perhaps a few more]
>>
>> .end
>> 
>>
>> Then restart the job time and again, until the desired
>> number of time steps is completed.
>>
>> Job queue systems/resource managers allow a job to resubmit itself,
>> so unless a job fails it feels like a single time integration.
>>
>> If a job fails in the middle, you don't lose all work,
>> just restart from the previous saved configuration.
>> That is the only situation where you need to "monitor" the code.
>> Resource managers/queue systems can also email you in
>> case the job fails, warning you that manual intervention is needed.
>>
>> The time granularity per job (nsteps) is up to you.
>> Normally it is adjusted to the max walltime of job queues
>> (in a shared computer/cluster),
>> but in your case it can be adjusted to how often the program fails.
>>
>> All atmosphere/ocean/climate/weather_forecast models work
>> this way (that's what we mostly run here).
>> I guess most CFD, computational Chemistry, etc, programs also do.
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>>
>> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>>
>>> Hello,
>>>
>>> I have an MPI code which sometimes hangs; it simply stops running. It is
>>> not clear why, and it uses many large third-party libraries, so I do not
>>> want to try to fix it. The code is easy to restart, but then it needs to
>>> be monitored closely by me, and I'd prefer to do that automatically.
>>>
>>> Is there a way, within an MPI process, to detect the hang and abort if
>>> so?
>>>
>>> In pseudocode, I would like to do something like
>>>
>>> for (all time steps)
>>>  step
>>>  if (nothing has happened for x minutes)
>>>
>>>  call mpi abort to return control to the shell
>>>
>>>  endif
>>>
>>> endfor
>>>
>>> This structure does not work, however, as the process can no longer do
>>> anything, including checking itself, once it is stuck.
>>>
>>>
>>> Thank you,
>>> Alex
>>>
>>>
>>>


Re: [OMPI users] Restart after code hangs

2016-06-17 Thread Alex Kaiser
Dear Dr. Correa,

This is indeed the structure; it is a CFD program. Most of what you are
suggesting is my current workflow, including saving checkpoints, sending
emails upon a crash, and restarting.
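
(For reference, my restart driver follows roughly the generic sketch below; it
is a simplified illustration, not the real code, and the checkpoint file name
and step counts are placeholders.)

# Generic restartable-driver sketch (simplified; names are placeholders).
# Each job loads the last checkpoint, marches nsteps, saves a new
# checkpoint, and the queue system resubmits until the run is done.
import os
import pickle

CHECKPOINT = "checkpoint.pkl"
NSTEPS_PER_JOB = 1000

def initial_condition():
    return {"step": 0, "fields": [0.0]}   # stand-in for the real initial data

def advance_one_step(state):
    state["fields"][0] += 1.0             # stand-in for the real time step
    state["step"] += 1

def run():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)
    else:
        state = initial_condition()

    final_step = state["step"] + NSTEPS_PER_JOB
    while state["step"] < final_step:
        advance_one_step(state)

    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

if __name__ == "__main__":
    run()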

The problem is that the code does not crash but hangs. If it deadlocks, it
sits there spinning cycles until I happen to check. Monitoring the code like
this has become inefficient -- sometimes an overnight run makes progress for
only half an hour and I don't notice until the morning. Also, restarting from
that point means sitting in the queue again. I will try to better understand
the job system's automatic resubmission, but for now I do not see how to use
it to fix the deadlock problem.

After thinking about your email, perhaps I can phrase my question more
precisely: how can I return control to the shell if the MPI process has
deadlocked?

Thank you,
Alex





On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Alex
>
> You know all this, but just in case ...
>
> Restartable code goes like this:
>
> *
> .start
>
> read the initial/previous configuration from a file
> ...
> final_step = first_step + nsteps
> time_step = first_step
> while ( time_step .le. final_step )
>   ... march in time ...
>   time_step = time_step +1
> end
>
> save the final_step configuration (or phase space) to a file
> [depending on the algorithm you may need to save the
> previous config also, or perhaps a few more]
>
> .end
> 
>
> Then restart the job time and again, until the desired
> number of time steps is completed.
>
> Job queue systems/resource managers allow a job to resubmit itself,
> so unless a job fails it feels like a single time integration.
>
> If a job fails in the middle, you don't lose all work,
> just restart from the previous saved configuration.
> That is the only situation where you need to "monitor" the code.
> Resource managers/queue systems can also email you in
> case the job fails, warning you that manual intervention is needed.
>
> The time granularity per job (nsteps) is up to you.
> Normally it is adjusted to the max walltime of job queues
> (in a shared computer/cluster),
> but in your case it can be adjusted to how often the program fails.
>
> All atmosphere/ocean/climate/weather_forecast models work
> this way (that's what we mostly run here).
> I guess most CFD, computational Chemistry, etc, programs also do.
>
> I hope this helps,
> Gus Correa
>
>
>
> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>
>> Hello,
>>
>> I have an MPI code which sometimes hangs; it simply stops running. It is
>> not clear why, and it uses many large third-party libraries, so I do not
>> want to try to fix it. The code is easy to restart, but then it needs to
>> be monitored closely by me, and I'd prefer to do that automatically.
>>
>> Is there a way, within an MPI process, to detect the hang and abort if so?
>>
>> In pseudocode, I would like to do something like
>>
>> for (all time steps)
>>  step
>>  if (nothing has happened for x minutes)
>>
>>  call mpi abort to return control to the shell
>>
>>  endif
>>
>> endfor
>>
>> This structure does not work, however, as the process can no longer do
>> anything, including checking itself, once it is stuck.
>>
>>
>> Thank you,
>> Alex
>>
>>
>>


Re: [OMPI users] Restart after code hangs

2016-06-17 Thread Alex Kaiser
Dear Dr. Castain,

I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other
info which would be helpful? Partial output follows.

Thanks,
Alex

-bash-4.1$ ompi_info

Package: Open MPI l...@soho.es.its.nyu.edu Distribution
Open MPI: 1.6.5
...
C compiler family name: GNU
C compiler version: 4.8.2


On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Alex
>
> You know all this, but just in case ...
>
> Restartable code goes like this:
>
> *
> .start
>
> read the initial/previous configuration from a file
> ...
> final_step = first_step + nsteps
> time_step = first_step
> while ( time_step .le. final_step )
>   ... march in time ...
>   time_step = time_step +1
> end
>
> save the final_step configuration (or phase space) to a file
> [depending on the algorithm you may need to save the
> previous config also, or perhaps a few more]
>
> .end
> 
>
> Then restart the job time and again, until the desired
> number of time steps is completed.
>
> Job queue systems/resource managers allow a job to resubmit itself,
> so unless a job fails it feels like a single time integration.
>
> If a job fails in the middle, you don't lose all work,
> just restart from the previous saved configuration.
> That is the only situation where you need to "monitor" the code.
> Resource managers/queue systems can also email you in
> case the job fails, warning you that manual intervention is needed.
>
> The time granularity per job (nsteps) is up to you.
> Normally it is adjusted to the max walltime of job queues
> (in a shared computer/cluster),
> but in your case it can be adjusted to how often the program fails.
>
> All atmosphere/ocean/climate/weather_forecast models work
> this way (that's what we mostly run here).
> I guess most CFD, computational Chemistry, etc, programs also do.
>
> I hope this helps,
> Gus Correa
>
>
>
> On 06/16/2016 05:25 PM, Alex Kaiser wrote:
>
>> Hello,
>>
>> I have an MPI code which sometimes hangs; it simply stops running. It is
>> not clear why, and it uses many large third-party libraries, so I do not
>> want to try to fix it. The code is easy to restart, but then it needs to
>> be monitored closely by me, and I'd prefer to do that automatically.
>>
>> Is there a way, within an MPI process, to detect the hang and abort if so?
>>
>> In pseudocode, I would like to do something like
>>
>> for (all time steps)
>>  step
>>  if (nothing has happened for x minutes)
>>
>>  call mpi abort to return control to the shell
>>
>>  endif
>>
>> endfor
>>
>> This structure does not work, however, as the process can no longer do
>> anything, including checking itself, once it is stuck.
>>
>>
>> Thank you,
>> Alex
>>
>>
>>


[OMPI users] Restart after code hangs

2016-06-16 Thread Alex Kaiser
Hello,

I have an MPI code which sometimes hangs; it simply stops running. It is not
clear why, and it uses many large third-party libraries, so I do not want to
try to fix it. The code is easy to restart, but then it needs to be
monitored closely by me, and I'd prefer to do that automatically.

Is there a way, within an MPI process, to detect the hang and abort if so?

In pseudocode, I would like to do something like

for (all time steps)
    step
    if (nothing has happened for x minutes)
        call MPI_Abort to return control to the shell
    endif
endfor

This structure does not work, however, as the process can no longer do
anything, including checking itself, once it is stuck.


Thank you,
Alex