How about periodically sending a 'ping' to a socket that is monitored
by an auxiliary program running on the same node as the master process?
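For example, a rough, untested C sketch of both halves (the port
number, the one-minute timeout, and the loopback address are arbitrary
choices, and error handling is omitted):

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

/* In the MPI program: rank 0 calls this once per time step
   to fire a one-byte UDP datagram at the monitor. */
void send_heartbeat(int sock, const struct sockaddr_in *dest)
{
    const char msg = 'p';
    sendto(sock, &msg, 1, 0, (const struct sockaddr *)dest, sizeof(*dest));
}

/* Auxiliary monitor, run as a separate process on the master node. */
int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(9999);      /* arbitrary port */
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    /* Give up if no ping arrives within 60 seconds. */
    struct timeval tv = { 60, 0 };
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    char buf;
    while (recvfrom(sock, &buf, 1, 0, NULL, NULL) == 1)
        ;  /* each ping re-arms the timeout */

    fprintf(stderr, "monitor: no heartbeat for 60 s, job looks hung\n");
    /* here: kill or resubmit the job, send mail, etc. */
    return 1;
}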
Also, I know you don't want to delve into the third-party libs, but have
you actually tried to get to the bottom of the hang, e.g. by running
strace, attaching a debugger, or, if you have the Intel tools available,
running the MPI profiling tool or similar? Maybe it's something more
fundamental?!
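The usual first steps on a hung rank look something like this, with
<pid> standing for the stuck process's ID:

ps aux | grep <your_program>   # find the PID of a stuck rank
strace -p <pid>                # which system call is it blocked in?
gdb -p <pid>
(gdb) thread apply all bt      # stack trace of every thread
(gdb) detach
(gdb) quit

If every rank turns out to be sitting inside an MPI wait or poll, that
hints at a communication deadlock rather than a crash in the library.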
Good luck,
Cihan
On 18/06/16 01:58, Alex Kaiser wrote:
An outside monitor should work. My outline of the monitor script
(written with advice from the sys admin) leaves room for bugs with
environment variables and such.
I wanted to make sure there was not a simpler solution, or one that is
less error prone. Modifying the main routine that calls the library, or
external scripts, is no problem; I only meant that I did not want to
debug the library internals, which are huge and complicated!
Appreciate the advice. Thank you,
Alex
On Friday, June 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:
Sadly, no - there was some possibility of using a file monitor we
had for a while, but that isn't in the 1.6 series. So I fear your
best bet is to periodically output some kind of marker and have a
separate process that monitors whether it is being updated. Either
way would require modifying code, and that seems to be outside the
desired scope of the solution.
Afraid I don’t know how to accomplish what you seek without code
modification.
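A bare-bones sketch of that marker approach might look like the
following (the file name, the 120-second staleness threshold, and the
action taken are all placeholder choices):

#include <stdio.h>
#include <time.h>
#include <utime.h>
#include <unistd.h>
#include <sys/stat.h>

#define MARKER "heartbeat.txt"   /* placeholder file name */

/* In the MPI code: rank 0 creates the file at startup, then calls
   this once per time step to bump its modification time. */
void touch_marker(void)
{
    utime(MARKER, NULL);
}

/* Separate watcher process. */
int main(void)
{
    struct stat st;
    for (;;) {
        sleep(30);
        if (stat(MARKER, &st) == 0 && time(NULL) - st.st_mtime > 120) {
            fprintf(stderr, "watcher: marker is stale, job appears hung\n");
            /* kill or resubmit the job here, e.g. via the batch system */
            return 1;
        }
    }
}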
On Jun 16, 2016, at 10:16 PM, Alex Kaiser <adkai...@gmail.com> wrote:
Dear Dr. Castain,
I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any
other info which would be helpful? Partial output follows.
Thanks,
Alex
-bash-4.1$ ompi_info
Package: Open MPI l...@soho.es.its.nyu.edu Distribution
Open MPI: 1.6.5
...
C compiler family name: GNU
C compiler version: 4.8.2
On Thu, Jun 16, 2016 at 8:44 PM, Gus Correa
<g...@ldeo.columbia.edu> wrote:
Hi Alex
You know all this, but just in case ...
Restartable code goes like this:
****************************************
.start
  read the initial/previous configuration from a file
  ...
  final_step = first_step + nsteps
  time_step  = first_step
  while ( time_step .le. final_step )
    ... march in time ...
    time_step = time_step + 1
  end
  save the final_step configuration (or phase space) to a file
  [depending on the algorithm you may need to save the
   previous config also, or perhaps a few more]
.end
****************************************
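In plain C, the same skeleton might look roughly like this (the flat
binary dump of a single state array, the file name, and nsteps are
stand-ins for whatever the real code actually needs to save):

#include <stdio.h>

#define N 1000000   /* placeholder problem size */

int main(void)
{
    static double state[N];          /* the model's phase space */
    long first_step = 0, nsteps = 10000;

    /* read the initial/previous configuration, if a checkpoint exists */
    FILE *f = fopen("checkpoint.bin", "rb");
    if (f) {
        fread(&first_step, sizeof first_step, 1, f);
        fread(state, sizeof(double), N, f);
        fclose(f);
    }

    long final_step = first_step + nsteps;
    for (long step = first_step; step < final_step; step++) {
        /* ... march in time ... */
    }

    /* save the configuration so the next job resumes at final_step */
    f = fopen("checkpoint.bin", "wb");
    fwrite(&final_step, sizeof final_step, 1, f);
    fwrite(state, sizeof(double), N, f);
    fclose(f);
    return 0;
}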
Then restart the job time and again, until the desired number of time
steps is completed.
Job queue systems/resource managers allow a job to resubmit itself,
so unless a job fails it feels like a single time integration.
If a job fails in the middle, you don't lose all the work; you just
restart from the previously saved configuration. That is the only
situation where you need to "monitor" the code. Resource managers/
queue systems can also email you in case the job fails, warning you
to do manual intervention.
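For instance, a PBS-style job script could combine the self-resubmission
and the failure mail (treat this only as a sketch; directives, file
names, and the completion test vary by site and scheduler):

#!/bin/bash
#PBS -m a                   # mail me if the job aborts
#PBS -M you@example.com     # placeholder address
cd $PBS_O_WORKDIR
mpirun ./model              # runs nsteps and writes a checkpoint
# resubmit until done; the "done" file is a placeholder for however
# the code signals that the last time step has been reached
if [ ! -f done ]; then
    qsub run.pbs            # this script's own file name
fi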
The time granularity per job (nsteps) is up to you.
Normally it is adjusted to the max walltime of the job queues
(on a shared computer/cluster), but in your case it can be adjusted
to how often the program fails.
All atmosphere/ocean/climate/weather-forecast models work this way
(that's what we mostly run here). I guess most CFD, computational
chemistry, etc., programs do too.
I hope this helps,
Gus Correa
On 06/16/2016 05:25 PM, Alex Kaiser wrote:
Hello,
I have an MPI code which sometimes hangs: it simply stops running.
It is not clear why, and it uses many large third-party libraries,
so I do not want to try to fix it. The code is easy to restart, but
then it needs to be monitored closely by me, and I'd prefer to do it
automatically.
Is there a way, within an MPI process, to detect the hang
and abort if so?
In pseudocode, I would like to do something like
for (all time steps)
    step
    if (nothing has happened for x minutes)
        call mpi abort to return control to the shell
    endif
endfor
This structure does not work, however, as the process can no longer
do anything, including check on itself, once it is stuck.
Thank you,
Alex