Re: [OMPI users] Restart after code hangs

2016-06-18 Thread Moody, Adam T.
How about sending a 'ping' to a socket periodically which is monitored by an auxiliary program that runs where the master process runs? Also, I know you don't want to delve into the third-party libs but have you actually tried to get to the bottom of the hang, e.g

Re: [OMPI users] Restart after code hangs

2016-06-18 Thread Cihan Altinay
How about sending a 'ping' to a socket periodically which is monitored by an auxiliary program that runs where the master process runs? Also, I know you don't want to delve into the third-party libs but have you actually tried to get to the bottom of the hang, e.g. run an strace, attach a
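
The socket 'ping' idea above could look roughly like the following sketch. It assumes UDP on the node where the master process runs; the function names (`open_monitor`, `send_ping`, `wait_for_ping`) and the payload are hypothetical and not part of Open MPI or of the original post.

```python
import socket


def open_monitor(host="127.0.0.1", port=0):
    """Auxiliary-program side: bind a UDP socket that receives pings.

    port=0 lets the OS pick a free port; read it back with getsockname().
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def send_ping(addr, payload=b"alive"):
    """Solver side: call periodically (e.g. once per time step)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, addr)


def wait_for_ping(sock, timeout_s):
    """Return True if a ping arrives within timeout_s seconds, else False."""
    sock.settimeout(timeout_s)
    try:
        sock.recvfrom(64)
        return True
    except socket.timeout:
        return False
```

The auxiliary program would call `wait_for_ping` in a loop and treat a `False` result as a suspected hang. The timeout needs to be comfortably longer than the slowest expected interval between pings, otherwise a slow step is mistaken for a deadlock.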

Re: [OMPI users] Restart after code hangs

2016-06-17 Thread Alex Kaiser
An outside monitor should work. My outline of the monitor script (with advice from the sys admin) has opportunities for bugs with environment variables and such. I wanted to make sure there was not a simpler solution, or one that is less error prone. Modifying the main routine which calls the

Re: [OMPI users] Restart after code hangs

2016-06-17 Thread Ralph Castain
Sadly, no - there was some possibility of using a file monitor we had for a while, but that isn’t in the 1.6 series. So I fear your best bet is to periodically output some kind of marker, and have a separate process that monitors to see if it is being updated. Either way would require modifying
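
A minimal sketch of the marker-plus-monitor approach described above, assuming the marker is a file whose modification time the solver refreshes periodically. The names `touch_marker`, `is_stale` and `watch` are hypothetical, and the sketch assumes that terminating the launcher process also tears down the MPI ranks, which should be verified on the cluster in question.

```python
import os
import subprocess
import time


def touch_marker(marker_path):
    """Solver side: refresh the marker's modification time."""
    with open(marker_path, "a"):
        os.utime(marker_path, None)


def is_stale(marker_path, timeout_s, now=None):
    """True if the marker is missing or older than timeout_s seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(marker_path)
    except OSError:
        return True
    return (now - mtime) > timeout_s


def watch(cmd, marker_path, timeout_s, poll_s=1.0, max_restarts=3):
    """Run cmd; if the marker goes stale, stop the job and start it again.

    Returns the exit code once the job finishes on its own.
    """
    restarts = 0
    while True:
        touch_marker(marker_path)  # reset the clock at each launch
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(poll_s)
            if is_stale(marker_path, timeout_s):
                # Assumption: SIGTERM to the launcher cleans up the ranks.
                proc.terminate()
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()
                    proc.wait()
                break
        else:
            return proc.returncode  # job exited without being stopped
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError("job hung too many times; giving up")
```

Here `cmd` would be the usual launch command as a list of arguments, and the solver would need an equivalent of `touch_marker` in its main loop. A hang and a crash are handled differently: a crash makes the job exit, so `watch` returns its exit code rather than restarting.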

Re: [OMPI users] Restart after code hangs

2016-06-17 Thread Alex Kaiser
Dear Dr. Correa, This is indeed the structure; it is a CFD program. Most of what you are suggesting is my current workflow, including saving, sending emails upon a crash and restarting. The problem is that the code does not crash but hangs. If it is deadlocked then it sits there spinning cycles

Re: [OMPI users] Restart after code hangs

2016-06-17 Thread Alex Kaiser
Dear Dr. Castain, I'm using 1.6.5, which is pre-built on NYU's cluster. Is there any other info which would be helpful? Partial output follows. Thanks, Alex

-bash-4.1$ ompi_info
Package: Open MPI l...@soho.es.its.nyu.edu Distribution
Open MPI: 1.6.5
...
C compiler family name: GNU
C compiler

Re: [OMPI users] Restart after code hangs

2016-06-16 Thread Gus Correa
Hi Alex

You know all this, but just in case ...

Restartable code goes like this:

* .start
read the initial/previous configuration from a file
...
final_step = first_step + nsteps
time_step = first_step
while ( time_step .le. final_step )
... march in time ...
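
The restart outline above might be fleshed out along these lines. This is a sketch only: the JSON checkpoint format, the `save_every` interval and the function names are assumptions for illustration, and the single `value` field stands in for the real solver state.

```python
import json
import os


def load_state(path):
    """Read the previous configuration, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "value": 0.0}


def save_state(path, state):
    """Write to a temporary file, then rename, so that a kill during the
    write does not leave a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, nsteps, save_every=10):
    state = load_state(path)
    first_step = state["step"]  # number of steps already completed
    final_step = first_step + nsteps
    time_step = first_step
    # "step" counts completed steps here, hence "<" rather than ".le.".
    while time_step < final_step:
        state["value"] += 1.0  # stand-in for "march in time"
        time_step += 1
        state["step"] = time_step
        if time_step % save_every == 0 or time_step == final_step:
            save_state(path, state)
    return state
```

Because every launch resumes from the last saved step, an external monitor can restart the job after a hang and lose at most `save_every` steps of work.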

Re: [OMPI users] Restart after code hangs

2016-06-16 Thread Ralph Castain
Which version of OMPI are you using?

> On Jun 16, 2016, at 2:25 PM, Alex Kaiser wrote:
>
> Hello,
>
> I have an MPI code which sometimes hangs, simply stops running. It is not
> clear why and it uses many large third party libraries so I do not want to
> try to fix it.

[OMPI users] Restart after code hangs

2016-06-16 Thread Alex Kaiser
Hello, I have an MPI code which sometimes hangs, simply stops running. It is not clear why and it uses many large third party libraries so I do not want to try to fix it. The code is easy to restart, but then it needs to be monitored closely by me, and I'd prefer to do it automatically. Is there