You know all this, but just in case ...
Restartable code goes like this:

    read the initial/previous configuration (including first_step) from a file
    final_step = first_step + nsteps
    time_step = first_step
    while ( time_step .lt. final_step )
        ... march in time ...
        write the configuration and time_step + 1 back to the file
        time_step = time_step + 1
Which version of OMPI are you using?
> On Jun 16, 2016, at 2:25 PM, Alex Kaiser wrote:
> I have an MPI code which sometimes hangs, simply stops running. It is not
> clear why and it uses many large third party libraries so I do not want to
> try to fix it.
I have an MPI code which sometimes hangs, simply stops running. It is not
clear why and it uses many large third party libraries so I do not want to
try to fix it. The code is easy to restart, but then it needs to be
monitored closely by me, and I'd prefer to do it automatically.
After reading a little of the FAQ on the methods Open MPI uses to deal with
memory registration (or pinning) on InfiniBand adapters, it seems that we
could avoid all the overhead and complexity of memory
registration/deregistration, registration cache access and update, memory
Thank you, Nathan. Since the default btl_openib_receive_queues setting is:
this would mean that, with max_qp = 392632 and 4 QPs above, the "actual" max
would be 392632 / 4 = 98158. Using this value in my prior
XRC support is greatly improved in 1.10.x and 2.0.0. It would be interesting
to see whether a newer version fixes the shutdown hang.
When calculating the required number of queue pairs you also have to divide by
the number of queue pairs in the btl_openib_receive_queues parameter.
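Putting the two messages together, the arithmetic is just a division; the numbers below are the ones quoted in this thread:

```python
# Effective queue-pair limit when btl_openib_receive_queues lists 4 QPs.
max_qp = 392632        # device-reported maximum number of queue pairs
qps_per_connection = 4 # QPs in the btl_openib_receive_queues parameter
effective_max = max_qp // qps_per_connection
print(effective_max)   # 98158
```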
Thank you for the suggestion. I tried your btl_openib_receive_queues setting
with a 4200+ core IMB job, and the job ran (great!). The shutdown of the job
took so long, though, that after 6 minutes I had to force-terminate it.
When I tried using this scheme before, with the