Re: [OMPI users] Restart after code hangs

2016-06-16 Thread Gus Correa
Hi Alex,

You know all this, but just in case ...

Restartable code goes like this:

* .start
read the initial/previous configuration from a file
...
final_step = first_step + nsteps
time_step = first_step
while ( time_step .le. final_step )
... march in time ...
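The outline above can be sketched in Python as follows. This is only an illustration of the restart pattern, not the code from the thread: the JSON checkpoint file, the `state` dictionary, the `run` function, and the one-line stand-in for "march in time" are all assumptions made for the example, and the step counting is a choice made here rather than something the message specifies.

```python
import json
import os


def run(checkpoint_path, nsteps):
    """Advance a toy simulation by nsteps, resuming from checkpoint_path if present."""
    # Read the initial/previous configuration from a file (hypothetical JSON format).
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "value": 0.0}

    first_step = state["step"]
    final_step = first_step + nsteps
    time_step = first_step
    while time_step < final_step:
        # Stand-in for the real "march in time" work.
        state["value"] += 1.0
        time_step += 1
        state["step"] = time_step

        # Write the checkpoint to a temporary file and rename it, so a crash
        # or hang in the middle of a write never leaves a corrupt checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state
```

Calling `run` a second time with the same checkpoint path continues from the last saved step instead of starting over, which is what makes an unattended restart safe.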

Re: [OMPI users] Restart after code hangs

2016-06-16 Thread Ralph Castain
Which version of OMPI are you using?

> On Jun 16, 2016, at 2:25 PM, Alex Kaiser wrote:
>
> Hello,
>
> I have an MPI code which sometimes hangs, simply stops running. It is not
> clear why and it uses many large third party libraries so I do not want to
> try to fix it.

[OMPI users] Restart after code hangs

2016-06-16 Thread Alex Kaiser
Hello, I have an MPI code which sometimes hangs: it simply stops running. It is not clear why, and it uses many large third-party libraries, so I do not want to try to fix it. The code is easy to restart, but then I need to monitor it closely, and I'd prefer to restart it automatically. Is there
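One way to automate this kind of monitoring is an external watchdog, sketched below in Python. It is not an Open MPI feature and not something proposed in the thread. It assumes the application updates a "heartbeat" file (for example its checkpoint file) as it makes progress; the function name `run_with_watchdog` and all of its parameters are hypothetical. If the file stops changing for `stall_s` seconds, the launcher process is terminated and started again.

```python
import os
import subprocess
import time


def run_with_watchdog(cmd, heartbeat_path, stall_s, max_restarts=3, poll_s=0.05):
    """Run cmd and restart it whenever heartbeat_path goes stale.

    Returns (exit_code, restarts) when cmd exits on its own.
    Raises RuntimeError if it stalls more than max_restarts times.
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        last_progress = time.time()
        while proc.poll() is None:
            time.sleep(poll_s)
            # Any newer modification time on the heartbeat file counts as progress.
            if os.path.exists(heartbeat_path):
                last_progress = max(last_progress, os.path.getmtime(heartbeat_path))
            if time.time() - last_progress > stall_s:
                # Ask the process to stop, then force it if it does not.
                proc.terminate()
                try:
                    proc.wait(timeout=5)
                except subprocess.TimeoutExpired:
                    proc.kill()
                    proc.wait()
                break
        else:
            # The process exited by itself, so no restart is needed.
            return proc.returncode, restarts
        if restarts >= max_restarts:
            raise RuntimeError("command stalled too many times")
        restarts += 1
```

In practice `cmd` would be the job's usual launch command and `stall_s` would be comfortably larger than the longest expected gap between checkpoints. Whether terminating the launcher also cleans up every rank depends on the launcher and the environment, so that part should be checked on the target system.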

[OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-16 Thread Audet, Martin
Hi, After reading a bit of the FAQ on the methods used by Open MPI to deal with memory registration (or pinning) with InfiniBand adapters, it seems that we could avoid all the overhead and complexity of memory registration/deregistration, registration cache access and update, memory management

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-16 Thread Sasso, John (GE Power, Non-GE)
Thank you, Nathan. Since the default btl_openib_receive_queues setting is:

P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

this would mean that, with max_qp = 392632 and the 4 QPs above, the "actual" max would be 392632 / 4 = 98158. Using this value in my prior
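The arithmetic in this message can be reproduced with a few lines of Python. This is only a worked check of the numbers quoted above, assuming (as the message does) that each colon-separated entry in btl_openib_receive_queues counts as one queue pair; the function names are made up for the example and are not part of Open MPI.

```python
def queue_count(receive_queues):
    """Count the colon-separated queue specifications in the setting string."""
    return len([spec for spec in receive_queues.split(":") if spec])


def effective_max(max_qp, receive_queues):
    """Divide the device's max_qp by the number of queues in the setting."""
    return max_qp // queue_count(receive_queues)


default_queues = (
    "P,128,256,192,128:S,2048,1024,1008,64:"
    "S,12288,1024,1008,64:S,65536,1024,1008,64"
)

print(queue_count(default_queues))            # 4
print(effective_max(392632, default_queues))  # 98158
```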

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-16 Thread Nathan Hjelm
XRC support is greatly improved in 1.10.x and 2.0.0. It would be interesting to see whether a newer version fixes the shutdown hang. When calculating the required number of queue pairs, you also have to divide by the number of queue pairs in the btl_openib_receive_queues parameter. Additionally Open

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-16 Thread Sasso, John (GE Power, Non-GE)
Nathan, Thank you for the suggestion. I tried your btl_openib_receive_queues setting with a 4200+ core IMB job, and the job ran (great!). The shutdown of the job took such a long time that after 6 minutes, I had to force-terminate the job. When I tried using this scheme before, with the