Re: [OMPI users] Restart after code hangs

2016-06-18 Thread Moody, Adam T.
Hello Alex, At LLNL, we use io-watchdog for this kind of capability: https://github.com/grondo/io-watchdog It's a library that you LD_PRELOAD, and it intercepts write calls on a particular rank. Whenever rank 0 issues a write() call, it updates a timer value that is also accessed by a thread. If the
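
The timer-plus-thread idea described in this message can be sketched roughly as follows. This is not io-watchdog's code or API; it is a minimal Python illustration of the general pattern only. The names Watchdog, kick and on_expire are hypothetical, and what happens on expiry (here, calling a user-supplied callback once) is an assumption, since the message is cut off before it says.

```python
import threading
import time


class Watchdog:
    """Hypothetical sketch: call on_expire(idle_seconds) once if kick()
    has not been called for longer than `timeout` seconds."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire          # assumed expiry action
        self._last = time.monotonic()       # the shared "timer value"
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)

    def start(self):
        self._thread.start()

    def kick(self):
        # Called wherever the application makes progress, e.g. after
        # each write of output on rank 0.
        with self._lock:
            self._last = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _watch(self):
        # Poll a few times per timeout period until stopped or expired.
        while not self._stop.wait(self.timeout / 4):
            with self._lock:
                idle = time.monotonic() - self._last
            if idle > self.timeout:
                self.on_expire(idle)
                return
```

In an application one would call kick() after each progress write and pass an on_expire callback that, for example, logs the stall or aborts the job so that it can be restarted.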

Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?

2016-06-18 Thread Jeff Squyres (jsquyres)
Greetings Martin. Such approaches have been discussed in the past. Indeed, I'm pretty sure that I've heard of some non-commodity systems / network stacks that do this kind of thing. Such approaches have not evolved in the commodity Linux space, however. This kind of support would need

Re: [OMPI users] Restart after code hangs

2016-06-18 Thread Cihan Altinay
How about sending a 'ping' to a socket periodically which is monitored by an auxiliary program that runs where the master process runs? Also, I know you don't want to delve into the third-party libs but have you actually tried to get to the bottom of the hang, e.g. run an strace, attach a
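
A rough sketch of the 'ping' idea, assuming UDP on the local host. The function names, the b"ping" payload and the choice of UDP are all illustrative assumptions, not something taken from this thread. The master process would call send_heartbeats (or send one datagram per iteration of its main loop), and the auxiliary program would call monitor, which returns once no ping has arrived within the timeout.

```python
import socket
import threading
import time


def send_heartbeats(addr, interval, count):
    """Send `count` UDP 'ping' datagrams to addr, one every `interval` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for _ in range(count):
            s.sendto(b"ping", addr)
            time.sleep(interval)


def monitor(sock, timeout):
    """Count pings on a bound UDP socket; return the count as soon as
    `timeout` seconds pass without receiving one."""
    sock.settimeout(timeout)
    received = 0
    while True:
        try:
            sock.recvfrom(16)
        except socket.timeout:
            # Silence for too long: the caller decides what to do next,
            # e.g. kill and restart the job.
            return received
        received += 1
```

The timeout should be comfortably larger than the ping interval so that ordinary scheduling jitter is not mistaken for a hang.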