Hello Alex,
At LLNL, we use io-watchdog for this kind of capability.
https://github.com/grondo/io-watchdog
It's a library that you LD_PRELOAD, and it intercepts write calls on a
particular rank. Whenever rank 0 issues a write() call, it updates a timer
value that is also accessed by a thread. If the th

Greetings Martin,
Such approaches have been discussed in the past. Indeed, I'm pretty sure that
I've heard of some non-commodity systems / network stacks that do this kind of
thing.
Such approaches have not evolved in the commodity Linux space, however. This
kind of support would need better

How about periodically sending a 'ping' to a socket that is monitored
by an auxiliary program running where the master process runs?
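To make the ping idea concrete, here is a minimal sketch in Python. It assumes a UDP socket on localhost; the function names (make_monitor, send_ping, wait_for_ping), the payload, and the timeouts are all made up for illustration and are not taken from any existing tool. The master process would call send_ping() from its main loop, and the auxiliary program would treat a missed timeout as a likely hang.

```python
import socket


def make_monitor(host="127.0.0.1", port=0):
    """Bind the UDP socket the auxiliary program listens on.

    Port 0 asks the OS for any free port; a real setup would
    presumably use a fixed, agreed-upon port instead.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def send_ping(addr, payload=b"ping"):
    """Called periodically by the master process to signal progress."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, addr)


def wait_for_ping(sock, timeout):
    """Return True if a ping arrives within `timeout` seconds, else False."""
    sock.settimeout(timeout)
    try:
        sock.recvfrom(64)
        return True
    except socket.timeout:
        return False


if __name__ == "__main__":
    monitor = make_monitor()
    addr = monitor.getsockname()

    send_ping(addr)                      # master is alive
    print(wait_for_ping(monitor, 1.0))   # a ping is pending

    # No further ping is sent, so the monitor times out here; this is
    # the point where the auxiliary program could kill or report the job.
    print(wait_for_ping(monitor, 0.2))
    monitor.close()
```

The timeout needs to be comfortably longer than the longest legitimate gap between pings, otherwise a slow but healthy phase of the job would be flagged as a hang.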
Also, I know you don't want to delve into the third-party libs but have
you actually tried to get to the bottom of the hang, e.g. run an strace,
attach a debu