On Mon, 2020-04-27 at 11:48 +0100, Jérémie Wenger via users wrote:
> Hi,
>
> I recently installed open mpi (4.0.3) using the procedure described
> here, as I'm trying to use Horovod for multiple gpu acceleration.
>
> I am looking for a way to handle a keyboard interrupt (save a deep
> learning model before shutting everything down). I posted a question
> here.
>
I have used SIGUSR1 for this: I wrote a signal handler in the
rank 0 program that does whatever is needed to save data and
shut down cleanly. The shutdown is coordinated with standard MPI
messages on an alternative communication channel that is
initialized for just this purpose. Other ranks test for messages
on this channel at suitable points where they can stop
gracefully.
Then you need to use kill to send the signal instead of Ctrl-C.
I also have a note in my code, which I never implemented, that
when running on a remote server some sort of socket protocol is
needed to initiate the shutdown instead of a signal.
George Reeke