Interesting. I'll try to take a look and see if I can reproduce today.
-- Josh
On Sep 14, 2009, at 4:54 PM, Jean Potsam wrote:
Hi Josh,
Thanks for the response. I am actually testing it on a
single node (though in the near future I will run it on a set of
nodes). Therefore, my application is running on the same machine as
mpirun.
When I run the application and trigger the checkpointing mechanism
from a separate terminal, it checkpoints fine.
However, when I try to checkpoint it from within the main program as
shown below, it hangs.
kind regards,
Jean
--- On Mon, 14/9/09, Josh Hursey <jjhur...@open-mpi.org> wrote:
From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs when checkpointing
application (update)
To: "Open MPI Users" <us...@open-mpi.org>
Date: Monday, 14 September, 2009, 1:27 PM
Is your application running on the same machine as mpirun?
How did you configure Open MPI? Note that this program will not work
without the FT thread enabled, which would be one reason why it
would seem to hang (since it is waiting for the application to enter
the MPI library):
--enable-ft-thread --enable-mpi-threads
I do not think the message that you saw is related. Often
orte_checkpoint cannot figure out the jobid on first contact with
the HNP/mpirun process, so this is displayed as an INVALID handle.
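
For illustration: system() blocks the calling thread until the command it runs has exited, so a single-threaded program cannot do anything else, including entering the MPI library, while that command is running. A minimal sketch, assuming a POSIX system, of starting a shell command without waiting for it and collecting its exit status later. The helper names spawn_detached and reap_child are hypothetical, and whether launching ompi-checkpoint this way avoids the hang described here is an assumption that has not been tested in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start `cmd` through /bin/sh without blocking the caller.
 * Returns the child's PID in the parent, or -1 if fork() failed. */
pid_t spawn_detached(const char *cmd)
{
    assert(cmd != NULL);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: replace this process image with a shell running cmd. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127); /* reached only if execl() itself failed */
    }
    return pid;
}

/* Hypothetical helper: wait for a child started by spawn_detached().
 * Returns its exit status, or -1 if it did not exit normally. */
int reap_child(pid_t pid)
{
    int status = 0;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a program like the one quoted below, the caller would keep the PID returned by spawn_detached(), continue with its own work, and call reap_child() once that work is done.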
-- Josh
On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:
>
> Hi Everyone,
> I noticed that it hangs just before displaying the following while trying to checkpoint the application.
>
> ############################
> [sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> ###############################
>
> Can it be related to the above?
>
> Thanks
>
>
>
----------------------------------------------------------------------------------------------------------------------
> Hi Everyone,
> I wrote a small program with a function to trigger the checkpointing mechanism as follows:
>
> ############################################
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <signal.h>
>
> void trigger_checkpoint(void);
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>
>     /* Request a checkpoint of this job from inside the application. */
>     trigger_checkpoint();
>
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("bye \n");
>
>     MPI_Finalize();
>     return 0;
> }
>
> /* Runs ompi-checkpoint against the local mpirun; system() blocks until it exits. */
> void trigger_checkpoint(void)
> {
>     printf("hi\n");
>     system("ompi-checkpoint -v `pidof mpirun` ");
> }
> #############################################
>
>
> The application works fine on my laptop with Ubuntu as the OS. However, when I tried running it on one of the machines at my uni, with SUSE Linux installed, the application hangs as soon as ompi-checkpoint is triggered. This is what I get:
>
>
>
> ##########################################################
> I am processor no 0 of a total of 1 procs
> hi
> I am processor no 0 of a total of 1 procs
> [sun06:15426] orte_checkpoint: Checkpointing...
> [sun06:15426] PID 15411
> [sun06:15426] Connected to Mpirun [[12727,0],0]
> [sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
> ###################################################
>
> Does anyone have any ideas about this?
>
> Thanks a lot
>
> Jean.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users