Thanks for responding.


#1 I am checkpointing the "wrapper" script (for the scheduler), which sets up 
the mpirun environment, builds the machinefile, etc., and then launches mpirun, 
which launches orted, which launches lmp_mpi ... This gave me an idea to check 
the BLCR documentation, which states:

" The '--tree' flag to 'cr_checkpoint' requests a checkpoint of the process 
with the&3 given pid, and all its descendants (excluding those who's parent has 
exited and thus become children of the 'init' process). " This is the default 
blcr > 0.6.0. I explicitly added this to make sure. So everything should be 
checkpointed on down.
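
For reference, this is roughly how I invoke the checkpoint; the pid and the 
checkpoint path below are just examples for one job, not the literal values:

    # checkpoint the wrapper and everything it spawned on this node
    # (mpirun -> orted -> lmp_mpi); pid and file name are examples only
    WRAPPER_PID=12345
    cr_checkpoint --tree --save-all \
        --file /sanscratch/checkpoints/612/chk.$WRAPPER_PID $WRAPPER_PID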



#2 & 3 will have to brood over that. maybe I can checkpoint my individual 
lmp_mpi processes directly....
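
Roughly what I have in mind there, just as a sketch (the pgrep pattern and the 
checkpoint directory are assumptions on my part):

    # checkpoint each lmp_mpi rank on this node into its own context file
    for pid in $(pgrep -u hmeij lmp_mpi); do
        cr_checkpoint --save-all --file /sanscratch/checkpoints/612/chk.$pid $pid
    done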



Serial invocations and restarts work just fine. I'll take this to the BLCR 
list; maybe they have an idea. As you can see below, a manual invocation yields 
the same result as going through the scheduler, with no messages from 
--kmsg-warning, as if everything were normal. I'll report back if I get this to 
work.



-Henk



[hmeij@cottontail ~]$ ssh petaltail /share/apps/blcr/0.8.5/test/bin/cr_restart 
--kmsg-warning --no-restore-pid --no-restore-pgid --no-restore-sid --relocate 
/sanscratch/612=/sanscratch/619 /sanscratch/checkpoints/612/chk.21839 &


[hmeij@cottontail sharptail]$ ssh petaltail ps -u hmeij
  PID TTY          TIME CMD
24123 ?        00:00:00 sshd
24124 ?        00:00:00 cr_restart
24156 ?        00:00:00 lava.openmpi.wr
24157 ?        00:00:28 mpirun
24176 ?        00:00:00 sshd
24177 ?        00:00:00 ps

________________________________
From: users [users-boun...@open-mpi.org] on behalf of George Bosilca 
[bosi...@icl.utk.edu]
Sent: Wednesday, March 23, 2016 12:27 PM
To: Open MPI Users
Subject: Re: [OMPI users] BLCR & openmpi

Both BLCR and Open MPI work just fine. Independently.

Checkpointing and restarting a parallel application is not as simple as mixing 
two tools together (especially when one of them is a communication library, 
i.e. MPI); they have to cooperate in order to achieve the desired goal of being 
able to continue the execution on another set of resources. Open MPI had 
support for C/R, but that feature has been lost.

1. It is not clear from your email what exactly you are checkpointing. Are you 
checkpointing the mpirun process, or are you checkpointing all the MPI 
processes?

2. What are you recovering? Assuming that you checkpoint your MPI processes 
(and not mpirun), what you can try to do during the recovery is to spawn a new 
set of MPI processes (that will give you new orteds) and then let each of these 
processes call the corresponding BLCR cr_restart.
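
Schematically, something along these lines; the wrapper name, the per-rank file 
naming, and the use of OMPI_COMM_WORLD_RANK are only illustrative, so adapt it 
to your setup:

    #!/bin/bash
    # restart_rank.sh -- launched as: mpirun -np <N> ./restart_rank.sh
    # Each freshly spawned MPI process replaces itself with the BLCR restart
    # of "its" checkpointed rank (the per-rank file naming is hypothetical).
    RANK=${OMPI_COMM_WORLD_RANK:?no rank in environment}
    exec cr_restart --no-restore-pid /sanscratch/checkpoints/612/chk.rank${RANK}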

3. This will not give you a working MPI environment, as the restarted processes 
will only know each other by their contact information from the original 
execution, and will be unable to connect to each other to resume 
communications. You will have to dig a little deeper into the code in order to 
achieve what you want/need.

  George.


On Wed, Mar 23, 2016 at 12:14 PM, Meij, Henk 
<hm...@wesleyan.edu> wrote:

So I've redone this with openmpi 1.10.2 and another piece of software (lammps 
16feb16) and get the same results.



Upon cr_restart I see the openlava_wrapper process and the mpirun process 
reappearing, but no orted and no lmp_mpi processes. No obvious error anywhere. 
I am using the --save-all feature from BLCR and ignoring pids.
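
To be precise about what I checked after the restart (a quick sketch of the 
verification, nothing fancy):

    # after cr_restart: the wrapper and mpirun are back
    pgrep -u hmeij -l mpirun
    # ...but these return nothing
    pgrep -u hmeij -l orted
    pgrep -u hmeij -l lmp_mpi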



Do BLCR and openmpi work together? Does anybody have any idea where to look?



-Henk



________________________________
From: Meij, Henk
Sent: Monday, March 21, 2016 12:24 PM
To: us...@open-mpi.org
Subject: RE: BLCR & openmpi


Hmm, I'm not correct: cr_restart starts with no errors, launches some of the 
processes, then suspends itself. An strace of mpirun on this manual invocation 
yields the same behavior as below.



-Henk



[hmeij@swallowtail kflaherty]$ ps -u hmeij
  PID TTY          TIME CMD
29481 ?        00:00:00 res
29485 ?        00:00:00 1458575067.384
29488 ?        00:00:00 1458575067.384.
29508 ?        00:00:00 cr_restart
29509 ?        00:00:00 blcr_watcher
29512 ?        00:00:02 lava.openmpi.wr
29514 ?        00:38:35 mpirun
30313 ?        00:00:01 sshd
30314 pts/1    00:00:00 bash
30458 ?        00:00:00 sleep
30483 ?        00:00:00 sleep
30650 pts/1    00:00:00 cr_restart
30652 pts/1    00:00:00 lava.openmpi.wr
30653 pts/1    00:00:00 mpirun
30729 pts/1    00:00:00 ps
[hmeij@swallowtail kflaherty]$ jobs
[1]+  Stopped                 cr_restart --no-restore-pid --no-restore-pgid 
--no-restore-sid --relocate /sanscratch/383=/sanscratch/000 
/sanscratch/checkpoints/383/chk.28244

________________________________
From: Meij, Henk
Sent: Monday, March 21, 2016 12:04 PM
To: us...@open-mpi.org
Subject: BLCR & openmpi


openmpi 1.2 (yes, I know, old), python 2.6.1, blcr 0.8.5



When I attempt to cr_restart (having performed cr_checkpoint --save-all), I can 
restart the job manually with blcr on a node. But when I go through my openlava 
scheduler, cr_restart launches mpirun, then nothing: no orted or the python 
processes that were running. The new scheduler job performing the restart puts 
in place the old machinefile and the stderr and stdout files. Here is what I 
see in an strace of mpirun.
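
(For completeness: I attached strace to the running mpirun roughly like this; 
the exact pid of course differs per run.)

    # attach to the mpirun started by the restart job
    strace -p $(pgrep -u hmeij mpirun)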



What problem is this pointing at?

Thanks,



-Henk



poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=6, events=POLLIN}, 
{fd=11, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, 
events=POLLIN}, {fd=10, events=POLLIN}], 8, 1000) = 8 ([{fd=5, 
revents=POLLNVAL}, {fd=4, revents=POLLNVAL}, {fd=6, revents=POLLNVAL}, {fd=11, 
revents=POLLNVAL}, {fd=7, revents=POLLNVAL}, {fd=8, revents=POLLNVAL}, {fd=9, 
revents=POLLNVAL}, {fd=10, revents=POLLNVAL}])
rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
sched_yield()                           = 0
rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
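
(For what it's worth, POLLNVAL in revents means poll() regards those 
descriptors as not open, so one thing I can still check, purely as a sketch, 
is what the restarted mpirun actually has open:)

    # list the open file descriptors of the restarted mpirun
    ls -l /proc/$(pgrep -u hmeij mpirun)/fd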





