Re: [OMPI users] BLCR & openmpi

2016-03-23 Thread Meij, Henk
mpirun
24176 ? 00:00:00 sshd
24177 ? 00:00:00 ps

From: users [users-boun...@open-mpi.org] on behalf of George Bosilca [bosi...@icl.utk.edu]
Sent: Wednesday, March 23, 2016 12:27 PM
To: Open MPI Users
Subject: Re: [OMPI users] BLCR & openmpi

Re: [OMPI users] BLCR & openmpi

2016-03-23 Thread George Bosilca
Both BLCR and Open MPI work just fine. Independently. Checkpointing and restarting a parallel application is not as simple as mixing two tools together (especially when one of them is a communication library, i.e. MPI); they have to cooperate in order to achieve the desired goal of being able to

Re: [OMPI users] BLCR & openmpi

2016-03-23 Thread Ralph Castain
I don’t believe checkpoint/restart is supported in OMPI past the 1.6 series. There was some attempt to restore it, but that person graduated prior to fully completing the work.

> On Mar 23, 2016, at 9:14 AM, Meij, Henk wrote:
>
> So I've redone this with openmpi 1.10.2

Re: [OMPI users] BLCR & openmpi

2016-03-23 Thread Meij, Henk
So I've redone this with openmpi 1.10.2 and another piece of software (lammps 16feb16) and get the same results. Upon cr_restart I see the openlava_wrapper process and the mpirun process reappearing, but no orted and no lmp_mpi processes. No obvious error anywhere. Using the --save-all feature from
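For anyone wanting to script the check described above (which processes came back after cr_restart and which did not), here is a minimal sketch. It assumes the default `ps -u <user>` column layout (PID TTY TIME CMD). The function name `missing_processes` and the sample listing are hypothetical and made up for illustration; they are not output from this thread.

```python
def missing_processes(ps_output, expected):
    """Return the names in `expected` that are absent from the CMD column
    of `ps -u <user>` style output (columns: PID TTY TIME CMD)."""
    running = set()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the "PID TTY TIME CMD" header row.
        if len(fields) < 4 or fields[0] == "PID":
            continue
        running.add(fields[-1])
    return [name for name in expected if name not in running]


# Illustrative listing only (made-up PIDs), shaped like the situation
# described: the wrapper and mpirun are back, the daemons and ranks are not.
sample = """\
  PID TTY          TIME CMD
 1001 ?        00:00:00 openlava_wrapper
 1002 ?        00:00:00 mpirun
 1003 ?        00:00:00 ps
"""

print(missing_processes(sample, ["mpirun", "orted", "lmp_mpi"]))
# prints ['orted', 'lmp_mpi']
```

In practice the text could come from `subprocess.run(["ps", "-u", user], capture_output=True, text=True).stdout` on the node where the restart was attempted.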

Re: [OMPI users] BLCR & openmpi

2016-03-21 Thread Meij, Henk
Hmm, I was not correct. cr_restart starts with no errors, launches some of the processes, then suspends itself. strace on mpirun on this manual invocation yields the same behavior as below.

-Henk

[hmeij@swallowtail kflaherty]$ ps -u hmeij
PID TTY TIME CMD
29481 ? 00:00:00

[OMPI users] BLCR & openmpi

2016-03-21 Thread Meij, Henk
openmpi 1.2 (yes, I know, old), python 2.6.1, blcr 0.8.5. When I attempt to cr_restart (having performed cr_checkpoint --save-all), I can restart the job manually with blcr on a node. But when I go through my openlava scheduler, cr_restart launches mpirun, then nothing. No orted or the python