Dear Open MPI Developers,
We've been working on using Torque's checkpoint/restart support, along with BLCR
and Open MPI's C/R support, to perform C/R on parallel jobs running under
Torque. The main issue here is that Open MPI requires the use of
ompi-checkpoint and ompi-restart commands to check
Ralph,
Looking good so far. I did notice that ompi-ps always seems to have an exit
code of 243. Is that on purpose?
Greg
On Jul 25, 2011, at 4:44 PM, Ralph Castain wrote:
> r24944 - let me know how it works!
>
>
> On Jul 25, 2011, at 1:01 PM, Greg Watson wrote:
>
>> That would probably be m
Hmmm...no, can't imagine why. I'll fix it - thanks!
On Jul 27, 2011, at 3:14 PM, Greg Watson wrote:
> Ralph,
>
> Looking good so far. I did notice that ompi-ps always seems to have an exit
> code of 243. Is that on purpose?
>
> Greg
>
> On Jul 25, 2011, at 4:44 PM, Ralph Castain wrote:
>
>> r2
Hmmm...I'm not seeing that behavior. I get a 0 exit code every time.
You'll get a 243 if there are stale session directories lying around, as it
indicates that the mpiruns in those dirs are not reachable. Perhaps that is
what's happening?
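
For anyone scripting around this, here is a rough Python sketch of how a wrapper might tell the two cases apart. It assumes only what is said above (0 on success, 243 when stale session directories leave mpiruns unreachable). The names classify_exit and run_and_classify are made up for illustration, and the child process is a stand-in so the sketch runs without Open MPI installed; substitute the real ompi-ps invocation as needed.

```python
import subprocess
import sys

# Assumption taken from the message above: 243 means some mpiruns were
# unreachable (stale session directories). Not a documented constant.
STALE_SESSION_EXIT = 243


def classify_exit(code):
    """Map an exit code to a short label (hypothetical helper)."""
    if code == 0:
        return "ok"
    if code == STALE_SESSION_EXIT:
        return "stale-session-dirs"
    return "other-error"


def run_and_classify(argv):
    """Run a command, discard its output, and classify its exit code."""
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode, classify_exit(result.returncode)


if __name__ == "__main__":
    # Stand-in child that exits with 243; replace with the real command line.
    stand_in = [sys.executable, "-c", "import sys; sys.exit(243)"]
    print(run_and_classify(stand_in))
```

A caller could treat "stale-session-dirs" as a hint to clean up old session directories rather than as a hard failure.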
On Jul 27, 2011, at 3:14 PM, Greg Watson wrote:
> Ral