Re: [OMPI users] How to restart a job twice

Tamer Fri, 18 Apr 2008 13:12:02 -0400

Hi Josh:

I am running on linux fedora core 7 kernel: 2.6.23.15-80.fc7


The machine is dual-core with shared memory so it's not even a cluster.

I downloaded r18208 and built it with the following options:

./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blcr=/usr/local/blcr


when I run mpirun I pass the following command:

mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760

I was able to checkpoint and restart successfully and was able tocheckpoint the restarted job (mpirun showed up with ps-efa |grepmpirun under r18208) but was unable to restart again; here's the errormessage:


mpi-restart ompi_global_snapshot_23865.ckpt

[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:Connection to lifeline [[45670,0],0] lost[dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)[dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:Connection to lifeline [[45670,0],0] lost[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:Connection to lifeline [[45670,0],0] lost[dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)[dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:Connection to lifeline [[45670,0],0] lost

--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 24012 on

node dhcp-119-202.caltech.edu exiting without calling "finalize". Thismay

have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

Thank you in advance for your help.

Tamer


On Apr 18, 2008, at 7:07 AM, Josh Hursey wrote:

This problem has come up in the past and may have been fixed since
r14519. Can you update to r18208 and see if the error still occurs?

A few other questions that will help me try to reproduce the problem.
Can you tell me more about the configuration of the system you are
running on (number of machines, if there is a resource manager)? How
did you configure Open MPI and what command line options are you
passing to 'mpirun'?

-- Josh

On Apr 18, 2008, at 9:36 AM, Tamer wrote:

Thanks Josh, I tried what you suggested with my existing r14519,and Iwas able to checkpoint the restarted job but was never able torestart

it. I looked up the PID for 'orterun' and checkpointed the restarted
job but when I try to restart from that point I get the following
error:

ompi-restart ompi_global_snapshot_7704.ckpt
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:
Connection to lifeline [[61851,0],0] lost
--------------------------------------------------------------------------
orterun has exited due to process rank 1 with PID 7737 on

node dhcp-119-202.caltech.edu exiting without calling "finalize".This

may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).

Do I have to run the copenmpi clean command after the firstcheckpoint

and before restarting the checkpointed job so I can checkpoint it
again or is there something I am missing in this version completely
and I would have to go to r18208? Thank you in advance for your help.

Tamer

On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:

When you use 'ompi-restart' to restart a job it fork/execs the
completely new job using the restarted processes for the ranks.
However instead of calling the 'mpirun' process ompi-restart
currently

calls 'orterun'. These two programs are exactly the same (mpirunis a

symbolic link to orterun). So if you look for the PID of 'orterun'
that can be used to checkpoint the process.

However it is confusing that Open MPI makes this switch. So I

committed in r18208 a fix for this that uses the 'mpirun' binaryname

instead of the 'orterun' binary name. This fits with the typical use
case of checkpoint/restart in Open MPI in which users expect to find

the 'mpirun' process on restart instead of the lesser known'orterun'

process.

Sorry for the confusion.

Josh

On Apr 18, 2008, at 1:14 AM, Tamer wrote:

Dear all, I installed the developer's version r14519 and was ableto

get it running. I successfully checkpointed a parallel job and

restarted it. My question is how can I checkpoint the restartedjob?

The problem is once the original job is terminated and restarted
later
on, the mpirun does not exist anymore (ps -efa|grep mpirun) and
hence
I do not know which PID I should use when I run the ompi-checkpoint
on
the restarted job. Any help would be greatly appreciated.

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] How to restart a job twice

Reply via email to