I am having the same problem when I try to checkpoint manually: "HNP with PID xxxx Not found!", even though I am sure I am passing the right PID.
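To rule out a wrong PID before calling ompi-checkpoint, something like the following can be used. This is only a minimal sketch, assuming a Linux node with /proc mounted; the helper names (parse_cmdline, looks_like_mpirun, describe_pid) are made up for illustration and are not part of Open MPI. It only confirms that the PID is an mpirun (or its aliases orterun/mpiexec) owned by the current user; it does not explain the "HNP ... Not found!" message itself.

```python
#!/usr/bin/env python3
"""Sanity-check a PID before passing it to ompi-checkpoint (Linux only)."""
import os
import sys

# mpirun and mpiexec are aliases of orterun in Open MPI.
LAUNCHER_NAMES = ("mpirun", "orterun", "mpiexec")


def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into a list of str."""
    return [arg.decode(errors="replace") for arg in raw.split(b"\x00") if arg]


def looks_like_mpirun(argv):
    """Return True if the executable name is one of the Open MPI launchers."""
    return bool(argv) and os.path.basename(argv[0]) in LAUNCHER_NAMES


def describe_pid(pid):
    """Report whether pid is an mpirun owned by the user running this script."""
    proc_dir = "/proc/%d" % pid
    try:
        with open(proc_dir + "/cmdline", "rb") as fh:
            argv = parse_cmdline(fh.read())
        owner_uid = os.stat(proc_dir).st_uid
    except OSError:
        # Covers a missing process as well as permission problems.
        return "PID %d: no such process (or not readable)" % pid
    kind = "mpirun" if looks_like_mpirun(argv) else "NOT mpirun"
    owner = "same user" if owner_uid == os.getuid() else "different user"
    return "PID %d: %s, %s: %s" % (pid, kind, owner, " ".join(argv))


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(describe_pid(int(arg)))
```

Run it on the node where mpirun lives, with the PID as the only argument (the script file name is up to you). If it reports "mpirun, same user" and ompi-checkpoint still fails, the PID is not the problem.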
--- On Mon, 11/2/09, Sergio Díaz <sd...@cesga.es> wrote:

From: Sergio Díaz <sd...@cesga.es>
Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
To: "Open MPI Users" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Monday, November 2, 2009, 6:43 PM

Hi again,

I found a C program to test ompi-checkpoint/restart and it works fine. The program was written by Alan Woodland and shared on the following mailing list: debian-bugs-d...@lists.debian.org

This program starts a countdown from 10 to 0 and, when the countdown reaches 6, takes a checkpoint, kills the process and restarts it.

However, I still have the problem when I try to checkpoint by hand directly on a node.

Any ideas? :-(

Best regards,
Sergio

Sergio Díaz wrote:
> Hello,
>
> I have managed to checkpoint a simple program without SGE. Now I'm
> trying to integrate openmpi+sge, but I have some problems... When I
> try to checkpoint the mpirun PID, I get an error similar to the one
> I get when the PID doesn't exist. See the example below.
>
> Any ideas?
> Does somebody have a script to do it automatically with SGE? For example,
> I have one that takes a checkpoint every X seconds with BLCR for non-MPI
> jobs. It is launched by SGE if you have configured the queue and the ckpt
> environment.
>
> Is it possible to choose the name of the ckpt folder when you run
> ompi-checkpoint? I can't find the option to do it.
>
>
> Regards,
> Sergio
>
>
> --------------------------------
>
> [sdiaz@compute-3-17 ~]$ ps auxf
> ....
> root  20044 0.0 0.0  4468 1224 ? S  13:28 0:00 \_ sge_shepherd-2645150 -bg
> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
> sdiaz 20112 0.2 0.0 41028 2480 ? S  13:28 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00 \_ pi3
>
>
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
> --------------------------------------------------------------------------
> ompi-checkpoint PID_OF_MPIRUN
> Open MPI Checkpoint Tool
>
>    -am <arg0>            Aggregate MCA parameter set file list
>    -gmca|--gmca <arg0> <arg1>
>                          Pass global MCA parameters that are applicable to
>                          all contexts (arg0 is the parameter name; arg1 is
>                          the parameter value)
>    -h|--help             This help message
>    --hnp-jobid <arg0>    This should be the jobid of the HNP whose
>                          applications you wish to checkpoint.
>    --hnp-pid <arg0>      This should be the pid of the mpirun whose
>                          applications you wish to checkpoint.
>    -mca|--mca <arg0> <arg1>
>                          Pass context-specific MCA parameters; they are
>                          considered global if --gmca is not used and only
>                          one context is specified (arg0 is the parameter
>                          name; arg1 is the parameter value)
>    -s|--status           Display status messages describing the progression
>                          of the checkpoint
>    --term                Terminate the application after checkpoint
>    -v|--verbose          Be Verbose
>    -w|--nowait           Do not wait for the application to finish
>                          checkpointing before returning
>
> --------------------------------------------------------------------------
> [sdiaz@compute-3-17 ~]$ exit
> logout
> Connection to c3-17 closed.
> [sdiaz@svgd mpi_test]$ ssh c3-18
> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
> -bash-3.00$ ps auxf |grep sdiaz
>
> sdiaz 14412 0.0 0.0  1888  560 ? Ss 13:28 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
> sdiaz 14419 0.0 0.0 35728 2260 ? S  13:28 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28 0:00 \_ pi3
>
>
> --
> Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: sd...@cesga.es ; http://www.cesga.es/
> ------------------------------------------------
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users