I am having the same problem when I checkpoint manually: ompi-checkpoint reports
"HNP with PID xxxx Not found!", even though I am sure I passed the right PID.
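
One quick sanity check before calling ompi-checkpoint by hand is to confirm that
the PID is still alive and belongs to you. The sketch below is only an
illustration: the function name checkpoint_mpirun and the CKPT_CMD variable are
made up for this example (CKPT_CMD allows a dry run with echo), and the check
will not by itself cure the "HNP with PID ... Not found!" error when the PID is
already correct.

```shell
#!/bin/sh
# Hypothetical helper: confirm a PID is alive before asking Open MPI to
# checkpoint it. CKPT_CMD is an assumed override, useful for dry runs.
CKPT_CMD="${CKPT_CMD:-ompi-checkpoint}"

checkpoint_mpirun() {
    pid="$1"
    # kill -0 delivers no signal; it only tests that the process exists
    # and that the current user is allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "PID $pid is not running or is not owned by this user" >&2
        return 1
    fi
    # -s asks for status messages, as listed in the ompi-checkpoint help text.
    "$CKPT_CMD" -s "$pid"
}
```

For example, "checkpoint_mpirun 20112" would run "ompi-checkpoint -s 20112" once
the liveness check passes.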

--- On Mon, 11/2/09, Sergio Díaz <sd...@cesga.es> wrote:

From: Sergio Díaz <sd...@cesga.es>
Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
To: "Open MPI Users" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Monday, November 2, 2009, 6:43 PM

Hi again,

I found a C program to test ompi-checkpoint/restart, and it works fine. The
program was written by Alan Woodland and shared on the following mailing
list: debian-bugs-d...@lists.debian.org
It counts down from 10 to 0; when the count reaches 6, it takes a
checkpoint, kills the process, and restarts it.

However, I still have the problem when I try to checkpoint by hand
directly on a node.

Any ideas? :-(

Best regards
Sergio



Sergio Díaz wrote:
> Hello,
> 
> I have managed to checkpoint a simple program without SGE. Now I'm
> trying to integrate openmpi+sge, but I have some problems. When I try to
> checkpoint the mpirun PID, I get an error similar to the one I get when
> the PID doesn't exist. See the example below.
> 
> Any ideas?
> Does anybody have a script to do it automatically with SGE? For example, I
> have one that checkpoints every X seconds with BLCR for non-MPI jobs. SGE
> launches it if the queue and the ckpt environment are configured.
> 
> Is it possible to choose the name of the ckpt folder when running
> ompi-checkpoint? I can't find an option for it.
> 
> 
> Regards,
> Sergio
> 
> 
> --------------------------------
> 
> [sdiaz@compute-3-17 ~]$ ps auxf
> ....
> root     20044  0.0  0.0  4468 1224 ?        S    13:28   0:00  \_ sge_shepherd-2645150 -bg
> sdiaz    20072  0.0  0.0 53172 1212 ?        Ss   13:28   0:00      \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
> sdiaz    20112  0.2  0.0 41028 2480 ?        S    13:28   0:00          \_ mpirun -np 2 -am ft-enable-cr pi3
> sdiaz    20113  0.0  0.0 36484 1824 ?        Sl   13:28   0:00              \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
> sdiaz    20116  1.2  0.0 99464 4616 ?        Sl   13:28   0:00              \_ pi3
> 
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
> [compute-3-17.local:20124] HNP with PID 20112 Not found!
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
> [compute-3-17.local:20135] HNP with PID 20112 Not found!
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
> [compute-3-17.local:20136] HNP with PID 20112 Not found!
> 
> [sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
> --------------------------------------------------------------------------
> ompi-checkpoint PID_OF_MPIRUN
>   Open MPI Checkpoint Tool
> 
>    -am <arg0>            Aggregate MCA parameter set file list
>    -gmca|--gmca <arg0> <arg1>
>                          Pass global MCA parameters that are applicable to
>                          all contexts (arg0 is the parameter name; arg1 is
>                          the parameter value)
> -h|--help                This help message
>    --hnp-jobid <arg0>    This should be the jobid of the HNP whose
>                          applications you wish to checkpoint.
>    --hnp-pid <arg0>      This should be the pid of the mpirun whose
>                          applications you wish to checkpoint.
>    -mca|--mca <arg0> <arg1>
>                          Pass context-specific MCA parameters; they are
>                          considered global if --gmca is not used and only
>                          one context is specified (arg0 is the parameter
>                          name; arg1 is the parameter value)
> -s|--status              Display status messages describing the progression
>                          of the checkpoint
>    --term                Terminate the application after checkpoint
> -v|--verbose             Be Verbose
> -w|--nowait              Do not wait for the application to finish
>                          checkpointing before returning
> 
> --------------------------------------------------------------------------
> [sdiaz@compute-3-17 ~]$ exit
> logout
> Connection to c3-17 closed.
> [sdiaz@svgd mpi_test]$ ssh c3-18
> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
> -bash-3.00$ ps auxf |grep sdiaz
> 
> sdiaz    14412  0.0  0.0  1888  560 ?        Ss   13:28   0:00      \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
> sdiaz    14419  0.0  0.0 35728 2260 ?        S    13:28   0:00          \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
> sdiaz    14420  0.0  0.0 99452 4596 ?        Sl   13:28   0:00              \_ pi3
> 
> 
> 
> 
> 
> -- Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: sd...@cesga.es ; http://www.cesga.es/
> 
> ------------------------------------------------
> ------------------------------------------------------------------------
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users





      
