Many thanks!,
Sergio
P.S. Sorry about these long emails. I just try to show you
useful information to identify my problems.
ERROR 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
> [sdiaz@compute-3-18 ~]$ ompi-restart
ompi_global_snapshot_28454.ckpt
>
-----------------------------------------------------------------
---------
> Error: Unable to obtain the proper restart command to
restart from the
> checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
>
-----------------------------------------------------------------
---------
>
-----------------------------------------------------------------
---------
> Error: Unable to obtain the proper restart command to
restart from the
> checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
>
-----------------------------------------------------------------
---------
> [compute-3-18:28792] *** Process received signal ***
> [compute-3-18:28792] Signal: Segmentation fault (11)
> [compute-3-18:28792] Signal code: (128)
> [compute-3-18:28792] Failing at address: (nil)
> [compute-3-18:28792] [ 0] /lib64/tls/libpthread.so.0
[0x33bbf0c430]
> [compute-3-18:28792] [ 1] /lib64/tls/libc.so.6(__libc_free
+0x25) [0x33bb669135]
> [compute-3-18:28792] [ 2] /opt/cesga/openmpi-1.3.3/lib/
libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28792] [ 3] /opt/cesga/openmpi-1.3.3/lib/
libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28792] [ 4] /opt/cesga/openmpi-1.3.3/lib/
libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28792] [ 5] opal-restart [0x40312a]
> [compute-3-18:28792] [ 6] /lib64/tls/libc.so.6
(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28792] [ 7] opal-restart [0x40272a]
> [compute-3-18:28792] *** End of error message ***
> [compute-3-18:28793] *** Process received signal ***
> [compute-3-18:28793] Signal: Segmentation fault (11)
> [compute-3-18:28793] Signal code: (128)
> [compute-3-18:28793] Failing at address: (nil)
> [compute-3-18:28793] [ 0] /lib64/tls/libpthread.so.0
[0x33bbf0c430]
> [compute-3-18:28793] [ 1] /lib64/tls/libc.so.6(__libc_free
+0x25) [0x33bb669135]
> [compute-3-18:28793] [ 2] /opt/cesga/openmpi-1.3.3/lib/
libopen-pal.so.0(opal_argv_free+0x2e) [0x2a95586658]
> [compute-3-18:28793] [ 3] /opt/cesga/openmpi-1.3.3/lib/
libopen-pal.so.0(opal_event_fini+0x1e) [0x2a9557906e]
> [compute-3-18:28793] [ 4] /opt/cesga/openmpi-1.3.3/lib/
libopen-pal.so.0(opal_finalize+0x36) [0x2a9556bcfa]
> [compute-3-18:28793] [ 5] opal-restart [0x40312a]
> [compute-3-18:28793] [ 6] /lib64/tls/libc.so.6
(__libc_start_main+0xdb) [0x33bb61c3fb]
> [compute-3-18:28793] [ 7] opal-restart [0x40272a]
> [compute-3-18:28793] *** End of error message ***
>
-----------------------------------------------------------------
---------
> mpirun noticed that process rank 0 with PID 28792 on node
compute-3-18.local exited on signal 11 (Segmentation fault).
>
-----------------------------------------------------------------
---------
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
ERROR 2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
> [sdiaz@compute-3-18 ~]$ ompi-restart -v
ompi_global_snapshot_28454.ckpt
>[compute-3-18.local:28941] Checking for the existence of (/
home/cesga/sdiaz/ompi_global_snapshot_28454.ckpt)
> [compute-3-18.local:28941] Restarting from file
(ompi_global_snapshot_28454.ckpt)
> [compute-3-18.local:28941] Exec in self
> .......
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
ERROR3
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>[sdiaz@compute-3-18 ~]$ ompi_info --all|grep "plm_rsh_agent"
> How many plm_rsh_agent instances to invoke
concurrently (must be > 0)
> MCA plm: parameter "plm_rsh_agent" (current value:
"ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
ERROR4
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>/usr/bin/ssh -x compute-3-17.local orted --debug-daemons -
mca ess env -mca orte_ess_jobid 2152464384 -mca orte_ess_vpid
1 -mca orte_ess_num_procs 2 --hnp-uri >"2152464384.0;tcp://
192.168.4.143:59176" -mca mca_base_param_file_prefix ft-enable-
cr -mca mca_base_param_file_path >/opt/cesga/openmpi-1.3.3/
share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/
mpi_test -mca mca_base_param_file_path_force /home_no_usc/
cesga/sdiaz/mpi_test
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Josh Hursey escribió:
On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
Hi Josh,
The OpenMPI version is 1.3.3.
The command ompi-ps doesn't work.
[root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
[root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
[compute-3-18.local:16254] orte_ps: Acquiring list of HNPs
and setting contact info into RML...
[root@compute-3-18 ~]# ompi-ps -v -j 2726959
[compute-3-18.local:16255] orte_ps: Acquiring list of HNPs
and setting contact info into RML...
[root@compute-3-18 ~]# ps uaxf | grep sdiaz
root 16260 0.0 0.0 51084 680 pts/0 S+ 13:38
0:00 \_ grep sdiaz
sdiaz 16203 0.0 0.0 53164 1220 ? Ss 13:37
0:00 \_ -bash /opt/cesga/sge62/default/spool/
compute-3-18/job_scripts/2726959
sdiaz 16241 0.0 0.0 41028 2480 ? S 13:37
0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
sdiaz 16242 0.0 0.0 36484 1840 ? Sl 13:37
0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
inherit -nostdin -V compute-3-17.local orted -mca ess env -
mca orte_ess_jobid 2769879040 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 2 --hnp-uri "2769879040.0;tcp://
192.168.4.143:57010" -mca mca_base_param_file_prefix ft-
enable-cr -mca mca_base_param_file_path /opt/cesga/
openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/
cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /
home_no_usc/cesga/sdiaz/mpi_test
sdiaz 16245 0.1 0.0 99464 4616 ? Sl 13:37
0:00 \_ ./pi3
[root@compute-3-18 ~]# ompi-ps -n c3-18
[root@compute-3-18 ~]# ompi-ps -n compute-3-18
[root@compute-3-18 ~]# ompi-ps -n
There is not directory on the /tmp of the node. However, if
the application is run without SGE, the directory is created
This may be the core of the problem. ompi-ps and other
command line tools (e.g., ompi-checkpoint) look for the Open
MPI session directory in /tmp in order to find the connection
information to connect to the mpirun process (internally
called the HNP or Head Node Process).
Can you change the location of the temporary directory in
SGE? The temporary directory is usually set via an
environment variable (e.g., TMPDIR, or TMP). So removing the
environment variable or setting it to /tmp might help.
but if I do ompi-ps -j MPIRUN_PID, it seems hanged and I
interrupt it. Does it take long time?
It should not take a long time. It is just querying the
mpirun process for state information.
what means the option -j of ompi-ps command? isn't it
related to a batch system(like sge, condor...), is it?
The '-j' option allows the user to specify the Open MPI
jobid. This is completely different than the jobid provided
by the batch system. In general, users should not need to
specify the -j option. It is useful when you have multiple
Open MPI jobs, and want a summary of just one of them.
Thanks for the ticket. I will follow it.
Talking with Alan, I realized that there are few transport
protocols that are supported. And maybe it is the problem.
Currently, SGE is using qrsh to expand mpi process. I can
change this protocol and use ssh. So, I'm going to test it
this afternoon and I will comment to you the results.
Try 'ssh' and see if that helps. I suspect the problem is
with the session directory location though.
Regards,
Sergio
Josh Hursey escribió:
On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
Hello,
I have achieved the checkpoint of an easy program without
SGE. Now, I'm trying to do the integration openmpi+sge but
I have some problems... When I try to do checkpoint of the
mpirun PID, I got an error similar to the error gotten
when the PID doesn't exit. The example below.
I do not have any experience with the SGE environment, so I
suspect that there may something 'special' about the
environment that is tripping up the ompi-checkpoint tool.
First of all, what version of Open MPI are you using?
Somethings to check:
- Does 'ompi-ps' work when your application is running?
- Is there an /tmp/openmpi-sessions-* directory on the
node where mpirun is currently running? This directory
contains information on how to connect to the mpirun
process from an external tool, if it's missing then this
could be the cause of the problem.
Any ideas?
Somebody have a script to do it automatic with SGE?. For
example I have one to do checkpoint each X seconds with
BLCR and non-mpi jobs. It is launched by SGE if you have
configured the queue and the ckpt environment.
I do not know of any integration of the Open MPI
checkpointing work with SGE at the moment.
As far as time triggered checkpointing, I have a feature
ticket open about this:
https://svn.open-mpi.org/trac/ompi/ticket/1961
It is not available yet, but in the works.
Is it possible choose the name of the ckpt folder when you
do the ompi-checkpoint? I can't find the option to do it.
Not at this time. Though I could see it as a useful
feature, and shouldn't be too hard to implement. I filed a
ticket if you want to follow the progress:
https://svn.open-mpi.org/trac/ompi/ticket/2098
-- Josh
Regards,
Sergio
--------------------------------
[sdiaz@compute-3-17 ~]$ ps auxf
....
root 20044 0.0 0.0 4468 1224 ? S 13:28
0:00 \_ sge_shepherd-2645150 -bg
sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28
0:00 \_ -bash /opt/cesga/sge62/default/spool/
compute-3-17/job_scripts/2645150
sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -
inherit -nostdin -V compute-3-18..........
sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
0:00 \_ pi3
[sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
[compute-3-17.local:20124] HNP with PID 20112 Not found!
[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
[compute-3-17.local:20135] HNP with PID 20112 Not found!
[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
[compute-3-17.local:20136] HNP with PID 20112 Not found!
[sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
-------------------------------------------------------------
-------------
ompi-checkpoint PID_OF_MPIRUN
Open MPI Checkpoint Tool
-am <arg0> Aggregate MCA parameter set file
list
-gmca|--gmca <arg0> <arg1>
Pass global MCA parameters that
are applicable to
all contexts (arg0 is the
parameter name; arg1 is
the parameter value)
-h|--help This help message
--hnp-jobid <arg0> This should be the jobid of the
HNP whose
applications you wish to checkpoint.
--hnp-pid <arg0> This should be the pid of the
mpirun whose
applications you wish to checkpoint.
-mca|--mca <arg0> <arg1>
Pass context-specific MCA
parameters; they are
considered global if --gmca is
not used and only
one context is specified (arg0 is
the parameter
name; arg1 is the parameter value)
-s|--status Display status messages
describing the progression
of the checkpoint
--term Terminate the application after
checkpoint
-v|--verbose Be Verbose
-w|--nowait Do not wait for the application
to finish
checkpointing before returning
-------------------------------------------------------------
-------------
[sdiaz@compute-3-17 ~]$ exit
logout
Connection to c3-17 closed.
[sdiaz@svgd mpi_test]$ ssh c3-18
Last login: Wed Oct 28 13:24:12 2009 from svgd.local
-bash-3.00$ ps auxf |grep sdiaz
sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28
0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/
qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/
active_jobs/2645150.1/1.compute-3-18
sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
0:00 \_ orted -mca ess env -mca orte_ess_jobid
2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
--hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca
mca_base_param_file_prefix ft-enable-cr -mca
mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/
openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -
mca mca_base_param_file_path_force /home_no_usc/cesga/
sdiaz/mpi_test
sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
0:00 \_ pi3
--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de
Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/
<image002.jpg>
------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela
(Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/
<image002.jpg>
------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela
(Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/
<image002.jpg>
------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users